The page lets users create a new DBT project and select the operator that the orchestrator will use to execute models. Models can be run on an existing Kubernetes cluster with the KubernetesPodOperator, on a Google Kubernetes Engine cluster with the GKEStartPodOperator, or by calling the DBT API server.
The choice between the GKEStartPodOperator and the KubernetesPodOperator depends on your specific use case and the environment in which your Kubernetes clusters are deployed. If you are using Google Kubernetes Engine (GKE) and want seamless integration with Google Cloud Platform (GCP), the GKEStartPodOperator may be the preferred option. However, if you need more flexibility to target different Kubernetes clusters or environments, the KubernetesPodOperator provides a more generic solution.
The choice between the DBT API and the Kubernetes operators depends on your specific use case and requirements. The DBT API server allows you to interact with DBT via HTTP requests, while the KubernetesPodOperator executes DBT commands directly within a Kubernetes pod. The KubernetesPodOperator leverages Kubernetes for container orchestration, allowing you to scale and manage resources dynamically based on workload demands. If you need fine-grained control over resource allocation, scaling, and scheduling of DBT jobs, running DBT commands as Kubernetes pods may be more appropriate. Kubernetes provides powerful features for resource management, auto-scaling, and fault tolerance, which can be beneficial for large-scale or mission-critical deployments.
Kubernetes Operators in Apache Airflow provide a powerful mechanism for running and managing Airflow tasks on Kubernetes, leveraging Kubernetes' scalability, flexibility, and resource management capabilities. The KubernetesPodOperator runs each task as a Kubernetes pod, letting you define tasks that execute containerized workloads and giving you flexibility in managing dependencies and orchestrating complex workflows. With the KubernetesExecutor, Apache Airflow can dynamically scale the number of worker pods based on the workload, allowing it to handle varying task loads efficiently and distribute work across multiple pods in a Kubernetes cluster.
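A minimal sketch of an Airflow task that runs dbt through the KubernetesPodOperator (the image, namespace, and model selector are illustrative assumptions, and the import path differs slightly across cncf-kubernetes provider versions):

```python
from datetime import datetime

from airflow import DAG
# Older provider versions expose this operator from ...operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="dbt_run_on_kubernetes",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Each dbt invocation runs in its own short-lived pod.
    run_dbt_models = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="dbt",                            # assumed namespace
        image="ghcr.io/example/dbt:latest",         # assumed image containing the dbt project
        cmds=["dbt"],
        arguments=["run", "--select", "my_model"],  # assumed model selector
        get_logs=True,                              # stream pod logs into the Airflow task log
        is_delete_operator_pod=True,                # clean up the pod after the task finishes
    )
```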
Apache Airflow organizes tasks into queues, and each queue can have its own set of worker pods responsible for executing tasks. The KubernetesExecutor is one of the available executors in Airflow, and it works by spinning up Kubernetes pods to run Airflow tasks.
When a task is scheduled to run, the KubernetesExecutor checks the current workload and determines whether additional worker pods are needed to handle the task. If the existing worker pods are busy or if there is a backlog of tasks in the queue, the KubernetesExecutor dynamically creates new worker pods to scale up the pool of available resources.
The KubernetesExecutor leverages Kubernetes' native resource management capabilities to allocate resources (CPU, memory) to the newly created worker pods. This ensures that each pod has the necessary resources to execute tasks efficiently without overloading the underlying Kubernetes cluster.
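For illustration, a single task can be given explicit CPU and memory for its worker pod through the KubernetesExecutor's pod-override mechanism; the resource values below are placeholders:

```python
from kubernetes.client import models as k8s

# Per-task pod override for the KubernetesExecutor; the request/limit values are placeholders.
resource_override = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the main Airflow worker container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )
            ]
        )
    )
}

# Attach it to any task via its executor_config argument, e.g.
# PythonOperator(task_id="build", python_callable=build, executor_config=resource_override)
```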
Once the worker pods are up and running, they begin executing tasks assigned to them. Kubernetes manages the lifecycle of these pods, including scheduling, scaling, and termination, based on the workload and resource availability.
Similarly, when the workload decreases or when tasks are completed, the KubernetesExecutor may scale down the number of worker pods to optimize resource utilization. Unused worker pods are gracefully terminated, freeing up resources for other tasks or applications running on the Kubernetes cluster.
The dynamic scaling behavior of the KubernetesExecutor can be configured in Airflow's configuration file (airflow.cfg). We can specify parameters such as the minimum and maximum number of worker pods, scaling policies, and resource constraints to tailor the scaling behavior to a specific use case and workload requirements.
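As a sketch (section and option names vary between Airflow versions, and the values are placeholders rather than recommendations), the relevant settings might look like:

```
[core]
executor = KubernetesExecutor
# Upper bound on concurrently running task instances; effectively caps the worker pod pool.
parallelism = 32

# This section is named [kubernetes] in older Airflow releases.
[kubernetes_executor]
namespace = airflow
worker_container_repository = ghcr.io/example/airflow-worker
worker_container_tag = latest
delete_worker_pods = True
worker_pods_creation_batch_size = 8
```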
The GKEStartPodOperator in Apache Airflow allows you to start a Kubernetes pod in a Google Kubernetes Engine (GKE) cluster, Google Cloud's managed Kubernetes service. GKE abstracts away the complexity of managing Kubernetes clusters, allowing you to focus on deploying and running your applications. Similar to the KubernetesPodOperator, the GKEStartPodOperator provides flexibility in defining and configuring the Kubernetes pod that you want to start.
We can specify pod resources, environment variables, volumes, and other Kubernetes pod settings.
With the GKEStartPodOperator, we define tasks in Airflow DAGs that start Kubernetes pods on GKE, executing containerized workloads in the scalable, managed Kubernetes environment that GKE provides.
The GKEStartPodOperator handles authentication with Google Cloud Platform (GCP) to access GKE resources. It uses Google Application Default Credentials (ADC) or service account credentials to authenticate with GCP.
We need to specify configuration parameters for the GKEStartPodOperator to customize its behavior for the requirements and the environment in which pods are started on Google Kubernetes Engine. The main parameters are listed below, followed by an example task.
project_id: Your GCP project ID.
location: The GCP region or zone where your GKE cluster is located.
cluster_name: The name of your GKE cluster.
name: The name of the pod to start.
namespace: The Kubernetes namespace where the pod will be created.
image: The container image to run in the pod; the image defines the dbt command that will be executed.
arguments (optional): The arguments to pass to the command.
labels (optional): Additional labels to apply to the pod.
is_delete_operator_pod (optional): Whether to delete the pod once the task completes (default is True).
The GKEStartPodOperator also allows you to specify resource requests and limits for the pods it starts, ensuring that they have the necessary CPU and memory resources to execute tasks efficiently.
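A minimal sketch of such a task with these parameters filled in (the project, cluster, image, and model selector are placeholder values, not our platform's actual configuration):

```python
from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

# Defined inside a DAG; all values below are placeholders.
run_dbt_on_gke = GKEStartPodOperator(
    task_id="dbt_run_gke",
    project_id="my-gcp-project",           # GCP project ID
    location="europe-west1",               # region or zone of the GKE cluster
    cluster_name="analytics-cluster",      # GKE cluster name
    name="dbt-run",                        # name of the pod to start
    namespace="dbt",                       # Kubernetes namespace for the pod
    image="europe-docker.pkg.dev/my-gcp-project/images/dbt:latest",  # image with the dbt project
    cmds=["dbt"],
    arguments=["run", "--select", "my_model"],
    labels={"app": "dbt"},
    is_delete_operator_pod=True,           # remove the pod once the task completes
)
```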
DBT provides an API server that allows you to programmatically interact with DBT using HTTP requests. In this case we deploy the DBT API server alongside the DBT installation and use it to trigger DBT runs, query job statuses, and perform other actions programmatically. The DBT API server is designed to handle concurrent requests and scale with your workload. It can be deployed in a scalable, high-availability configuration, allowing you to handle large volumes of DBT jobs and runs without experiencing performance degradation.
The DBT API server exposes a RESTful API that follows standard HTTP conventions. It uses HTTP methods such as GET, POST, and DELETE to perform operations on DBT resources, and you can use any HTTP client library or tool to interact with it. The server exposes endpoints for various DBT resources, including projects, models, runs, artifacts, and logs.
The DBT API server supports various authentication methods, including API tokens, OAuth, and basic authentication. You'll need to authenticate with the API server using valid credentials or tokens to access protected resources and perform actions.
In our platform we trigger runs by sending a POST request to the API:
@app.post("/invocations")
- This endpoint processes a user's request to execute a dbt command on a dbt worker. Upon receiving the request, it creates a task in the task queue. The process waits until the task completes (returns status 200), after which we call the next API, GET /invocations/{task_id}.
@app.get("/invocations/{task_id}")
- This endpoint retrieves the invocation entity associated with the specified {task_id}. By calling this API, you obtain the result of the executed task, which can later be viewed in the Airflow task log.
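A rough sketch of a client for these endpoints (the base URL, request payload, and response fields such as task_id and status are assumptions for illustration, not the exact contract of the API):

```python
import time

import requests

API_URL = "http://dbt-api-server:8080"  # assumed address of the DBT API server


def run_dbt_command(command: str, timeout: int = 600) -> dict:
    """Trigger a dbt invocation and poll until it finishes (payload shape is assumed)."""
    # POST /invocations creates a task in the task queue and returns its identifier.
    response = requests.post(f"{API_URL}/invocations", json={"command": command})
    response.raise_for_status()
    task_id = response.json()["task_id"]  # assumed response field

    deadline = time.time() + timeout
    while time.time() < deadline:
        # GET /invocations/{task_id} returns the invocation entity with its current status.
        result = requests.get(f"{API_URL}/invocations/{task_id}")
        result.raise_for_status()
        invocation = result.json()
        if invocation.get("status") in ("success", "failed"):  # assumed status values
            return invocation
        time.sleep(10)
    raise TimeoutError(f"dbt invocation {task_id} did not finish within {timeout} seconds")


# The returned entity can be printed so the result appears in the Airflow task log:
# print(run_dbt_command("dbt run --select my_model"))
```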