Data pipeline with DLT, dbt, Prefect, and ClickHouse | Candido Sales

Lately, I’ve been studying Data Engineering and Data Pipelines. I thought it would be interesting to share a bit of what I’m learning.

I wanted to build a simple proof of concept for a data pipeline that can be replicated in any environment: the whole stack is defined with Docker Compose, and dbt brings software engineering practices to the transformation layer.

Cost is also one of the principles of this architecture, so I looked for the tools that offered the greatest simplicity at the lowest possible cost. I chose Prefect as the pipeline orchestrator, dlt for ingestion, dbt for transformation, and ClickHouse as the data warehouse that stores the transformed data.

To make it a bit more challenging, I wanted to ingest data from an MS SQL Server database, which is a database I don’t usually work with day-to-day. The dataset would be about New York taxis, which is a public and well-known dataset in the data community.

The solution’s architecture looks like this:

[Figure: solution architecture]

Create the environment with Docker Compose

The first step is to create the environment with Docker Compose, where I will start the source database. I created the docker-compose.yml file in the root folder of the project with the following content:

services:
  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    container_name: sqlserver
    environment:
      ACCEPT_EULA: 'Y'
      MSSQL_SA_PASSWORD: 'YourStrong!Passw0rd'
    ports:
      - '1433:1433'
    volumes:
      - sqlserverdata:/var/opt/mssql

# Named volumes must be declared at the top level
volumes:
  sqlserverdata:

Then I initialize the container:

docker-compose up -d

Import data into SQL Server

I downloaded the data from the NYC Taxi & Limousine Commission (TLC) Trip Record Data through this link, and saved the file in the dataset folder of my project.

To import the .bak backup file into the SQL Server container, I use the following command:

docker cp ./dataset/NYCTaxi_Sample.bak sqlserver:/var/opt/mssql/data/NYCTaxi_Sample.bak

Next, I execute the database restore command:

docker exec -it sqlserver /opt/mssql-tools18/bin/sqlcmd \
   -S localhost -U sa -P 'YourStrong!Passw0rd' -C \
   -Q 'RESTORE DATABASE NYCTaxi_Sample FROM DISK = "/var/opt/mssql/data/NYCTaxi_Sample.bak" WITH MOVE "NYCTaxi_Sample" TO "/var/opt/mssql/data/NYCTaxi_Sample.mdf", MOVE "NYCTaxi_Sample_log" TO "/var/opt/mssql/data/NYCTaxi_Sample_log.ldf"'

Create project structure

Now that the database is ready, I will create the project structure with the tools I will use: DLT, ClickHouse, dbt, and Prefect.

First, I organize the project folder structure:

data-engineer/
├── dataset/            # Raw data backups
├── nyc_taxi/
│   ├── main_flow.py    # Prefect orchestrator
│   ├── extract_sqlserver.py  # dlt ingestion logic
│   └── nyc_taxi_dbt/   # dbt project
│       ├── models/     # SQL transformation models
│       └── profiles.yml # dbt connection settings
└── docker-compose.yml  # Infrastructure definition

Inside the nyc_taxi folder, I create the Python virtual environment and install the necessary dependencies using uv, a dependency and virtual environment manager for Python projects written in Rust (I highly recommend checking it out).

brew install uv

Then, I initialize the uv project:

uv init nyc_taxi

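I add the project dependencies:

uv add dbt-core dbt-sqlserver dbt-clickhouse prefect prefect-client

The scripts later in this post also import dlt and prefect_dbt, so those packages need to be installed too. The package names and extras below are my assumption; check them against the versions you use:

uv add "dlt[clickhouse,sql_database]" prefect-dbt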

Next, I create the dbt project.

Configure dbt

dbt (data build tool) is a data transformation tool that lets data engineers and analysts transform, test, and document data in their data warehouses. To create the dbt project, I navigate to the nyc_taxi folder and run the command:

cd nyc_taxi
dbt init nyc_taxi_dbt

My dbt_project.yml file looks like this:

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'nyc_taxi_dbt'
version: '1.0.0'

# This setting configures which "profile" dbt uses for this project.
profile: 'nyc_taxi_clickhouse'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ['models']
analysis-paths: ['analyses']
test-paths: ['tests']
seed-paths: ['seeds']
macro-paths: ['macros']
snapshot-paths: ['snapshots']

clean-targets: # directories to be removed by `dbt clean`
  - 'target'
  - 'dbt_packages'

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  nyc_taxi_dbt:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

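I configure the profiles.yml file to connect to ClickHouse. By default dbt looks for it in ~/.dbt/profiles.yml; the Prefect flow later in this post sets profiles_dir to nyc_taxi_dbt, so a copy also needs to live in that folder, as shown in the project tree above: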

nyc_taxi_clickhouse:
  target: dev
  outputs:
    dev:
      type: clickhouse
      host: localhost
      port: 8123
      user: default
      password: password
      schema: nyc_taxi
      threads: 4

Once ClickHouse is running (it is added to Docker Compose further below), you can check that the dbt project is configured correctly and can connect by running the command:

uv run dbt debug

Configure dbt models

To transform the data, I create the following folders and files inside nyc_taxi/nyc_taxi_dbt/models/:

data-engineer/
├── nyc_taxi/
│   └── nyc_taxi_dbt/   # dbt project
│       └── models/     # SQL Transformation models
│           ├── staging/
│           │   ├── sources.yml
│           │   └── stg_nyctaxi_sample.sql
│           └── marts/
│               └── fact_nyctaxi_trips.sql

The sources.yml file defines the data source:

version: 2

sources:
  - name: clickhouse_staging
    database: nyc_taxi
    tables:
      - name: nyctaxi_sample
        identifier: nyc_taxi_staging___nyctaxi_sample
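
The identifier above follows the way dlt names tables in ClickHouse: as far as I understand it, the dataset name and the table name are joined with a separator that defaults to three underscores. A small sketch of that convention; the helper name is hypothetical, and the separator default is an assumption worth checking against your dlt version:

```python
def clickhouse_table_identifier(dataset_name: str, table_name: str, separator: str = "___") -> str:
    """Join dataset and table name the way the identifier in sources.yml is written."""
    return f"{dataset_name}{separator}{table_name}"


# Dataset and table names taken from sources.yml
identifier = clickhouse_table_identifier("nyc_taxi_staging", "nyctaxi_sample")
print(identifier)
```

If the dataset name used by the dlt pipeline differs, the identifier in sources.yml has to change with it.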

The stg_nyctaxi_sample.sql file defines the staging model, materialized as a view:

{{ config(materialized='view') }}

with source as (
    select * from {{ source('clickhouse_staging', 'nyctaxi_sample') }}
),

renamed as (
    select
        medallion,
        hack_license,
        vendor_id,
        rate_code,
        store_and_fwd_flag,
        pickup_datetime,
        dropoff_datetime,
        passenger_count,
        trip_time_in_secs,
        trip_distance,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        payment_type,
        fare_amount,
        surcharge,
        mta_tax,
        tolls_amount,
        total_amount,
        tip_amount,
        tipped,
        tip_class,
        _dlt_load_id,
        _dlt_id
    from source
)

select * from renamed

The fact_nyctaxi_trips.sql file creates the fact table:

{{ config(
    materialized='incremental',
    engine='ReplacingMergeTree',
    order_by=['medallion', 'hack_license', 'pickup_datetime'],
    unique_key='_dlt_id',
    incremental_strategy='append'
) }}

-- ReplacingMergeTree handles duplicates automatically based on the ORDER BY keys
-- when merges happen. In dbt-clickhouse, 'incremental' with 'append' is often used
-- with ReplacingMergeTree to let the engine handle deduplication.

with staging as (
    select * from {{ ref('stg_nyctaxi_sample') }}
    {% if is_incremental() %}
    where pickup_datetime > (select max(pickup_datetime) from {{ this }})
    {% endif %}
)

select * from staging

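I use ReplacingMergeTree because it lets ClickHouse replace duplicate records automatically: rows that share the same sorting key (the order_by columns, in this case medallion, hack_license, and pickup_datetime) are collapsed into the most recently inserted one when data parts are merged in the background. This is especially useful for incremental loads, where new data might contain updates or corrections for existing records. Because merges run asynchronously, duplicates can remain visible for a while; adding FINAL to a query forces deduplication at read time.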
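
To make the replacement behaviour concrete, here is a minimal pure-Python sketch of what a merge does conceptually. It is not how ClickHouse implements it; the function name and the sample rows are made up for illustration, and only the sorting key columns come from the model above:

```python
from datetime import datetime

# Sorting key columns, mirroring order_by in fact_nyctaxi_trips.sql
SORTING_KEY = ("medallion", "hack_license", "pickup_datetime")


def merge_parts(rows):
    """Keep only the last inserted row for each sorting key value."""
    latest = {}
    for row in rows:  # rows are assumed to be in insertion order
        key = tuple(row[column] for column in SORTING_KEY)
        latest[key] = row  # a later row replaces an earlier one with the same key
    return list(latest.values())


# Illustrative rows: the third one corrects the fare of the first
pickup = datetime(2013, 1, 1, 8, 0)
rows = [
    {"medallion": "A1", "hack_license": "H1", "pickup_datetime": pickup, "fare_amount": 10.0},
    {"medallion": "B2", "hack_license": "H2", "pickup_datetime": pickup, "fare_amount": 7.5},
    {"medallion": "A1", "hack_license": "H1", "pickup_datetime": pickup, "fare_amount": 12.5},
]

merged = merge_parts(rows)
for row in merged:
    print(row["medallion"], row["fare_amount"])
```

Until a merge has happened, both versions of the A1 trip would still be stored, which is why a count taken right after a load can be higher than expected.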

I also split the models into staging and marts, following dbt’s recommended project organization: raw data is first exposed through staging models and then transformed into fact or dimension tables.

If you want to learn more about modeling strategies like dimensions and facts, you can read this article.

The resulting data flow looks like this:

[Raw Data in ClickHouse] --> [StagingTable: stg_nyctaxi_sample] --> [FactTable: fact_nyctaxi_trips]

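With this, the dbt configuration is finished. Once ClickHouse is running and the raw table has been loaded (both are covered in the next sections), the models can be tested by running the command:

uv run dbt run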

Configure Docker Compose for ClickHouse and Prefect

Now I will add the ClickHouse and Prefect services to the docker-compose.yml file:

services:
  # sqlserver: the SQL Server service created earlier stays here, unchanged
  clickhouse:
    image: clickhouse/clickhouse-server
    container_name: clickhouse
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: password
      CLICKHOUSE_DB: nyc_taxi
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: 1
    ports:
      - '8123:8123'
      - '9000:9000'
    volumes:
      - clickhousedata:/var/lib/clickhouse
  prefect:
    image: prefecthq/prefect:3-python3.12
    container_name: prefect
    environment:
      PREFECT_SERVER_API_HOST: 0.0.0.0
    command: prefect server start --no-services
    ports:
      - '4200:4200'
    volumes:
      - prefectdata:/var/lib/prefect
volumes:
  sqlserverdata:
  clickhousedata:
  prefectdata:

Let’s start the ClickHouse and Prefect containers:

docker-compose up -d clickhouse prefect

Create the orchestration flow with Prefect

Now I will create the orchestration flow using Prefect. I create the main_flow.py file inside the nyc_taxi folder with the following content:

import logging
from prefect import flow, task
from extract_sqlserver import load_sql_server_to_clickhouse
from prefect_dbt import PrefectDbtRunner, PrefectDbtSettings

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@task(retries=3, retry_delay_seconds=60)
def extract_task():
    logger.info("Starting extraction task...")
    load_sql_server_to_clickhouse()
    logger.info("Extraction task completed.")

@task
def dbt_run_task():
    logger.info("Starting dbt run task...")
    result = PrefectDbtRunner(
        settings=PrefectDbtSettings(
            project_dir="nyc_taxi_dbt",
            profiles_dir="nyc_taxi_dbt"
        )
    ).invoke(["build"])
    logger.info("dbt run task completed.")

@task
def data_quality_checks():
    logger.info("Running data quality checks...")
    # This could be more dbt tests or custom SQL checks
    # For now, we'll assume dbt build (which includes tests) covers this.
    logger.info("Data quality checks passed.")

@flow(name="nyc_taxi_etl")
def nyc_taxi_pipeline():
    extract_task()
    dbt_run_task()
    data_quality_checks()
    # Alerting can be handled by Prefect's native automation or state handlers
    logger.info("Pipeline completed successfully.")

if __name__ == "__main__":
    nyc_taxi_pipeline.serve(name="nyc_taxi_pipeline", cron="0 */12 * * *")
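
The extraction task is declared with retries=3 and retry_delay_seconds=60, which means one initial attempt plus up to three more, with a 60-second pause between them. The sketch below only illustrates that behaviour in plain Python; it is not Prefect's implementation, and run_with_retries and flaky_extract are hypothetical names:

```python
import time


def run_with_retries(fn, retries=3, retry_delay_seconds=60, sleep=time.sleep):
    """Call fn once, then retry up to `retries` more times if it raises."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay_seconds)


calls = []
delays = []


def flaky_extract():
    """Stand-in for the extraction: fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("SQL Server not reachable yet")
    return "loaded"


# Passing delays.append as `sleep` records the waits instead of actually sleeping
result = run_with_retries(flaky_extract, sleep=delays.append)
print(result, len(calls), delays)
```

Retries fit the extraction step well because a source database that is still starting up is a transient failure; the dbt task has no retries, so a failing model or test stops the flow right away.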

Next, I create the extract_sqlserver.py file with the logic to extract data from SQL Server to ClickHouse:

import logging
import dlt
from dlt.sources.sql_database import sql_database

def load_sql_server_to_clickhouse():
    # Configure logging to see what's happening
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)

    logger.info("Starting SQL Server to ClickHouse extraction...")
    # Configure the destination: ClickHouse
    # dlt will automatically pick up credentials from .dlt/secrets.toml
    pipeline = dlt.pipeline(
        pipeline_name="sql_server_to_clickhouse",
        destination="clickhouse",
        # Assumption: set explicitly so the loaded table matches the identifier
        # in sources.yml (nyc_taxi_staging___nyctaxi_sample). Drop this line if
        # the dataset name is already set in your dlt configuration.
        dataset_name="nyc_taxi_staging",
    )

    # Define the source
    # dlt will automatically pick up credentials for sql_database from .dlt/secrets.toml
    source = sql_database().with_resources("nyctaxi_sample")

    # Configure incremental loading and primary keys for the resource.
    # The merge write disposition needs a primary key to match incoming rows
    # against rows that were already loaded.
    # medallion + hack_license + pickup_datetime seem like a good candidate for a unique key in this sample.
    source.nyctaxi_sample.apply_hints(
        incremental=dlt.sources.incremental("pickup_datetime"),
        primary_key=["medallion", "hack_license", "pickup_datetime"]
    )

    # Run the pipeline
    # Use 'merge' so rows with an existing primary key are updated instead of duplicated.
    info = pipeline.run(source, write_disposition="merge")

    print(info)

if __name__ == "__main__":
    load_sql_server_to_clickhouse()
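
The incremental hint on pickup_datetime means each run only asks the source for rows newer than the highest value seen so far. Here is a simplified sketch of that cursor idea in plain Python; it is not dlt's implementation (which also handles rows sitting exactly on the boundary), and the function name, the state dictionary, and the sample rows are hypothetical:

```python
from datetime import datetime


def extract_incremental(source_rows, state, cursor_column="pickup_datetime"):
    """Return rows newer than the stored cursor value and advance the cursor."""
    last_value = state.get("last_value")
    new_rows = [
        row for row in source_rows
        if last_value is None or row[cursor_column] > last_value
    ]
    if new_rows:
        state["last_value"] = max(row[cursor_column] for row in new_rows)
    return new_rows


# Illustrative source table and empty pipeline state
table = [
    {"medallion": "A1", "pickup_datetime": datetime(2013, 1, 1, 8, 0)},
    {"medallion": "B2", "pickup_datetime": datetime(2013, 1, 1, 9, 0)},
]
state = {}

first_run = extract_incremental(table, state)   # everything is new
table.append({"medallion": "C3", "pickup_datetime": datetime(2013, 1, 1, 10, 0)})
second_run = extract_incremental(table, state)  # only the appended row
third_run = extract_incremental(table, state)   # nothing new

print(len(first_run), len(second_run), len(third_run))
```

The incremental filter in fact_nyctaxi_trips.sql applies the same idea on the dbt side, using max(pickup_datetime) of the existing fact table as the cursor.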

With this, let’s run the orchestration flow from the nyc_taxi folder, where the uv virtual environment is located:

cd nyc_taxi
uv run python main_flow.py
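
Running main_flow.py calls serve with the cron expression 0 */12 * * *, which means minute 0 of every hour divisible by 12, so the flow is scheduled for 00:00 and 12:00 each day. The sketch below is not how Prefect evaluates cron expressions; it is a simplified illustration for this one pattern, with a hypothetical helper name and time zones ignored:

```python
from datetime import datetime, timedelta


def next_runs(start, count=3, every_hours=12):
    """Upcoming fire times for '0 */N * * *': minute 0 of hours divisible by N."""
    # First full hour strictly after `start`
    candidate = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    runs = []
    while len(runs) < count:
        if candidate.hour % every_hours == 0:
            runs.append(candidate)
        candidate += timedelta(hours=1)
    return runs


# Illustrative start time
runs = next_runs(datetime(2025, 1, 1, 9, 30))
for run in runs:
    print(run.isoformat())
```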

You can monitor the flow execution by accessing the Prefect dashboard at http://localhost:4200.

[Figure: Prefect runs]

[Figure: Prefect]

In the flow logs, you will see the data extraction, transformation, and loading steps:

[Figure: running pipeline]

You can check the data loaded into ClickHouse using the built-in web client at http://localhost:8123/play or any SQL query tool compatible with ClickHouse.
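
ClickHouse also answers plain HTTP requests on port 8123, so a row count on the fact table can be checked without any client library. Below is a minimal sketch using only the standard library. The nyc_taxi database and the credentials come from the configuration above, while the function name is hypothetical. The request is only built here; the commented lines show how it could be sent once the containers are running:

```python
import urllib.request


def build_count_request(host="localhost", port=8123, user="default",
                        password="password", table="nyc_taxi.fact_nyctaxi_trips"):
    """Build an HTTP request that asks ClickHouse for the row count of a table."""
    query = f"SELECT count() FROM {table}"
    return urllib.request.Request(
        url=f"http://{host}:{port}/",
        data=query.encode("utf-8"),  # the query travels in the POST body
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
        method="POST",
    )


request = build_count_request()
print(request.get_method(), request.full_url)

# To send it against the running container:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode().strip())
```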

Here you can see the fact table fact_nyctaxi_trips created in ClickHouse:

[Figure: ClickHouse running]

Configure ClickHouse UI

To facilitate data visualization in ClickHouse, you can use ClickHouse UI.

Let’s add the ClickHouse UI service to the docker-compose.yml file:

  ch-ui:
    image: ghcr.io/caioricciuti/ch-ui:latest
    restart: always
    ports:
      - '5521:5521'
    environment:
      # Core ClickHouse Configuration
      VITE_CLICKHOUSE_URL: 'http://localhost:8123'
      VITE_CLICKHOUSE_USER: 'default'
      VITE_CLICKHOUSE_PASS: 'password'

      # Optional: Advanced Features
      VITE_CLICKHOUSE_USE_ADVANCED: 'false'
      VITE_CLICKHOUSE_CUSTOM_PATH: ''
      VITE_CLICKHOUSE_REQUEST_TIMEOUT: '30000'

      # Optional: Reverse Proxy Support
      VITE_BASE_PATH: '/'

Then, I start the ClickHouse UI container:

docker-compose up -d ch-ui

You can access ClickHouse UI at http://localhost:5521 to explore the loaded data. In the image below, I ran a query on the fact_nyctaxi_trips table to count all recorded trips; it took only 1.18 ms for the 1,703,957 records loaded. It’s very fast!

[Figure: ClickHouse UI]

Conclusion

It is very rewarding to see how all these tools can work together to create an efficient and scalable data pipeline. The use of DLT for ingestion, dbt for transformation, Prefect for orchestration, and ClickHouse as a data warehouse provides a robust and high-performance solution that can be easily replicated in different environments thanks to Docker Compose.

There are other orchestration tools like Dagster or Airflow, but I chose Prefect for its simplicity and ease of use, and because flows are plain Python that runs from the codebase itself.

For data ingestion there is also Airbyte, but I chose dlt because it is a modern, efficient Python library that can use Apache Arrow, which pairs well with ClickHouse, known for its speed and its ability to handle large volumes of data.

Regarding dbt, I could have used dbt Fusion, the new engine written in Rust that also uses Apache Arrow, but it is still in beta, so I opted for traditional dbt.

Apache Arrow is a powerful technology that is gaining ground in the data ecosystem, and it is interesting to see how it is being integrated into various modern data engineering tools.

The Arrow format is a better way to represent tabular data in memory than native Python objects (list of dictionaries). It allows offloading processing to Arrow’s fast C++ library and avoids row-by-row processing. If you are interested in understanding more about it, I recommend this video.
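
The difference between the two layouts can be shown without Arrow itself. The sketch below uses plain Python lists as a stand-in for columnar storage, so it illustrates only the shape of the data, not Arrow's memory format or speed; the sample values are made up:

```python
# Row-oriented: a list of dictionaries, one per trip
rows = [
    {"trip_distance": 1.2, "fare_amount": 6.5},
    {"trip_distance": 3.4, "fare_amount": 12.0},
    {"trip_distance": 0.8, "fare_amount": 5.0},
]

# Summing one field means visiting every row and looking the key up each time
total_from_rows = sum(row["fare_amount"] for row in rows)

# Column-oriented: one sequence per column, which is the layout Arrow uses
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The same sum now reads a single sequence of values
total_from_columns = sum(columns["fare_amount"])

print(total_from_rows, total_from_columns)
```

With a real Arrow table, that last step is what gets handed to the C++ library instead of being looped over in Python.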

You can check the repository with all the code used in this project on my GitHub, in addition to finding examples using Airflow and Dagster: data-engineer-dlt-dbt-prefect-clickhouse