Encounter with DuckDB

I get familiar with duckdb

Created: by Pradeep Gowda Updated: Nov 04, 2023 Tagged: duckdb

Installation

$ nix-env -i duckdb

I downloaded the NYC Yellow Taxi Data to play with DuckDB.

$ ls -1 *.parquet | tail -2
yellow_tripdata_2022-05.parquet
yellow_tripdata_2022-06.parquet

$ ls *.parquet | wc -l
     162

I decided to load only1 one year’s worth of data:

$ du -ch yellow_tripdata_2011-*.parquet
...
2.1G total

Used the pqrs tool to inspect the schema of the parquet files.

We can query the parquet files even without first loading them into a file. In the following steps, I launched duckdb without specifying a filename, which means the entire operation happened in memory (with likely spillover to the default tmp location).

Enabling profiling PRAGMA enable_profiling is a cool way to see what happens behind the scenes.

Note: D is DuckDB’s “prompt”, much like $ for shells or > is for SQLite.

$ duckdb
v0.0.1-dev0

D PRAGMA enable_profiling;
D SELECT count(*) FROM '*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
SELECT count(*) FROM '*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 0.202s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        count_star()       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.05s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC=2388294882       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         1678919579        │
│          (1.22s)          │
└───────────────────────────┘



┌──────────────┐
│ count_star() │
├──────────────┤
│ 1678919579   │
└──────────────┘

We see that duckdb was able to load all the 2021 files (1,678,919,579 ~ 1.6 billion records!) and do a COUNT(*) in 1.22 seconds.

In the next step I’ll save this parquet data into a table:



D create table nycyellowtaxi as select * from 'yellow_tripdata_2021*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
create table nycyellowtaxi as select * from 'yellow_tripdata_2021*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 5.88s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      CREATE_TABLE_AS      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (41.28s)         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          VendorID         │
│    tpep_pickup_datetime   │
│   tpep_dropoff_datetime   │
│      passenger_count      │
│       trip_distance       │
│         RatecodeID        │
│     store_and_fwd_flag    │
│        PULocationID       │
│        DOLocationID       │
│        payment_type       │
│        fare_amount        │
│           extra           │
│          mta_tax          │
│         tip_amount        │
│        tolls_amount       │
│   improvement_surcharge   │
│        total_amount       │
│    congestion_surcharge   │
│        airport_fee        │
│        EC=30085308        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          30904308         │
│          (6.62s)          │
└───────────────────────────┘
30 million trips in 2021

File size of nyc.db
1.7G
also confirmed by

D PRAGMA database_size;
┌───────────────┬────────────┬──────────────┬─────────────┬─────────────┬──────────┬──────────────┬──────────────┐
│ database_size │ block_size │ total_blocks │ used_blocks │ free_blocks │ wal_size │ memory_usage │ memory_limit │
├───────────────┼────────────┼──────────────┼─────────────┼─────────────┼──────────┼──────────────┼──────────────┤
│ 1.7GB         │ 262144     │ 6759         │ 6759        │ 0           │ 0 bytes  │ 0 bytes      │ 27.4GB       │
└───────────────┴────────────┴──────────────┴─────────────┴─────────────┴──────────┴──────────────┴──────────────┘

The PRAGMA database_size produces the database stats on disk.

This is pretty cool.

2022-10-20: more to come.


  1. Difference between yellow_tripdata_2009-11.parquet and yellow_tripdata_2021-11.parquet, the latter has VendorID field, while the former has vendor_name creating issues in importing all files into one duckdb table.↩︎