Installation
$ nix-env -i duckdb
I downloaded the NYC Yellow Taxi Data to play with DuckDB.
$ ls -1 *.parquet | tail -2
yellow_tripdata_2022-05.parquet
yellow_tripdata_2022-06.parquet
$ ls *.parquet | wc -l
162
I decided to load only1 one year’s worth of data:
$ du -ch yellow_tripdata_2011-*.parquet
...
2.1G total
Used the pqrs tool to inspect the schema of the parquet files.
We can query the parquet files even without first loading them into a file. In the following steps, I launched duckdb without specifying a filename, which means the entire operation happened in memory (with likely spillover to the default tmp location).
Enabling profiling PRAGMA enable_profiling
is a cool way to see what happens behind the scenes.
Note: D
is DuckDB’s “prompt”, much like $
for shells or >
is for SQLite.
$ duckdb
v0.0.1-dev0
D PRAGMA enable_profiling;
D SELECT count(*) FROM '*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
SELECT count(*) FROM '*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.202s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ UNGROUPED_AGGREGATE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ count_star() │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1 │
│ (0.05s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC=2388294882 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1678919579 │
│ (1.22s) │
└───────────────────────────┘
┌──────────────┐
│ count_star() │
├──────────────┤
│ 1678919579 │
└──────────────┘
We see that duckdb was able to load all the 2021 files (1,678,919,579 ~ 1.6 billion records!) and do a COUNT(*)
in 1.22 seconds.
In the next step I’ll save this parquet data into a table:
D create table nycyellowtaxi as select * from 'yellow_tripdata_2021*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
create table nycyellowtaxi as select * from 'yellow_tripdata_2021*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 5.88s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ CREATE_TABLE_AS │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1 │
│ (41.28s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ VendorID │
│ tpep_pickup_datetime │
│ tpep_dropoff_datetime │
│ passenger_count │
│ trip_distance │
│ RatecodeID │
│ store_and_fwd_flag │
│ PULocationID │
│ DOLocationID │
│ payment_type │
│ fare_amount │
│ extra │
│ mta_tax │
│ tip_amount │
│ tolls_amount │
│ improvement_surcharge │
│ total_amount │
│ congestion_surcharge │
│ airport_fee │
│ EC=30085308 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 30904308 │
│ (6.62s) │
└───────────────────────────┘
30 million trips in 2021
File size of nyc.db
1.7G
also confirmed by
D PRAGMA database_size;
┌───────────────┬────────────┬──────────────┬─────────────┬─────────────┬──────────┬──────────────┬──────────────┐
│ database_size │ block_size │ total_blocks │ used_blocks │ free_blocks │ wal_size │ memory_usage │ memory_limit │
├───────────────┼────────────┼──────────────┼─────────────┼─────────────┼──────────┼──────────────┼──────────────┤
│ 1.7GB │ 262144 │ 6759 │ 6759 │ 0 │ 0 bytes │ 0 bytes │ 27.4GB │
└───────────────┴────────────┴──────────────┴─────────────┴─────────────┴──────────┴──────────────┴──────────────┘
The PRAGMA database_size
produces the database stats on disk.
This is pretty cool.
2022-10-20: more to come.
Footnotes
-
Difference between
yellow_tripdata_2009-11.parquet
andyellow_tripdata_2021-11.parquet
, the latter hasVendorID
field, while the former hasvendor_name
creating issues in importing all files into one duckdb table. ↩