I am trying to do two things:

  1. Use more Java (ecosystem)
  2. Use less Python (get out of the goldilocks zone)

So, the natural answer is to use Kotlin ;)

I wrote a “throwaway” script to download the NYC Yellow Taxi trip data:

import java.net.URL
import java.nio.file.Files
import java.nio.file.Paths

fun main(args: Array<String>) {

    for (year in 2009..2022) {
        for (month in 1..12) {
            val uri = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_${year}-${month.toString().padStart(2, '0')}.parquet"
            val fileName = uri.split("/").last()
            // they only have data up to 2022-06, so stop after June 2022
            if (year == 2022 && month > 6) break
            println("${uri} -> ${fileName}")
            val url = URL(uri)
            // yes, this does not handle exceptions
            // it's a script, YOLO
            url.openStream().use { Files.copy(it, Paths.get(fileName)) }
        }
    }
}

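As an aside, the URL-building part of the script can be pulled out into a small pure function, which makes it easy to check the month range without downloading anything. This is only a sketch: `tripDataUrls` and its parameter names are hypothetical and not part of the original script, and the 2022-06 cutoff is taken from the comment in the script.

```kotlin
// Hypothetical helper (not in the original script): builds the list of
// monthly Parquet URLs without touching the network.
fun tripDataUrls(
    firstYear: Int = 2009,
    lastYear: Int = 2022,
    lastMonthOfLastYear: Int = 6 // assumption: data ends at 2022-06, per the script's comment
): List<String> {
    val base = "https://d37ci6vzurychx.cloudfront.net/trip-data"
    return (firstYear..lastYear).flatMap { year ->
        // Only the final year is cut short; every earlier year has 12 months.
        val lastMonth = if (year == lastYear) lastMonthOfLastYear else 12
        (1..lastMonth).map { month ->
            "$base/yellow_tripdata_$year-${month.toString().padStart(2, '0')}.parquet"
        }
    }
}
```

With the defaults, `tripDataUrls()` yields 162 URLs (13 full years plus 6 months), from 2009-01 through 2022-06; the download loop could then simply iterate over that list.
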
Some observations about this code:

Followup - I plan to:

  1. Take DuckDB for a spin using these parquet data files.
  2. Play with Tantivy and “search indexes” and see if Tantivy et al can be a replacement for Solr for certain use cases.