DuckDB is a great fit for processing the dataset: it is fast and can run SQL queries directly over Parquet files. The following commands download the dataset and then use DuckDB to find the file with the most lines of code ever uploaded to PyPI:
$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt")
$ duckdb -json -s "select * from '*.parquet' order by lines DESC limit 1"
[
  {
    "project_name": "EvenOrOdd",
    "project_version": "0.1.10",
    "project_release": "EvenOrOdd-0.1.10-py3-none-any.whl",
    "uploaded_on": "2021-02-21 02:25:57.832",
    "path": "EvenOrOdd/EvenOrOdd.py",
    "size": "514133366",
    "hash": "ff7f863ad0bb4413c939fb5e9aa178a5a8855774262e1171b876d1d2b51e6998",
    "skip_reason": "too-large",
    "lines": "20010001"
  }
]
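If you'd rather stay in Python, the same query works through DuckDB's Python API. Here's a minimal sketch, assuming the Parquet files from the download step are sitting in the current directory:

import duckdb

# Query the downloaded Parquet files in place; no table loading needed.
row = duckdb.sql(
    "select project_name, path, lines from '*.parquet' order by lines desc limit 1"
).fetchone()
print(row)  # ('EvenOrOdd', 'EvenOrOdd/EvenOrOdd.py', 20010001)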
Woah, a whopping 20 million lines of code! Let's confirm it:
$ wget https://files.pythonhosted.org/packages/b2/82/c4265814ed9e68880ba0892eddf1664c48bb490f37113d74d32fe4757192/EvenOrOdd-0.1.10-py3-none-any.whl
$ unzip EvenOrOdd-0.1.10-py3-none-any.whl
$ wc -l EvenOrOdd/EvenOrOdd.py
20010000 EvenOrOdd/EvenOrOdd.py
$ tail -n10 EvenOrOdd/EvenOrOdd.py
    elif num == 9999996:
        return True
    elif num == 9999997:
        return False
    elif num == 9999998:
        return True
    elif num == 9999999:
        return False
    else:
        raise Exception("Number is not within bounds")
Very funny, I hope this module is a joke 😅
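For scale: roughly 10 million branches at two lines each accounts for the ~20 million lines. A generator for such a module fits in a dozen lines. This is a hypothetical sketch; the function name, bound, and output path are my own illustration, not taken from the actual package:

# Hypothetical sketch: the function name, bound, and output path are
# illustrative assumptions, not taken from the actual package.
BOUND = 10_000_000

with open("EvenOrOdd.py", "w") as f:
    f.write("def is_even(num):\n")
    f.write("    if num == 0:\n        return True\n")
    for n in range(1, BOUND):
        # Each branch contributes two lines, ~20 million in total.
        f.write(f"    elif num == {n}:\n        return {n % 2 == 0}\n")
    f.write('    else:\n        raise Exception("Number is not within bounds")\n')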