DuckDB is a great fit for processing the dataset: it is fast and can run SQL queries directly over Parquet files. The following commands download the dataset and then use DuckDB to find the file with the most lines of code ever uploaded to PyPI:
$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt")
$ duckdb -json -s "select * from '*.parquet' order by lines DESC limit 1"
[
  {
    "project_name": "EvenOrOdd",
    "project_version": "0.1.10",
    "project_release": "EvenOrOdd-0.1.10-py3-none-any.whl",
    "uploaded_on": "2021-02-21 02:25:57.832",
    "path": "EvenOrOdd/EvenOrOdd.py",
    "size": "514133366",
    "hash": "ff7f863ad0bb4413c939fb5e9aa178a5a8855774262e1171b876d1d2b51e6998",
    "skip_reason": "too-large",
    "lines": "20010001"
  }
]
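If you'd rather stay in Python, the same query works through DuckDB's Python API. Here's a minimal sketch, assuming the Parquet files from the download step are sitting in the current directory:

import duckdb

# Query the downloaded Parquet files in place; no table loading needed.
row = duckdb.sql(
    "select project_name, path, lines from '*.parquet' order by lines desc limit 1"
).fetchone()
print(row)  # ('EvenOrOdd', 'EvenOrOdd/EvenOrOdd.py', 20010001)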
Woah, a whopping 20 million lines of code! Let's confirm it:
$ wget https://files.pythonhosted.org/packages/b2/82/c4265814ed9e68880ba0892eddf1664c48bb490f37113d74d32fe4757192/EvenOrOdd-0.1.10-py3-none-any.whl
$ unzip EvenOrOdd-0.1.10-py3-none-any.whl
$ wc -l EvenOrOdd/EvenOrOdd.py
20010000 EvenOrOdd/EvenOrOdd.py
$ tail -n10 EvenOrOdd/EvenOrOdd.py
    elif num == 9999996:
        return True
    elif num == 9999997:
        return False
    elif num == 9999998:
        return True
    elif num == 9999999:
        return False
    else:
        raise Exception("Number is not within bounds")
Very funny, I hope this module is a joke 😅
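For scale: roughly 10 million branches at two lines each accounts for the ~20 million lines. A generator for such a module fits in a dozen lines. This is a hypothetical sketch; the function name, bound, and output path are my own illustration, not taken from the actual package:

# Hypothetical sketch: the function name, bound, and output path are
# illustrative assumptions, not taken from the actual package.
BOUND = 10_000_000

with open("EvenOrOdd.py", "w") as f:
    f.write("def is_even(num):\n")
    f.write("    if num == 0:\n        return True\n")
    for n in range(1, BOUND):
        # Each branch contributes two lines, ~20 million in total.
        f.write(f"    elif num == {n}:\n        return {n % 2 == 0}\n")
    f.write('    else:\n        raise Exception("Number is not within bounds")\n')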