Datasets

Download datasets locally

There are several datasets available for use:

  1. Metadata about every file uploaded to PyPI
  2. SQLite dump of all PyPI metadata
  3. Repository metadata
  4. Unique Python files within every release

These datasets allow you to analyse the contents of PyPI without having to download and process every package yourself. All of the statistics within the stats page are periodically generated using the datasets below.

Metadata about every file uploaded to PyPI

About

This dataset contains information about every file within every release uploaded to PyPI, including:

  1. Project name, version and release upload date
  2. File path, size and line count
  3. SHA256 hash

The dataset should be accessed by downloading the files listed in https://github.com/pypi-data/data/raw/main/links/dataset.txt. The following command downloads every file listed there:

$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt")
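
For readers who prefer to script the download, here is a minimal Python sketch of the same two-step process: fetch the links file, then fetch every file it lists. It assumes the links file contains one URL per line, which the curl command above implies but this page does not state. The helper names (parse_links, local_name, download_all) are illustrative only.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

LINKS_URL = "https://github.com/pypi-data/data/raw/main/links/dataset.txt"


def parse_links(text: str) -> list[str]:
    """Return the non-empty lines of a links file as a list of URLs."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def local_name(url: str) -> str:
    """Use the last path component of the URL as the local file name."""
    return Path(urlparse(url).path).name


def download_all(links_url: str = LINKS_URL, dest: Path = Path(".")) -> None:
    """Fetch the links file, then download every file it lists into dest."""
    with urllib.request.urlopen(links_url) as response:
        urls = parse_links(response.read().decode("utf-8"))
    for url in urls:
        target = dest / local_name(url)
        print(f"downloading {url} -> {target}")
        urllib.request.urlretrieve(url, target)


# Calling download_all() fetches the whole dataset (tens of GiB),
# so it is left commented out here.
# download_all()
```

Splitting the parsing from the downloading keeps the part that touches the network small, so the URL handling can be checked without fetching anything.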

Using DuckDB to process the dataset

DuckDB is a great tool for processing the dataset. It is very fast and supports SQL queries over Parquet files. The following commands download the dataset and use DuckDB to find the file with the most lines ever uploaded to PyPI:

$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/dataset.txt")
$ duckdb -json -s "select * from '*.parquet' order by lines DESC limit 1"
[
  {
    "project_name": "EvenOrOdd",
    "project_version": "0.1.10",
    "project_release": "EvenOrOdd-0.1.10-py3-none-any.whl",
    "uploaded_on": "2021-02-21 02:25:57.832",
    "path": "EvenOrOdd/EvenOrOdd.py",
    "size": "514133366",
    "hash": "ff7f863ad0bb4413c939fb5e9aa178a5a8855774262e1171b876d1d2b51e6998",
    "skip_reason": "too-large",
    "lines": "20010001"
  }
]

Woah, a whopping 20 million lines of code! Let's confirm it:

$ wget https://files.pythonhosted.org/packages/b2/82/c4265814ed9e68880ba0892eddf1664c48bb490f37113d74d32fe4757192/EvenOrOdd-0.1.10-py3-none-any.whl
$ unzip EvenOrOdd-0.1.10-py3-none-any.whl
$ wc -l EvenOrOdd/EvenOrOdd.py
 20010000 EvenOrOdd/EvenOrOdd.py

$ tail -n10 EvenOrOdd/EvenOrOdd.py
    elif num == 9999996:
        return True
    elif num == 9999997:
        return False
    elif num == 9999998:
        return True
    elif num == 9999999:
        return False
    else:
        raise Exception("Number is not within bounds")

Very funny, I hope this module is a joke 😅

About skipped files

The dataset contains a skip_reason column. If a file is not present in the git repositories, the reason for skipping it is recorded here. Below is a list of the current skip reasons and the number of files excluded from the git repositories for each reason.

The exact reasons for skipping a file are not fully documented here, but ignored files include virtual environments accidentally uploaded to PyPI. text-long-lines means the file had very few lines, but the total size was large.

Skipped reasons:

skip reason       count         total size
virtualenv        11,699,281    112.3 GiB
binary            98,573,205    58.2 TiB
text-long-lines   1,626,791     166.5 GiB
too-large         8,507,853     9.5 TiB
empty             38,282,284    0 B
version-control   228,413       219.3 MiB
Total             158,917,827   68.0 TiB
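
To put the counts in proportion, the short sketch below turns them into percentages of all skipped files. The numbers are copied from the table above. The variable names are illustrative, and nothing here reads the dataset itself.

```python
# File counts per skip reason, copied from the table above.
SKIP_COUNTS = {
    "virtualenv": 11_699_281,
    "binary": 98_573_205,
    "text-long-lines": 1_626_791,
    "too-large": 8_507_853,
    "empty": 38_282_284,
    "version-control": 228_413,
}

total = sum(SKIP_COUNTS.values())

# Share of all skipped files, in percent, for each reason.
shares = {reason: 100 * count / total for reason, count in SKIP_COUNTS.items()}

# Print the reasons from most to least common.
for reason, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{reason:<16} {share:5.1f}%")
```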

Current Links

URL                                                                                    Size
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-0.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-2.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-3.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-1.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-5.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-4.parquet   1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-12.parquet  1.8 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-11.parquet  1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-6.parquet   1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-14.parquet  1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-13.parquet  1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-9.parquet   1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-8.parquet   1.7 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-10.parquet  1.6 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-7.parquet   1.6 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-15.parquet  1.6 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-16.parquet  1.2 GiB
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/index-17.parquet  953.9 MiB
18 links                                                                               29.9 GiB

Schema

[
  {
    "project_name": "tf-nightly-tpu",
    "project_version": "2.17.0.dev20240222",
    "project_release": "tf_nightly_tpu-2.17.0.dev20240222-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "uploaded_on": "2024-02-22 17:29:51.001",
    "path": "packages/tf-nightly-tpu/tf_nightly_tpu-2.17.0.dev20240222-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl/tensorflow/include/external/com_github_grpc_grpc/include/grpc++/impl/codegen/method_handler_impl.h",
    "archive_path": "tensorflow/include/external/com_github_grpc_grpc/include/grpc++/impl/codegen/method_handler_impl.h",
    "size": 987,
    "hash": "\\x0B\\xDBb\\x06\\x16\\xBD\\x84\\xDD\\x12\\x03\\x093\\x9A\\xE7 \\xAC\\xF1\\xCA\\x98\\x0D",
    "skip_reason": "",
    "lines": 28,
    "repository": 258
  },
  {
    "project_name": "rendercv",
    "project_version": "1.5",
    "project_release": "rendercv-1.5-py3-none-any.whl",
    "uploaded_on": "2024-03-27 18:15:31.604",
    "path": "packages/rendercv/rendercv-1.5-py3-none-any.whl/rendercv/tinytex-release/TinyTeX/texmf-dist/fonts/type1/public/amsfonts/cm/cmssi12.pfm",
    "archive_path": "rendercv/tinytex-release/TinyTeX/texmf-dist/fonts/type1/public/amsfonts/cm/cmssi12.pfm",
    "size": 1154,
    "hash": "\\xD1\\xB1\\x11\\x13\\xC2\\xAB\\x94_y6\\x95\\xE6\\x18&\\x19\\xDEr\\xDCJ\\x8D",
    "skip_reason": "binary",
    "lines": 0,
    "repository": 264
  }
]
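
In the two sample rows, path appears to be built as packages/<project_name>/<project_release>/<archive_path>. The sketch below splits a path value on that assumption. The layout is inferred from the samples only and is not documented here, and split_dataset_path is a hypothetical helper name.

```python
def split_dataset_path(path: str) -> tuple[str, str, str]:
    """Split a `path` value into (project_name, project_release, archive_path).

    Assumes the layout packages/<project_name>/<project_release>/<archive_path>,
    which is inferred from the two sample rows only.
    """
    # Split on the first three slashes; the remainder is the path inside the archive.
    prefix, project_name, project_release, archive_path = path.split("/", 3)
    if prefix != "packages":
        raise ValueError(f"unexpected path layout: {path!r}")
    return project_name, project_release, archive_path


# Fields taken from the second sample row above.
row = {
    "project_name": "rendercv",
    "project_release": "rendercv-1.5-py3-none-any.whl",
    "path": "packages/rendercv/rendercv-1.5-py3-none-any.whl/rendercv/tinytex-release/TinyTeX/texmf-dist/fonts/type1/public/amsfonts/cm/cmssi12.pfm",
    "archive_path": "rendercv/tinytex-release/TinyTeX/texmf-dist/fonts/type1/public/amsfonts/cm/cmssi12.pfm",
}

name, release, archive_path = split_dataset_path(row["path"])
print(name, release, archive_path == row["archive_path"])
```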

SQLite dump of all PyPI metadata

About

This is a SQLite dump of all PyPI metadata fetched from the PyPI API. It is updated daily. It can be downloaded directly from https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz and decompressed with the following command:

$ curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite

Links

URL                                                                                      Size
https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz  1.2 GiB
1 link                                                                                   1.2 GiB

Schema

CREATE table projects(
    id integer not null primary key,
    name text,
    version text,
    author text,
    author_email text,
    home_page text,
    license text,
    maintainer text,
    maintainer_email text,
    package_url text,
    platform text,
    project_url text,
    requires_python text,
    summary text,
    yanked int,
    yanked_reason text,
    classifiers text,
    requires_dist text,
    UNIQUE(name, version)
);

CREATE table urls(
    project_id int,
    url text,
    upload_time text,
    package_type text,
    python_version text,
    requires_python text,
    size int,
    yanked int,
    yanked_reason text,
    foreign key(project_id) REFERENCES projects(id)
);
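
The two tables are linked by urls.project_id, and the UNIQUE(name, version) constraint suggests that each projects row is one release. As a rough illustration, the sketch below joins them with Python's sqlite3 module to total the file sizes per release. It runs against an in-memory database with a trimmed copy of the schema and a few invented rows, so the data is not real. Against the downloaded dump you would connect to pypi-data.sqlite instead and skip the setup.

```python
import sqlite3

# Trimmed copy of the schema above: only the columns this example touches.
SCHEMA = """
CREATE TABLE projects(
    id integer not null primary key,
    name text,
    version text,
    UNIQUE(name, version)
);
CREATE TABLE urls(
    project_id int,
    url text,
    package_type text,
    size int,
    foreign key(project_id) REFERENCES projects(id)
);
"""

# Number of files and their combined size for every (name, version) row.
QUERY = """
SELECT p.name, p.version, COUNT(u.url) AS files, SUM(u.size) AS total_size
FROM projects AS p
JOIN urls AS u ON u.project_id = p.id
GROUP BY p.id
ORDER BY total_size DESC
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO projects(id, name, version) VALUES (?, ?, ?)",
    [(1, "example-a", "1.0"), (2, "example-b", "0.2")],
)
conn.executemany(
    "INSERT INTO urls(project_id, url, package_type, size) VALUES (?, ?, ?, ?)",
    [
        (1, "https://example.invalid/example_a-1.0.tar.gz", "sdist", 1000),
        (1, "https://example.invalid/example_a-1.0-py3-none-any.whl", "bdist_wheel", 1500),
        (2, "https://example.invalid/example_b-0.2.tar.gz", "sdist", 400),
    ],
)

rows = conn.execute(QUERY).fetchall()
for name, version, files, total_size in rows:
    print(name, version, files, total_size)
```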

Repository metadata

About

This dataset contains information about the pypi-data git repositories. The repositories_with_releases.json file contains a list of project names contained within each git repository.

Current Links

URL                                                                              Size
https://github.com/pypi-data/data/raw/main/stats/repositories_with_releases.json  66.7 MiB
https://github.com/pypi-data/data/raw/main/stats/repositories.json                122.4 KiB
2 links                                                                          66.8 MiB

Schema

[
  {
    "name": "pypi-mirror-99",
    "index": 99,
    "stats": {
      "earliest_package": "2021-03-18T05:01:02.831403Z",
      "latest_package": "2021-03-26T01:30:18.278384Z",
      "total_packages": 40000,
      "done_packages": 40000
    },
    "percent_done": 100,
    "size": 1441702912,
    "url": "https://github.com/pypi-data/pypi-mirror-99",
    "packages_url": "https://github.com/pypi-data/pypi-mirror-99/tree/code/packages",
    "projects": {}
  }
]
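
As a small illustration of working with records of this shape, the sketch below totals the package counts and repository sizes across a list of entries. It embeds a shortened copy of the sample record rather than downloading either JSON file. It assumes that size is a byte count, which this page does not state. The summarise function is a hypothetical helper.

```python
import json

# Shortened copy of the sample record above.
SAMPLE = """
[
  {
    "name": "pypi-mirror-99",
    "index": 99,
    "stats": {
      "earliest_package": "2021-03-18T05:01:02.831403Z",
      "latest_package": "2021-03-26T01:30:18.278384Z",
      "total_packages": 40000,
      "done_packages": 40000
    },
    "percent_done": 100,
    "size": 1441702912,
    "projects": {}
  }
]
"""


def summarise(repositories: list[dict]) -> dict:
    """Aggregate package counts and sizes over a list of repository records."""
    return {
        "repositories": len(repositories),
        "total_packages": sum(r["stats"]["total_packages"] for r in repositories),
        "done_packages": sum(r["stats"]["done_packages"] for r in repositories),
        # Assumption: "size" is in bytes, so dividing by 2**30 gives GiB.
        "total_size_gib": sum(r["size"] for r in repositories) / 2**30,
    }


summary = summarise(json.loads(SAMPLE))
print(summary)
```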

Unique Python files

About

This dataset contains one row per unique Python file within every release uploaded to PyPI. Only the SHA256 hash and a random path to the file are provided. This dataset is useful if you want to parse the Python files yourself, but want to avoid parsing the same file multiple times.

Like the main dataset, the unique files dataset should be accessed by downloading the files listed in the following links file:

$ curl -L --remote-name-all $(curl -L "https://github.com/pypi-data/data/raw/main/links/only_python_files.txt")

Current Links

URL                                                                                               Size
https://github.com/pypi-data/data/releases/download/2024-04-30-03-04/unique_python_files.parquet  867.2 MiB
1 link                                                                                            867.2 MiB

Schema

[
  {
    "hash": "\\x9E\\x81\\xB3\\x00\\x063\\x9F\\x8F\\x0CV\\x16^a\\x84Lu\\xE6\\xB8\\xCB*",
    "repository": 65,
    "uploaded_on": "2020-04-09 08:19:58.608"
  },
  {
    "hash": "\\xEF\\xC9r\\x1A\\xFA<\\x07Y0U~\\x1A\\xB3\\x8BOA\\x92\\x097\\x9D",
    "repository": 9,
    "uploaded_on": "2015-02-25 16:50:54.479"
  }
]
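
The idea behind this dataset, processing each distinct file content only once, can be sketched in a few lines. The example below illustrates the principle on in-memory data using Python's hashlib. It is not how the dataset itself was produced, and unique_by_content is a hypothetical helper.

```python
import hashlib
from collections.abc import Iterable, Iterator


def unique_by_content(files: Iterable[tuple[str, bytes]]) -> Iterator[tuple[str, bytes]]:
    """Yield (path, content) for the first file seen with each SHA-256 digest."""
    seen: set[bytes] = set()
    for path, content in files:
        digest = hashlib.sha256(content).digest()
        if digest in seen:
            continue  # identical content was already yielded under another path
        seen.add(digest)
        yield path, content


# Invented example: two empty __init__.py files share the same content.
files = [
    ("a/__init__.py", b""),
    ("b/__init__.py", b""),
    ("a/main.py", b"print('hi')\n"),
]

kept = [path for path, _ in unique_by_content(files)]
print(kept)
```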