Download PyPI

Step 1: Ensure you have space

The repositories currently total 439.4 GB. Make sure you have enough free disk space before continuing.
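
A quick way to check this from the shell is sketched below. The function name has_free_space is hypothetical (it is not part of download.sh), and it treats 1 GB as 1024 * 1024 KB, which slightly overestimates the requirement and so errs on the safe side.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the filesystem that holds directory $1
# has at least $2 GB available.
has_free_space() {
    local dir="$1" required_gb="$2"
    local avail_kb
    # df -P produces POSIX-format output; with -k, column 4 of the second
    # line is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space . 440 || echo "Not enough free space"` checks the current directory against the figure above rounded up. Leaving extra headroom is probably wise, since the quoted size may grow over time.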

Step 2: Clone the repositories

Clone the repositories using the following commands:

wget https://py-code.org/download.sh
chmod +x download.sh
./download.sh pypi_code

This will create a new directory called pypi_code and begin fetching all the data from GitHub. This will take several hours.

download.sh contents:

#!/usr/bin/env bash

if [[ $# -eq 0 ]] ; then
    echo 'Usage: [path]'
    exit 1
fi

mkdir -p "$1"

for url in $(curl https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt); do
    git -C "$1" clone "$url" --depth=1 --no-checkout --branch=code
done
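
Because the download takes hours, it may be useful to resume after an interruption without re-cloning everything. The following is a minimal sketch of one way to do that, not part of the official script: repo_dir_name and clone_missing are hypothetical helpers, and the sketch assumes each clone lands in a directory named after the last path component of its URL (git's default behaviour).

```shell
#!/usr/bin/env bash

# Hypothetical helper: the directory name git picks by default for a clone URL.
repo_dir_name() {
    local url="${1%/}"      # drop a trailing slash, if any
    url="${url%.git}"       # drop a trailing .git, if any
    printf '%s\n' "${url##*/}"
}

# Hypothetical helper: clone every URL read from stdin into directory $1,
# skipping repositories that already have a .git directory there.
clone_missing() {
    local dest="$1" url
    mkdir -p "$dest"
    while IFS= read -r url; do
        [ -z "$url" ] && continue
        if [ -d "$dest/$(repo_dir_name "$url")/.git" ]; then
            echo "skipping $url (already cloned)"
            continue
        fi
        # Same clone options as download.sh.
        git -C "$dest" clone "$url" --depth=1 --no-checkout --branch=code
    done
}
```

It could be used as `curl -fsSL https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt | clone_missing pypi_code`. Note that it only checks whether a clone directory exists, not whether the clone is complete or up to date.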

Step 3: Use the data!

The data is accessible with standard git tooling. For example, to print the contents of every file within the 4suite-xml package, run the following from inside a cloned repository:

git rev-list --no-object-names --objects --filter=object:type=blob --all -- 'packages/4suite-xml/' | git cat-file --batch

To list every object in a repository (files appear alongside their paths), run:

git rev-list --objects --all
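
The commands above operate on a single repository, while the download produces many. The sketch below loops over every clone and prints only the file paths for one package. The function name list_package_files is hypothetical, and the packages/<name>/ layout is taken from the example path above. It filters by object type using git cat-file --batch-check rather than the --filter option, so it does not depend on a recent git version.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the path of every file (blob) under
# packages/<package>/ in each repository cloned beneath directory $1.
list_package_files() {
    local root="$1" package="$2" repo
    for repo in "$root"/*/; do
        [ -d "$repo/.git" ] || continue
        # rev-list prints "<sha> <path>" per object; cat-file replaces the
        # sha with the object type, and sed keeps only blob entries.
        git -C "$repo" rev-list --objects --all -- "packages/$package/" |
            git -C "$repo" cat-file --batch-check='%(objecttype) %(rest)' |
            sed -n 's/^blob //p'
    done
}
```

For example, `list_package_files pypi_code 4suite-xml` would print one path per line for that package, assuming the clones live in pypi_code.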

There is also a dataset of all the unique Python files available for download. See here for more information.