Step 1: Ensure you have space
The current size of all the repositories is 439.4 GB. Make sure you have enough space on your machine before continuing.
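A quick way to check this before starting is sketched below. The helper name free_gib and the 440 GiB threshold are illustrative assumptions (439.4 GB rounded up, with no safety margin), not part of the project's tooling. The script checks the filesystem that holds the current directory.

```shell
#!/usr/bin/env bash
# free_gib DIR: print the free space, in whole GiB, on the filesystem holding DIR.
# (Hypothetical helper, not shipped with download.sh.)
free_gib() {
    # -P forces the one-line-per-filesystem POSIX layout; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

required_gib=440   # assumption: 439.4 GB rounded up
available_gib="$(free_gib .)"

if [ "$available_gib" -ge "$required_gib" ]; then
    echo "OK: ${available_gib} GiB free"
else
    echo "Not enough space: ${available_gib} GiB free, about ${required_gib} GiB needed"
fi
```

Run it from the directory where you plan to create pypi_code, since df reports on the filesystem that contains the path it is given.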
Step 2: Clone the repositories
Clone the repositories using the following commands:
wget https://py-code.org/download.sh
chmod +x download.sh
./download.sh pypi_code
This will create a new directory called pypi_code and begin fetching all the data from GitHub. This will take several hours.
download.sh contents:
#!/usr/bin/env bash
if [[ $# -eq 0 ]] ; then
    echo 'Usage: [path]'
    exit 1
fi
mkdir -p "$1"
for url in $(curl https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt); do
    git -C "$1" clone "$url" --depth=1 --no-checkout --branch=code
done
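Because the download runs for hours, it may be interrupted partway through. The following is a minimal sketch of a resumable variant, assuming the repository list has first been saved to a local file. The function name clone_missing and the file name repositories.txt are hypothetical and not part of download.sh. It skips any repository whose directory already exists and otherwise uses the same clone options as the script above.

```shell
#!/usr/bin/env bash
# clone_missing DEST LIST
# Clone every URL listed in LIST (one per line) into DEST, skipping
# repositories that are already present. Illustrative sketch only.
clone_missing() {
    local dest="$1" list="$2" url name
    mkdir -p "$dest"
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue              # ignore blank lines
        name="$(basename "$url" .git)"          # directory name git clone derives from the URL
        if [ -d "$dest/$name/.git" ]; then
            echo "skipping $name (already cloned)"
            continue
        fi
        # Same options as download.sh; report a failure and keep going.
        git -C "$dest" clone "$url" --depth=1 --no-checkout --branch=code \
            || echo "failed: $url" >&2
    done < "$list"
}

# Example usage (commented out because it starts the full download):
# curl -fsSL https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt -o repositories.txt
# clone_missing pypi_code repositories.txt
```

Re-running the function after an interruption only fetches the repositories that are still missing.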
Step 3: Use the data!
The data is accessible with standard git tooling. To print the contents of every file within the 4suite-xml package, you could run the following from inside the cloned repository that contains it:
git rev-list --no-object-names --all --objects --filter=object:type=blob -- 'packages/4suite-xml/' | git cat-file --batch
To list every object in a repository, with file paths shown next to the objects that have one, run:
git rev-list --objects --all
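If you only want file paths and sizes rather than full contents, the two commands above can be combined with git cat-file --batch-check. The sketch below is one possible approach. The function name list_package_files and its output format are illustrative assumptions. It filters on object type after the fact instead of using --filter, so it should also work with git versions that predate the object:type filter.

```shell
#!/usr/bin/env bash
# list_package_files REPO PREFIX
# Print "<size-in-bytes> <path>" for every blob under PREFIX that is
# reachable from any ref in REPO. Illustrative sketch only.
list_package_files() {
    local repo="$1" prefix="$2"
    # rev-list emits "<sha> <path>" lines; %(rest) echoes the path back,
    # and sed keeps only the lines describing blobs (files).
    git -C "$repo" rev-list --objects --all -- "$prefix" \
        | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
        | sed -n 's/^blob //p'
}

# Example usage, with a placeholder path for one of the cloned repositories:
# list_package_files pypi_code/<repository> 'packages/4suite-xml/'
```

Commits and trees are dropped by the sed step, so each output line corresponds to one file.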
There is also a dataset of all the unique Python files available for download. See here for more information.