Removing LFS from a repository

Background

My final-year project involved processing large datasets into pickle files readable by a GRU4Rec-based recommender system.

Originally, the repository (~2.5 GB) was hosted on my school's GitLab server, and I used Git Large File Storage (LFS) to manage the datasets. However, when I tried to migrate the repository to GitHub, I realized it exceeded GitHub's LFS limit of 1 GB. Unwilling to buy extra storage, I turned to Google and Stack Overflow.

It would have been easier to create a fresh repository and push only the relevant files to GitHub, but I wanted to use the opportunity to learn more about Git and LFS. This post is a record of what I did, mostly for my own future reference.

Removing LFS files

1. Clone the repository and change into its directory.

git clone <repo.git>
cd <repo>/

2. Remove known files from the repo's history.

Suppose A.pkl and B.pkl have to be removed. This command removes both of them in one go:

git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch A.pkl B.pkl" \
--prune-empty --tag-name-filter cat -- --all

To remove an entire folder, add the -r flag to git rm on the second line and pass the folder path instead of file names. This removes the folder and everything in it, nested folders included, from every commit. Double-check your paths before hitting enter.

Reference
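To see the -r variant in action without touching a real repository, here is a small sketch that builds a throwaway repo and strips a folder from its history. The datasets/ folder, the file names and the demo variable are made up for illustration. FILTER_BRANCH_SQUELCH_WARNING only silences the warning and pause that newer Git versions print before running filter-branch.

```shell
# Scratch demo: remove a whole folder from history with `git rm -r`.
# The repo, the datasets/ folder and the file names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and pause
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main    # name the initial branch "main"
git config user.name "Demo"
git config user.email "demo@example.com"

mkdir -p datasets/nested
echo "big" > datasets/a.pkl
echo "bigger" > datasets/nested/b.pkl
echo "print('hi')" > train.py
git add .
git commit -q -m "Add code and datasets"

# Same command as step 2, with -r and a folder path instead of file names.
git filter-branch --force --index-filter \
  "git rm -r --cached --ignore-unmatch datasets" \
  --prune-empty --tag-name-filter cat -- --all

# Files left in the rewritten branch; nothing under datasets/ should be listed.
git ls-tree -r --name-only main
```

Running git log main -- datasets afterwards should print nothing, which is a quick way to confirm the folder is gone from the rewritten branch.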

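3. The removed files still take up space in large .pack files, because filter-branch keeps backup refs under refs/original and the reflog still points at the old commits. Let's clean them up.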

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --aggressive --prune=now

Reference
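If you want to check that the cleanup actually frees space, you can compare the size of .git before and after. The sketch below does this on a throwaway repo; big.pkl, the 2 MiB size and the variable names are assumptions for the demo, and du is just one rough way to measure.

```shell
# Scratch demo: measure .git before and after the cleanup from step 3.
# big.pkl and its size are hypothetical stand-ins for a real dataset.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and pause
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"
git config user.email "demo@example.com"

# 2 MiB of random bytes; random data barely compresses, so it stays large.
head -c 2097152 /dev/urandom > big.pkl
echo "notes" > README.md
git add .
git commit -q -m "Add dataset"

git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch big.pkl" \
  --prune-empty --tag-name-filter cat -- --all

# The blob is gone from the branch but is still stored in .git.
size_before=$(du -sk .git | cut -f1)

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --quiet --aggressive --prune=now

size_after=$(du -sk .git | cut -f1)
echo "before: ${size_before} KiB, after: ${size_after} KiB"

# Git's own summary of loose and packed object sizes.
git count-objects -vH
```

On a real repository, running git count-objects -vH before and after step 3 gives the same kind of comparison without the scratch setup.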

Finding large files throughout the repository's history

Is the repository still huge? Find the largest object in the packed history with:

$ git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -1
> 56f11b82847fac0fd94d7dfed7980f7bb6270e70 blob   89095702 17777154 824227216

The first column, e.g. 56f11b82847fac0fd94d7dfed7980f7bb6270e70, is the SHA-1 hash of the object, and the third column (89095702) is its size in bytes. To find the file that hash belongs to, run

$ git rev-list --objects --all | grep 56f11b82847fac0fd94d7dfed7980f7bb6270e70
> 56f11b82847fac0fd94d7dfed7980f7bb6270e70 very_large_dataset.pkl

So very_large_dataset.pkl is another large pickle file that we missed. Remove it by repeating steps 2 and 3 above.

Reference
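The two lookups above can also be combined into one pass that prints sizes and file names together. This is only a sketch: largest_blobs is a helper name I made up, and it swaps git verify-pack for git cat-file --batch-check, which also sees objects that haven't been packed yet. The scratch repo at the bottom is there just to show the output shape.

```shell
# largest_blobs N: print the N largest blobs reachable from any ref,
# one per line as "<size in bytes> <sha> <path>". Hypothetical helper.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -k1,1nr |
    head -n "${1:-10}"
}

# Scratch repo to try it on; the file names and sizes are made up.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo"
git config user.email "demo@example.com"
head -c 5000 /dev/zero > very_large_dataset.pkl
echo "small" > notes.txt
git add .
git commit -q -m "Add files"

# Show the single largest blob in the history.
largest_blobs 1
```

In a real repository, largest_blobs 20 would list the twenty biggest files ever committed, which makes it easier to collect all the paths for step 2 in one go.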

End

Once you're happy with the state of the repository, push it to the remote. In my case, I made a new repository on GitHub and pushed it there.

git remote add github <repo.git>
git push -u github main

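If you push to an existing repository instead, the rewritten history no longer matches what is on the remote, so you will need a force push (git push --force), which overwrites the remote's history.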
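For the existing-repository case, here is a small sketch of what the force push looks like. It uses a local bare repository as a stand-in for GitHub, and git commit --amend as a stand-in for the filter-branch rewrite; the paths and file names are made up.

```shell
# Scratch demo: overwrite a remote branch after rewriting local history.
# remote.git is a local bare repo standing in for the real remote.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"
git config user.email "demo@example.com"

echo "v1" > file.txt
git add .
git commit -q -m "Original history"
git remote add github "$work/remote.git"
git push -q -u github main

# Rewrite history locally (amend stands in for filter-branch here).
git commit -q --amend -m "Rewritten history"

# A plain push would be rejected as non-fast-forward;
# --force replaces the remote branch with the local one.
git push -q --force github main

# The remote branch now points at the rewritten commit.
git ls-remote github main
```

A more cautious variant is git push --force-with-lease, which refuses to overwrite the remote branch if someone else has pushed to it since your last fetch.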