Your Git Repository Is a Hoarder: A Complete Guide to Fixing It


Software Engineer fascinated by what makes software work beneath the surface. I’m working in DevOps, automation, infrastructure, testing, and keen on making the small technical decisions that shape how products evolve. Here, I share the best practices, experiments, insights, and reflections I’ve gathered on building software that’s reliable, scalable, and human-friendly.

Every git clone taking minutes. CI pipelines burning cash downloading dead weight. Developers frustrated during every clone. Sound familiar?

Here's the uncomfortable truth: that binary someone committed years ago? It's still there. Git never forgets. Even after deletion, every clone and every fresh CI checkout still downloads it. Forever. Here's how to fix it.


Repository Bloat Is Inevitable

Git works best with small text files. Any repository with enough history and contributors will accumulate binaries, dependencies, and artifacts. It's not a question of if, but when.

What ends up in repositories:

  • Compiled binaries (build outputs, executables)

  • Build artifacts (node_modules/, .terraform/, __pycache__/)

  • Large test fixtures (database dumps, media files)

  • Files committed before .gitignore rules existed

  • "Temporary" files that became permanent

GitHub's built-in protection is useless for prevention:

  • Warning shown: 50MB

  • Hard block: 100MB

You can commit 50 files at 80MB each. That's 4GB of bloat, and nothing gets blocked: the worst you see is a warning.
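The arithmetic is easy to check. A minimal sketch, assuming the per-file limits quoted above (warning above 50MB, rejection above 100MB); `github_verdict` is a hypothetical helper for illustration, not a GitHub tool:

```shell
# Classify a single file by its size in MB, using the limits quoted above.
github_verdict() {
  size_mb=$1
  if [ "$size_mb" -gt 100 ]; then
    echo "blocked"
  elif [ "$size_mb" -gt 50 ]; then
    echo "warning"
  else
    echo "accepted"
  fi
}

# 50 files at 80MB each: no single file is rejected, yet the total is 4000MB.
files=50
size_mb=80
echo "per-file verdict: $(github_verdict "$size_mb")"
echo "total pushed: $((files * size_mb))MB"
```

The limit is per file, so the total never enters the decision.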


The Three-Pillar Solution

  1. CI Checks: Block large files at PR time (prevent future bloat)

  2. Git LFS: Handle legitimate large files properly

  3. History Cleanup: Remove existing bloat from git history


Pillar 1: CI Checks That Actually Work

Git hooks are client-side. Optional. Bypassable with --no-verify. Every new developer has to remember to set them up.

CI checks are server-side. Mandatory. No one merges without passing.

CI wins. No contest.

  • Developer sets up correctly → Both block

  • Developer forgets setup → Git hooks fail, CI still blocks

  • Developer runs --no-verify → Git hooks bypassed, CI still blocks

  • New developer joins → Git hooks need setup, CI is automatic

The Decision Flowchart

When a file triggers the check:

Is this file supposed to be in the repo?
├── NO → Remove it from git history
└── YES
    ├── Is it a binary/media file? → Use Git LFS
    ├── Is it code/config that naturally grows? → Request exclusion
    └── Can it be compressed/split? → Reduce file size
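
The same decision can be written as a small function, which is handy if the CI message should suggest an action. This is only a sketch: `triage` and its argument names are made up for illustration.

```shell
# Map the flowchart's questions to an action.
# Usage: triage <belongs: yes|no> [kind: binary|growing|compressible]
triage() {
  belongs=$1
  kind=${2:-}
  if [ "$belongs" = "no" ]; then
    echo "remove from git history"
    return 0
  fi
  case "$kind" in
    binary)       echo "use Git LFS" ;;
    growing)      echo "request exclusion" ;;
    compressible) echo "reduce file size" ;;
    *)            echo "unknown kind: $kind" >&2; return 1 ;;
  esac
}

triage no            # prints: remove from git history
triage yes binary    # prints: use Git LFS
```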

The Implementation

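// In your Jenkinsfile
pipeline {
    agent any
    stages {
        stage('Check Large Files') {
            steps {
                script {
                    def maxSizeKB = 10
                    def maxSizeBytes = maxSizeKB * 1024

                    // Get commit range (assumes origin/master has already been fetched)
                    def base = sh(script: 'git merge-base HEAD origin/master', returnStdout: true).trim()

                    // Scan all changed files in one command (field 4 of each --raw line is the new blob SHA)
                    // Get sizes for all blobs at once
                    def sizes = sh(returnStdout: true, script: """
                        git log --pretty=format: --raw --no-abbrev ${base}..HEAD |
                          awk '/^:/ && \$4 !~ /^0+\$/ { print \$4 }' | sort -u |
                          git cat-file --batch-check='%(objecttype) %(objectsize)'
                    """).trim()

                    // Lines that are not "blob <size>" (e.g. "<sha> missing" for submodules) are ignored
                    def largeFilesFound = false
                    for (line in sizes.split('\n')) {
                        def parts = line.split(' ')
                        if (parts[0] == 'blob' && parts[1].toLong() >= maxSizeBytes) {
                            largeFilesFound = true
                        }
                    }

                    // Block PR if any file exceeds threshold
                    if (largeFilesFound) {
                        error("Files >= ${maxSizeKB}KB found. Use Git LFS or remove from history.")
                    }
                }
            }
        }
    }
}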

Git Commands Used

1. Get current commit SHA

git rev-parse HEAD

Example output:

e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3

2. Find the default branch

git remote show origin | sed -n 's/.*HEAD branch: //p'

Example output:

master

3. Find the merge base (common ancestor)

git merge-base HEAD origin/master

Example output:

a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

4. Scan commit history for changed files

git log --pretty=format:'COMMIT %H' --raw --no-abbrev origin/master..HEAD

Example output (blob SHAs shortened for readability):

COMMIT e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3
:100644 100644 abc1234 def5678 M    src/config.yaml
:000000 100644 0000000 ghi9012 A    assets/image.png

COMMIT b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1
:100644 000000 jkl3456 0000000 D    old-file.txt

One call → all commits, all changed files, all blob SHAs. No multiple tree traversals.
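
To make the output format concrete, here is one way to pull the new blob SHA, commit, and path out of it. This is a sketch and not necessarily how the check above does it; `changed_blobs` is a name chosen for illustration. It relies on git separating the metadata from the path with a tab in real `--raw` output.

```shell
# Read `git log --pretty=format:'COMMIT %H' --raw --no-abbrev <base>..HEAD`
# output on stdin and print "<blob-sha> <commit-sha> <path>" for every file
# that was added or modified.
changed_blobs() {
  awk -F '\t' '
    /^COMMIT / { commit = substr($0, 8); next }
    /^:/ {
      split($1, meta, " ")          # :oldmode newmode oldsha newsha status
      if (meta[4] ~ /^0+$/) next    # all-zero new SHA: the file was deleted
      print meta[4], commit, $NF    # $NF is the path (new path for renames)
    }
  '
}

# Typical invocation inside a repository:
#   git log --pretty=format:'COMMIT %H' --raw --no-abbrev <base>..HEAD | changed_blobs
```

Deleted files are skipped because they add no new blob to download.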

5. Get file sizes in batch

echo 'def5678
ghi9012' | git cat-file --batch-check='%(objectname) %(objectsize)'

Example output:

def5678 2048
ghi9012 512000

One call → all blob sizes. No need to call git cat-file -s per file.
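
Applying the threshold to that output is then a one-liner. A minimal sketch, assuming the "<sha> <size>" format shown above; `over_limit` is a hypothetical helper name:

```shell
# Read "<sha> <size>" lines on stdin and print the entries whose size in
# bytes is at or above the limit. Non-numeric sizes (such as "missing")
# are skipped.
over_limit() {
  limit_bytes=$1
  awk -v limit="$limit_bytes" '
    $2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { printf "%s %.2fKB\n", $1, $2 / 1024 }
  '
}

# The two sizes from the example, checked against a 10KB threshold:
printf 'def5678 2048\nghi9012 512000\n' | over_limit $((10 * 1024))
# prints: ghi9012 500.00KB
```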

6. Remove a file from history

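git rebase -i master
# Mark the commit that added the file with 'edit' instead of 'pick'
git rm --cached <filename>
git commit --amend --no-edit
git rebase --continue
# --force-with-lease refuses to overwrite commits someone else pushed meanwhile
git push --force-with-lease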

What Happens When It Fails

When the check detects large files, it blocks the PR:
❌ Large File Detection — failed

Files >= 10KB found in commit history.

Detected files:
  - assets/video-demo.mp4 (150.00KB) [new file] [commit: e4f5a6b7]
  - config/settings.yaml (19.53KB) [new file] [commit: b2c3d4e5]
  - docs/guide.md (13.55KB) [size increased] [commit: a1b2c3d4]
  - data/seed.json (83.03KB) [new file] [commit: c4d5e6f7]

What to do:
  - File type should be allowed? → Add to excludePatterns
  - File shouldn't exist in repo? → Remove from history
  - File can be compressed/split? → Reduce size, re-add
  - File must remain large? → Use Git LFS

Default Exclusions

Code and dependency files that naturally grow large should be excluded automatically, for example:

Code files: .py, .rb, .go, .sh, .ts, .tsx, .js, .jsx, .sql

Lock/dependency files: go.sum, requirements.txt, package-lock.json, yarn.lock, Gemfile.lock, poetry.lock, composer.lock, uv.lock
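
One way to express these defaults is a shell `case` on the file name. This is only a sketch: how `excludePatterns` is actually implemented is not shown here, and `is_excluded` is a hypothetical helper.

```shell
# Return success (0) when a path matches the default exclusions listed above.
is_excluded() {
  case "$(basename "$1")" in
    *.py|*.rb|*.go|*.sh|*.ts|*.tsx|*.js|*.jsx|*.sql) return 0 ;;
    go.sum|requirements.txt|package-lock.json|yarn.lock) return 0 ;;
    Gemfile.lock|poetry.lock|composer.lock|uv.lock) return 0 ;;
    *) return 1 ;;
  esac
}

for path in src/app.ts frontend/package-lock.json assets/video-demo.mp4; do
  if is_excluded "$path"; then
    echo "skip  $path"
  else
    echo "check $path"
  fi
done
```

Matching on the base name means the exclusions apply at any directory depth.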


Pillar 2: Git LFS

Git LFS replaces large files with tiny pointer files (~150 bytes) while storing the actual content on a separate server.
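
A pointer file is just three lines of text: the spec version, the SHA-256 of the real content, and its size in bytes. The sketch below rebuilds one by hand to show the format. `lfs_pointer` is a hypothetical helper, and it assumes `sha256sum` is installed (GNU coreutils); in practice Git LFS writes these files for you.

```shell
# Print the Git LFS pointer text for a file.
lfs_pointer() {
  file=$1
  oid=$(sha256sum "$file" | cut -d ' ' -f 1)     # content hash
  size=$(wc -c < "$file" | tr -d ' ')            # size in bytes
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Typical invocation:
#   lfs_pointer assets/video-demo.mp4
```

Git stores only this text in the repository, which is why clones stay small no matter how large the tracked file is.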

The Options

  • GitHub LFS: Zero infrastructure, secure by default, excellent docs. But costly at scale ($5/50GB, every CI clone counts against bandwidth).

  • Rudolfs: Fast (Rust), cheap (your S3), lightweight. But limited security: no auth by default, so you are left with basic auth or a VPN.

  • Giftless: Flexible security (JWT, GitHub auth), multipart uploads, multi-cloud. Heavier (Python), more complex config.

Why We Chose Giftless

Security flexibility was the deciding factor. Giftless supports GitHub authentication natively, meaning developers use their existing gh CLI tokens and Jenkins uses its existing GitHub App token. No new credentials to manage. No VPN friction.

Plus: multipart uploads. A large file upload that fails at 90% with Rudolfs means starting over. With Giftless, you retry just the failed chunk.

The GitHub Auth Setup

Developer Setup (2 minutes, once):

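brew install gh                    # Install gh CLI
gh auth login                      # Login (opens browser)
# Git expects "username=..." and "password=..." lines from a credential helper,
# so wrap the token. The username here is a placeholder (assumption).
git config --global credential.helper \
  '!f() { test "$1" = get && printf "username=x-access-token\npassword=%s\n" "$(gh auth token)"; }; f'

# Done. Just use git normally.
git push  # ← Just works, no VPN needed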

Jenkins Setup: Zero changes. It uses the existing GitHub App token.

Giftless Config

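# Key names follow Giftless's AUTH_PROVIDERS convention; verify against your Giftless version
AUTH_PROVIDERS:
  - factory: giftless.auth.github:factory
    # Validates tokens against GitHub API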

Pillar 3: History Cleanup with git-filter-repo

The Options

  • git filter-branch: Legacy, and officially discouraged by the Git project. Extremely slow, complex syntax, safety issues. Don't use it.

  • BFG Repo-Cleaner: Limited scope. Great for bulk removal by size or filename, but it matches filenames only, not full paths. Less precise.

  • git-filter-repo: The right choice. 10-50x faster, precise path-based removal, actively maintained, officially Git-recommended.

git-filter-repo handles everything: specific paths, patterns, entire directories. It's what the Git project recommends.

Installation

pip install git-filter-repo
# or
brew install git-filter-repo

Usage

# Remove specific paths
git filter-repo --path path/to/file --invert-paths

# Remove from a list
git filter-repo --paths-from-file paths_to_remove.txt --invert-paths

# Remove by pattern
git filter-repo --path-glob '*.log' --invert-paths

Analyze First: git-sizer

Before cleaning, understand what's causing bloat:

brew install git-sizer
git-sizer --verbose
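
git-sizer reports summary statistics. To get concrete paths, a common complement is to list every object in history with its size and keep the largest blobs. A minimal sketch; `largest_blobs` is a hypothetical helper that only does the filtering and sorting:

```shell
# Read "<type> <size> <path>" lines on stdin, keep blobs, and print the N
# largest as "<size> <path>", biggest first.
largest_blobs() {
  count=${1:-10}
  awk '$1 == "blob" { size = $2; sub(/^[^ ]+ [^ ]+ /, ""); print size, $0 }' |
    sort -k1,1 -n -r | head -n "$count"
}

# Typical invocation inside a repository:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
#     largest_blobs 20
```

The resulting paths are natural candidates for the file passed to --paths-from-file.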

What to Target for Cleanup

Common culprits that bloat repositories:

  • Large binaries: Compiled executables, build outputs accidentally committed

  • Build artifacts: Output directories, plugin files

  • .gitignore violations: node_modules/, .terraform/, __pycache__/, venv/

  • Sensitive state files: *.tfstate files (security risk: may contain secrets!)

  • IDE configs: .idea/, .vscode/

  • Stale branches
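
The path-based culprits above can be turned into a removal list mechanically. The sketch below filters a list of paths down to those matching the directories and state files named in this section. `cleanup_candidates` and its regular expression are illustrative assumptions, so review the output before feeding it to git-filter-repo.

```shell
# Read one path per line on stdin and print those that match the usual
# culprits at any directory depth.
cleanup_candidates() {
  grep -E '(^|/)(node_modules|\.terraform|__pycache__|venv|\.idea|\.vscode)/|\.tfstate$'
}

# Typical invocation: list every path ever committed, keep the culprits,
# then pass the file to git filter-repo --paths-from-file ... --invert-paths
#   git log --all --pretty=format: --name-only | sort -u |
#     cleanup_candidates > paths_to_remove.txt
```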


Hours With GitHub Support, Summarized in Minutes

PR References Block Everything

"A single reference anywhere will prevent garbage collection from being able to purge the data completely." - GitHub Support

After rewriting history and force-pushing:

  • Merged PRs continue to hold references to old commits

  • Old commit content remains stored on GitHub's servers

  • Repository size on GitHub won't decrease

  • "Files changed" tab in old PRs still displays original large files

Your Options for PR References

GitHub Support offers two approaches:

  1. Delete PRs entirely: Removes everything, including code review history

  2. Delete only tracking references: Preserves PR body/comments, but code views show errors

Retention Timeline

"There isn't a fixed timeline. GitHub runs garbage collection periodically, not continuously." - GitHub Support

Treat ~30 days as an approximation, not a guarantee. To avoid waiting, contact support after the rewrite and request an immediate cleanup.

Conclusion for the Process

  1. Rewrite history with git-filter-repo

  2. Force push

  3. Provide GitHub Support with SHAs of commits that introduced large files

  4. They identify all remaining references (including PRs)

  5. They handle PR references

  6. They run garbage collection and clear web cache
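
For step 3 you need the SHAs from the old history, so collect them before the rewrite or from the backup. A sketch of one way to do it; `introducing_commits` is a hypothetical helper that reads the same `--raw` log format used by the CI check:

```shell
# Read `git log --pretty=format:'COMMIT %H' --raw --no-abbrev --all` output on
# stdin and print the SHA of every commit that added the given path.
introducing_commits() {
  awk -F '\t' -v target="$1" '
    /^COMMIT / { commit = substr($0, 8); next }
    /^:/ {
      n = split($1, meta, " ")      # last metadata field is the status letter
      if (meta[n] == "A" && $NF == target) print commit
    }
  '
}

# Typical invocation inside the pre-cleanup repository:
#   git log --pretty=format:'COMMIT %H' --raw --no-abbrev --all |
#     introducing_commits assets/video-demo.mp4
```

For a single path, `git log --all --diff-filter=A --format=%H -- <path>` gives the same answer directly.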


Key Considerations About Clone Types

git clone (what developers/CI use):

  • Does NOT fetch PR refs

  • Does NOT download old PR-only objects

  • Clone size smaller after cleanup

git clone --mirror (for backups/migrations):

  • Fetches everything including PR refs

  • Downloads old PR-only objects

  • Clone size stays large

Developers and CI use normal git clone. They benefit immediately.

GitHub's storage stays large because of PR refs, but your team doesn't download that weight.
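
You can see the difference by looking at what the remote advertises. GitHub exposes pull requests under refs/pull/, while a default clone only fetches branches and tags. A small sketch; `count_refs` is a hypothetical helper that tallies `git ls-remote` output:

```shell
# Read `git ls-remote` output ("<sha><TAB><ref>") on stdin and count the refs
# a default clone fetches (branches and tags) versus pull-request refs.
count_refs() {
  awk -F '\t' '
    index($2, "^{}") > 0            { next }      # skip peeled tag entries
    $2 ~ /^refs\/pull\//            { pr++ }
    $2 ~ /^refs\/(heads|tags)\//    { cloned++ }
    END { printf "cloned=%d pull=%d\n", cloned, pr }
  '
}

# Typical invocation:
#   git ls-remote origin | count_refs
```

A large pull count with a small cloned count is exactly the situation described above: heavy on GitHub's side, light for everyone who clones.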


Backup Strategy

Always backup before cleanup:

  • Remote branch (git push origin master:backup/pre-cleanup-master): Quick reference, delete after verification

  • Mirror to S3 (git clone --mirror → tar → S3): Full disaster recovery

# Remote backup branch (unaffected by the later force-push to master)
git push origin master:backup/pre-cleanup-master

# Full mirror backup (--mirror already implies --bare)
git clone --mirror git@github.com:org/repo.git
tar -czf repo-backup.tar.gz repo.git
aws s3 cp repo-backup.tar.gz s3://backups/


Re-Clone Warning

⚠️ Every developer must delete their local copy and re-clone.

Trying to "fix" an old local copy after a history rewrite is the #1 cause of accidentally re-introducing deleted files.

If someone pushes from an old clone, all that bloat comes right back. This cannot be mitigated from GitHub's side.


TL;DR

  1. Prevent future bloat: CI checks (not git hooks)

  2. Handle large files: Giftless + S3 (flexible auth, multipart uploads)

  3. Clean existing bloat: git-filter-repo + GitHub Support for GC

  4. Accept the trade-off: GitHub storage may not decrease due to PR refs, but developer clones will be dramatically faster

  5. Re-clone is mandatory

The hardest part isn't the technical implementation. It's getting everyone to actually delete their local clones.


Questions or need help with your setup? Drop a comment. Happy to discuss!