Git: find the largest commits

Time to go commit mining!

Recently, I was working in a new repository and found the git blame output often pointed back to a large repository-wide formatting commit (applying Black to all Python files). To ignore this commit in git blame, I added its SHA to a .git-blame-ignore-revs file, per this documentation:

# Format everything with Black
55c0bf219272801586b04c5be691e3aedcfc7254
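
As a rough sketch of that file format (a `#` comment line followed by a full commit SHA), here is a small helper for building entries. The name `format_ignore_entry` is hypothetical, and the 40-character check assumes a SHA-1 repository. To have git blame read the file automatically, Git also offers the `blame.ignoreRevsFile` configuration option.

```python
import re


def format_ignore_entry(sha, comment):
    """Return a comment line plus a SHA line for a blame-ignore file.

    Hypothetical helper for illustration; assumes full SHA-1 hashes.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError(f"expected a full 40-character SHA-1, got {sha!r}")
    return f"# {comment}\n{sha}\n"


entry = format_ignore_entry(
    "55c0bf219272801586b04c5be691e3aedcfc7254",
    "Format everything with Black",
)
print(entry, end="")
```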

While writing the .git-blame-ignore-revs file, I started wondering whether there were any other large commits worth blame-ignoring.

Finding such commits is not straightforward with Git, as there’s no command to list commits sorted by the number of lines they change. Hence, I wrote the script below, which parses the output of git log to count each commit’s inserted and deleted lines and sorts by that total. Run it with Python (3.7+, I think) to list all non-merge commits reachable from the current commit, largest first. For example, in Django’s repository:

$ python git_largest_commits.py
Changes SHA                                             Subject
285412  de8565e1c48f1c386a7b256e1ae585cbd8ff11b2        Removed app translation strings from core translation files.
257061  f27a4ee3270bd57299ce02d622978ac4d839137e        Removed django.contrib.localflavor.
235861  9c19aff7c7561e3a82978a272ecdaad40dda5c00        Refs #33476 -- Reformatted code with Black.
232590  efa67b897b6ed5c6bbee1aa2646f4ba7ea6e2bc2        Fetched translations from Transifex
164493  7be43c910abbf538bba65cc8304896bdd1ba1d37        Added new translation files to localflavor contrib app.
...

Because the output is long, you will probably want to pipe it into less to avoid swamping your terminal:

$ python git_largest_commits.py | less

Here’s the script:

"""
List Git commits reachable from the current commit, sorted by the number of
changes they made, largest first.

https://adamj.eu/tech/2025/07/20/git-find-largest-commits/
"""

import math
import os
import re
import subprocess
import sys


def main():
    result = subprocess.run(
        ["git", "log", "--pretty=format:%H\t%s", "--shortstat", "--no-merges"],
        capture_output=True,
        text=True,
        # No check=True: it would raise CalledProcessError before the
        # return code check below could report the failure.
    )
    if result.returncode != 0:
        print(result.stdout)
        print(result.stderr, file=sys.stderr)
        return result.returncode

    commit_details = []
    lines = result.stdout.splitlines()
    i = 0
    while i < len(lines):
        commit_line = lines[i]
        if i + 1 < len(lines) and lines[i + 1].startswith(" "):
            stats_line = lines[i + 1]
            i += 3  # move past commit, stats, and blank lines
        else:
            # Empty commit
            stats_line = ""
            i += 1  # move past commit line only

        total_changes = 0
        if stats_line:
            matches = re.findall(r"(\d+) (?:insertion|deletion)", stats_line)
            total_changes = sum(int(match) for match in matches)
        commit_details.append((total_changes, commit_line))

    if not commit_details:
        print("No commits found.", file=sys.stderr)
        return 1

    commit_details.sort(key=lambda x: x[0], reverse=True)

    # Calculate width based on largest number of changes
    max_changes = commit_details[0][0]
    if max_changes == 0:
        width = 7  # "Changes"
    else:
        num_digits = len(str(max_changes))
        width = math.ceil(num_digits / 3) * 3

    sha_width = len(commit_details[0][1].split("\t")[0])

    # Format and output
    try:
        print(f"{'Changes':<{width}}\t{'SHA':<{sha_width}}\tSubject")
        for changes, commit in commit_details:
            print(f"{changes:{width}d}\t{commit}")
        sys.stdout.flush()
    except BrokenPipeError:
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())

    return 0


if __name__ == "__main__":
    raise SystemExit(main())

The script takes these steps:

  1. Run git log with specific options to output commit hashes, subjects, and short statistics. The output looks like:

    d63241ebc7067fdebbaf704989b34fcd8f26bbe9        Fixed #15727 -- Added Content Security Policy (CSP) support.
     26 files changed, 1192 insertions(+), 1 deletion(-)
    
    3f59711581bd22ebd0f13fb040b15b69c0eee21f        Fixed #36366 -- Improved accessibility of pagination in the admin.
     9 files changed, 118 insertions(+), 33 deletions(-)
    

    There’s one wrinkle: empty commits only display the commit hash and subject, without the statistics line or the blank line that follows it:

    0f94972033f4b27be6c902a6764c5d3d802ddea2        Example empty commit
    d63241ebc7067fdebbaf704989b34fcd8f26bbe9        Fixed #15727 -- Added Content Security Policy (CSP) support.
     26 files changed, 1192 insertions(+), 1 deletion(-)
    
  2. Parse the output, summing the insertions and deletions to get the total number of changes for each commit.

  3. Sort by the total number of changes, largest first.

  4. Output the results in a tabular format, with some calculation to ensure columns are aligned.

    The BrokenPipeError handling here prevents errors when piping into less or similar commands, per my previous post.
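
Steps 1 to 3 can be tried in isolation, without a repository, using a sketch like the one below. It reuses the script's regular expression and its empty-commit check, but `SAMPLE`, `count_changes`, and `parse_log` are names made up for this illustration, and the sample text is assembled from the example output above rather than from a real git log run.

```python
import re

# Sample text in the same shape as the git log output shown above,
# including one empty commit with no statistics line.
SAMPLE = (
    "0f94972033f4b27be6c902a6764c5d3d802ddea2\tExample empty commit\n"
    "d63241ebc7067fdebbaf704989b34fcd8f26bbe9\tFixed #15727 -- Added Content Security Policy (CSP) support.\n"
    " 26 files changed, 1192 insertions(+), 1 deletion(-)\n"
    "\n"
    "3f59711581bd22ebd0f13fb040b15b69c0eee21f\tFixed #36366 -- Improved accessibility of pagination in the admin.\n"
    " 9 files changed, 118 insertions(+), 33 deletions(-)\n"
)


def count_changes(stats_line):
    """Sum the insertion and deletion counts from a --shortstat line."""
    matches = re.findall(r"(\d+) (?:insertion|deletion)", stats_line)
    return sum(int(match) for match in matches)


def parse_log(text):
    """Return a list of (changes, sha, subject) tuples."""
    lines = text.splitlines()
    commits = []
    i = 0
    while i < len(lines):
        sha, _, subject = lines[i].partition("\t")
        if i + 1 < len(lines) and lines[i + 1].startswith(" "):
            changes = count_changes(lines[i + 1])
            i += 3  # commit line, stats line, blank line
        else:
            changes = 0  # empty commit: no stats line follows
            i += 1
        commits.append((changes, sha, subject))
    return commits


commits = sorted(parse_log(SAMPLE), key=lambda c: c[0], reverse=True)
for changes, sha, subject in commits:
    print(changes, sha[:8], subject)
```

Keeping the parsing in a function that takes plain text makes it easy to check against pasted git log output before running it on a large repository.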

Fin

Enjoy using the above. May you find some useful commits to ignore in your own repositories!

—Adam

