The Biggest and Weirdest Commits in Linux Kernel Git History

Posted on 2017-02-12

We normally think of git merges as having two parent commits. For example, the most recent Linux kernel merge as I write this is commit 2c5d955, which is part of the run-up to release 4.10-rc6. It has two parents:

2c5d955 Merge branch 'parisc-4.10-3' of ...
|
*- 2ad5d52 parisc: Don't use BITS_PER_LONG in use ...
*- 53cd1ad Merge branch 'i2c/for-current' of ...

Git also supports octopus merges, which have more than two parents. This seems strange for those of us who work on smaller projects: wouldn't a merge with three or four parents be confusing? Well, it depends. Sometimes, a kernel maintainer needs to merge dozens of separate histories together at once. Having 30 merge commits, one after another, would be more confusing than a single 30-way merge, especially if that 30-way merge was conflict-free.

Octopuses are more common than you might expect. There are 649,306 commits in the kernel's history. 46,930 (7.2%) are merges. Of the merges, 1,549 (3.3%) are octopus merges. (This is as of commit 566cf87, which is my current HEAD.)

$ git log --oneline | wc -l
   649306
$ git log --oneline --merges | wc -l
   46930
$ git log --oneline --min-parents=3 | wc -l
    1549
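The percentages quoted above follow directly from those three counts. A minimal sketch of the arithmetic, assuming awk is available; the helper name `pct` is made up for illustration and is not part of git:

```shell
# pct PART WHOLE: print PART as a percentage of WHOLE, to one decimal place.
# (Hypothetical helper, only here to show the arithmetic.)
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

pct 46930 649306   # merges as a share of all commits
pct 1549 46930     # octopus merges as a share of all merges
```

The two calls print 7.2 and 3.3, matching the figures in the text.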

As a comparison point, 20% of all Rails commits are merges (12,401 out of 63,111), but Rails has zero octopus merges. Rails is probably more representative of the average project; I expect that most git users don't know that octopus merges are even possible.

Now, the obvious question: how big do these octopus merges get? The ">" lines here are continuations; the command is written in four lines total. All of the commands in this post are as I typed them into the terminal while experimenting, so they're not necessarily easy to read. I'm more interested in the conclusions and include code only for the curious.

$ (git log --min-parents=2 --pretty='format:%h %P' |
>  ruby -ne '/^(\w+) (.*)$/ =~ $_; puts "#{$2.split.count} #{$1}"' |
>  sort -n |
>  tail -1)
66 2cde51f
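The Ruby step in that pipeline only counts the parent hashes on each line. A rough equivalent in awk, as a sketch rather than the command that was actually run; `count_parents` is a made-up name:

```shell
# count_parents: read "HASH PARENT PARENT ..." lines (git's '%h %P' format)
# and print "N HASH", where N is the number of parents.
# Every field after the first is a parent, so N is simply NF - 1.
count_parents() {
    awk '{ print NF - 1, $1 }'
}

# Usage, inside a kernel checkout:
#   git log --min-parents=2 --pretty='format:%h %P' | count_parents | sort -n | tail -1
```

Either way, the last line after the numeric sort is the merge with the most parents.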

66 parents! That's a lot of parents. What happened?

$ git log -1 2cde51f
commit 2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
Merge: 7471c5c c097d5f 74c375c 04c3a85 5095f55 4f53477
2f54d2a 56d37d8 192043c f467a0f bbe5803 3990c51 d754fa9
516ea4b 69ae848 25c1a63 f52c919 111bd7b aafa85e dd407a3
71467e4 0f7f3d1 8778ac6 0406a40 308a0f3 2650bc4 8cb7a36
323702b ef74940 3cec159 72aa62b 328089a 11db0da e1771bc
f60e547 a010ff6 5e81543 58381da 626bcac 38136bd 06b2bd2
8c5178f 8e6ad35 008ef94 f58c4fc4 2309d67 5c15371 b65ab73
26090a8 9ea6fbc 2c48643 1769267 f3f9a60 f25cf34 3f30026
fbbf7fe c3e8494 e40e0b5 50c9697 6358711 0112b62 a0a0591
b888edb d44008b 9a199b8 784cbf8
Author: Mark Brown <[email redacted for privacy]>
Date:   Thu Jan 2 13:01:55 2014 +0000

    Merge remote-tracking branches [65 remote branch names]

This broke some history visualization tools, provoking a reaction from Linus Torvalds:

I just pulled the sound updates from Takashi, and as a result got your merge commit 2cde51fbd0f3. That one has 66 parents.

[...]

It's pulled, and it's fine, but there's clearly a balance between "octopus merges are fine" and "Christ, that's not an octopus, that's a Cthulhu merge".

From what I can see, this unusual 66-parent commit was an otherwise mundane merge of various changes to the ASoC code. ASoC stands for ALSA System on Chip. ALSA is the sound subsystem; "system on a chip" is a term for a computer packed into a single piece of silicon. Putting those together, ASoC is sound support for embedded devices.

Now, how often do merges like this happen? Never! The second-place merge is fa623d1 with "only" 30 parents. However, the large distance from 30 to 66 parents isn't surprising with sufficient context.

The number of parents for a git commit is probably distributed according to a fat one-sided distribution (often informally called a power law distribution, but that's usually not strictly correct for reasons that aren't interesting here). Many properties of software systems fall into fat one-sided distributions. Hold on; I'll generate a plot to be sure... (much nitpicking of chart layout ensues). Yes, it's fat and one-sided:

[Figure: parents per commit in the Linux kernel, log-log plot]

To be terse and coarse about it, "fat one-sided" means that there are far more small things than large things, but also that the maximum size of the things is unbounded. The kernel contains 45,381 two-parent merges, but only one 66-parent merge. Given enough additional development history, we can expect to see a merge with more than 66 parents.

Lines of code per function or per module are also fat and one-sided (most functions and modules will be small, but some will be large; think of a "User" class in a web app). Likewise for the rate of change for modules (most modules will change infrequently, but some will change constantly; think of "User" again). These distributions pop up everywhere in software, appearing as straight lines on log-log plots like this one.
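The tallies behind the parents-per-commit plot can be gathered with a short pipeline. This is a sketch under assumptions, not the script used for the chart: `parent_histogram` is a hypothetical helper, and it expects one line of parent hashes per merge.

```shell
# parent_histogram: read one line of parent hashes per commit (git's %P)
# and print "PARENTS COMMITS" pairs, smallest parent count first.
parent_histogram() {
    awk '{ tally[NF]++ } END { for (n in tally) print n, tally[n] }' | sort -n
}

# Usage, inside a kernel checkout:
#   git log --min-parents=2 --pretty='format:%P' | parent_histogram
```

Going by the counts quoted earlier, the first row should be the 45,381 two-parent merges and the last row the single 66-parent merge. Plotting both columns on logarithmic axes gives the kind of chart shown above.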

OK, so much for the biggest merge in terms of parent count. What about the biggest merge in terms of divergence? By divergence, I mean the difference between the two branches being merged. We can measure that by simply diffing the merge's parents against each other and counting the lines in the diff.

For example, if a branch diverged from master a year ago, changed one line, and then was merged back into master, all of the changes to master during that time would be counted, as would the changes on our branch. We can come up with more intuitive notions of divergence, but they're difficult or impossible to calculate because git doesn't retain branch metadata.

In any case, as a starting point for calculating divergence, here's the divergence for the most recent kernel merge:

$ git diff $(git log --merges -1 --pretty='format:%P') | wc -l
     173

In English, this command reads: "diff the two parents of the most recent merge against each other, then count the lines." To find the most-diverged merges, we can loop through every merge commit, counting the number of diff lines in a similar way. Then, as a test, we'll search the results for all merges with exactly 2,000 lines of divergence.

$ (git log --merges --pretty="%h" |
   while read x; do
     echo "$(git diff $(git log --pretty=%P $x -1) | wc -l)" $x
   done > merges.txt)
$ sort -n merges.txt | grep '\b2000\b'
    2000 3d6ce33
    2000 7fedd7e
    2000 f33f6f0

(This command takes a long time to run: around twelve hours, I think, though I was away for much of it.)
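The per-merge measurement inside that loop can be pulled out into a small function, which makes it easy to spot-check one merge without waiting for the whole run. A minimal sketch, assuming we only care about a merge's first two parents (the loop above passes every parent to git diff, so results for octopus merges may differ); `divergence` is a made-up name, while `^1` and `^2` are git's standard notation for a commit's first and second parent:

```shell
# divergence MERGE: count the diff lines between MERGE's first two parents.
divergence() {
    # tr strips the padding that some wc implementations add
    git diff "$1^1" "$1^2" | wc -l | tr -d ' '
}

# Usage, inside a kernel checkout:
#   divergence f44dd18
```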

I expect merge size to follow a fat one-sided distribution, just like the parent counts did. It should show up as a straight line on a log-log plot. Let me check... yep:

[Figure: merges per diff length, log-log plot]

(I've binned the diff sizes by rounding them into 1,000-line buckets; otherwise there aren't enough samples to form a useful curve.)

The bottom right is ugly partly due to quantization and partly due to small sample sizes caused by a lack of huge commits, as with the previous plot.
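The binning step can be sketched as follows. This assumes "rounding" means rounding to the nearest 1,000 lines, which is a guess at the exact rule used for the plot; `bucket_merges` is a hypothetical helper that reads the "LINES HASH" rows of merges.txt.

```shell
# bucket_merges: read "LINES HASH" rows and print "BUCKET MERGES", where
# BUCKET is LINES rounded to the nearest 1,000 (assumed rounding rule).
bucket_merges() {
    awk '{ b = int(($1 + 500) / 1000) * 1000; n[b]++ }
         END { for (b in n) print b, n[b] }' | sort -n
}

# Usage:
#   bucket_merges < merges.txt
```

Each output row is one point on the plot: the bucket is the x value and the number of merges in it is the y value.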

Now, the obvious question: what's the most-diverged merge in history?

$ sort -n merges.txt | tail -1
 22445760 f44dd18

22,445,760 lines of diff! This seems impossibly large – the diff is longer than the entire source code of the kernel.

Greg Kroah-Hartman made this commit on September 19, 2016, during development of 4.8-rc6. Greg is one of Linus Torvalds' "lieutenants" – his close, trusted developers. Roughly speaking, lieutenants form the first level of the kernel's pull request tree. Greg maintains the stable branch of the kernel, the driver core, the USB subsystem, and several other subsystems.

We need a bit of background before examining this merge more closely. Normally, we think of merges as part of a diamond branch-then-merge pattern:

  A
 / \
B   C
 \ /
  D

Back in 2014, Greg started development on Greybus (a bus for mobile devices) in a fresh repo, as if he were starting a totally new project. Eventually, development on Greybus was finished, and it was merged into the kernel. But, because it was started in a fresh repo, it shared no history with the rest of the kernel source. That merge added another "initial commit" to the kernel, in addition to the commit back in 2005 that we normally think of as the initial commit. Instead of the usual diamond branch-and-merge pattern, the repo now had two separate initial commits:

B   C
 \ /
  D

We can see some evidence of this by looking at how many files exist in each of the merge commit's parents:

$ git log -1 f44dd18 | grep 'Merge:'
Merge: 9395452 7398a66
$ git ls-tree -r 9395452 | wc -l
   55499
$ git ls-tree -r 7398a66 | wc -l
     148

One side has a lot of files because it contains the entire kernel source. The other contains few because it's a separate history containing only Greybus.
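A more direct check for unrelated histories is to ask git for a common ancestor of the two parents. The sketch below relies on `git merge-base` exiting with status 1 when no common ancestor exists; treat that exit-status behavior as an assumption to verify against your git version. `unrelated` is a made-up helper name.

```shell
# unrelated A B: succeed if commits A and B share no common ancestor.
# Assumes `git merge-base` exits with status 1 in that case; any other
# status (success, or an error such as a bad revision) counts as "related
# or unknown" and makes this function fail.
unrelated() {
    git merge-base "$1" "$2" >/dev/null 2>&1
    [ $? -eq 1 ]
}

# Usage, inside a kernel checkout:
#   unrelated 9395452 7398a66 && echo "no shared history"
```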

Like octopus merges, this will strike some git users as strange. But the kernel developers are expert git users and tend to use its features with abandon, though certainly not reckless abandon.

One final question: how many times has this happened? How many separate "initial" commits does the kernel have? Four, as it turns out:

$ git log --max-parents=0 --pretty="format:%h %cd %s" --date=short
a101ad9 2016-02-23 Share upstreaming patches
cd26f1b 2014-08-11 greybus: Initial commit
be0e5c0 2007-01-26 Btrfs: Initial checkin, basic working tree code
1da177e 2005-04-16 Linux-2.6.12-rc2

Just to be clear, if we drew these commits, ignoring all other history, it would look like the graph below.

566cf87 (the current HEAD)
| | | |
| | | *- a101ad9 Share upstreaming patches
| | |
| | *- cd26f1b greybus: Initial commit
| |
| *- be0e5c0 Btrfs: Initial checkin, basic working tree code
|
*- 1da177e Linux-2.6.12-rc2

Each of these four is a distant ancestor of the current kernel HEAD, and none of them has a parent commit. From git's perspective, the kernel history "begins" four different times, with all of those eventually being merged together.

The first of these four (at the bottom of our output) is what we usually think of as the kernel's initial commit, back in 2005. The second is the development of the file system btrfs, which was done in isolation. The third is Greybus, also done in isolation, which we already saw.

The fourth initial commit, a101ad9, is weird. Here it is:

$ git show --oneline --stat a101ad9
a101ad9 Share upstreaming patches
 README.md | 2 ++
 1 file changed, 2 insertions(+)

It just creates a file README.md. But then, it's immediately merged into the normal kernel history in commit e5451c8!

$ git show e5451c8
commit e5451c8f8330e03ad3cfa16048b4daf961af434f
Merge: a101ad9 3cf42ef
Author: Laxman Dewangan <ldewangan@nvidia.com>
Date:   Tue Feb 23 19:37:08 2016 +0530

Why would someone create a new initial commit containing a two-line README file, then immediately merge it into the mainline history? I can't come up with any reason, so I suspect that this was an accident! It doesn't do any harm, though; it's just very strange. (Update: it was an accident, which Linus responded to in his usual fashion.)

Incidentally, this is also the second-most-diverged commit in the history, simply because it's a merge of an unrelated commit, just like the Greybus merge that we looked at more closely.

There you have it: some of the weirdest things in the Linux kernel's git history. There are 1,549 octopus merges, one of which has 66 parents. The most heavily diverged merge has 22,445,760 lines of diff, though it's a bit of a technicality because it shares no history with the rest of the repo. The kernel has four separate "initial" commits, one of which was a mistake. None of this will show up in the vast majority of git repos, but all of it is well within git's design parameters.

(If you liked this post, you might like the Destroy All Software screencasts, several of which build up the kinds of complex shell commands seen in this post, methodically and piece by piece. "History Spelunking With Unix" is especially relevant.)