Note that this is not in any way opening its own source code as a file, or through shell magic, or anything like that; in fact, it's not looking at its own code at all. It really generates its own hash computationally out of itself, and it would work the same way if this were a compiled C program computing (not containing!) the hash of its own source code, even if the source code had been thrown away.
This may seem impossible at first, but it's really just a funny variant of a quine (a program that prints its own source). David Madore came up with the idea and explains the concept at http://www.madore.org/~david/computers/quine.html.
[1] You could trivially change it to output its SHA-2 hash instead (SHA-2 just wasn't as common at the time I did this), but then you'd lose the particularly "nice" hash. That property is entirely unrelated to outputting its own hash, and is achieved the same way as the commit hashes linked here.
You can easily adjust quines to output any f(program_contents), including f=some_hash_function.
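To get a feel for how that works without shell obfuscation, here is a minimal Python sketch (my own illustration, not the one-liner elsewhere in this thread): the program rebuilds its complete source text from a template and hashes that text, never touching the filesystem. Saved exactly as shown, the printed digest equals the SHA-256 of the file.

```python
import hashlib
# A template with %r where its own repr will be substituted.
s = 'import hashlib\n# A template with %%r where its own repr will be substituted.\ns = %r\nsrc = s %% s  # src is now the full source text\nprint(hashlib.sha256(src.encode()).hexdigest())\n'
src = s % s  # src is now the full source text
print(hashlib.sha256(src.encode()).hexdigest())
```

Swapping `hashlib.sha256` for any other function of the source text gives you the general f(program_contents) construction.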
Also, quines are related to fixed-point combinators, the most famous of which is the Y combinator (in case you didn't know where the name of this fund/forum comes from).
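For the curious, a quick sketch of the connection: a fixed-point combinator lets an anonymous function refer to itself. In a strictly evaluated language like Python you need the eta-expanded variant of Y, usually called the Z combinator:

```python
# Z combinator: the strict-evaluation variant of the Y fixed-point combinator.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Anonymous recursion: factorial without the function ever naming itself.
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(fact(5))  # 120
```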
Cool site, but man, it's all just text. Semi-related: the closest concept I'm aware of to code that writes itself is hot loading (e.g. in Java), where code can be modified while it's running.
As the README.md acknowledges, the usefulness may be limited, except for the fun in experimentation. What may not be obvious to casual Git users is that, while it may take 2^28 commits to fill up the entire address space of the 7-character shorthash, shorthashes are not designed to be unique (they are just the first part of the longer, unique hash). As a result, even relatively small repositories often already have _some_ duplicate shorthashes, and people scripting around Git shorthashes must be prepared to deal with longer ones: 8 characters, 9, 10, 11, whatever it takes to disambiguate. A random Git repository of mine with a mere 16,865 commits (well, that's just "master"), nothing out of the ordinary, needs shorthashes of up to 11 characters to disambiguate all of them. (Not all the clashes may be on the same or main branch.)
Does anyone use short hashes for scripting? If the computer is dealing with it, you might as well use the whole thing. I only use a short hash for human-searching type activities, where I sanity-check the commit message (I'm probably reviewing code or doing some sort of blame exercise).
You can have two colliding full length hashes in a row, too. It's just not very likely.
7-8 character shorthashes are perfectly reasonable for tracking a few thousand objects without much chance of collision. There's a tradeoff between the uniqueness guarantee and the friendliness if humans ever need to see the identifier.
Of course, sequential identifiers can be even shorter and friendlier, but they are more troublesome in several ways than hash-derived ones.
Exactly this. Using a short hash to identify builds generally seems to be a pattern (and one that usually works fine in practice).
It gets a bit dicey once you set up CI to start generating builds/images off every commit to master, and then off every push to every branch. In practice, though, I haven't seen it bite on any of the projects I've been on. Usually something will break, and then you just update the scripts to use N+1 characters.
In theory, a short-sighted script without safeties could do something wacky like deploy an image that's several months/years old. Running the numbers suggests something so catastrophic is quite unlikely.
I'm not the parent commenter, but I'm sure they know that.
I think their point is that the image which a tag refers to can be changed in the repository, so if you're concerned someone might generate a Git commit with a colliding hash and produce another build with the same hash, there is in fact a much simpler attack scenario: replace the container image in the repository with any different image with the same tag.
The syntax they provided identifies the image by its own hash, so producing a different image with the same hash is _much_ more difficult.
According to the Birthday Problem rule of thumb, you would expect collisions starting around sqrt(2^28) = 2^14 = 16384 commits. About the size of your repo!
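That rule of thumb is easy to check numerically. A small sketch using the standard birthday approximation (the function name is mine):

```python
import math

def collision_probability(n: int, bits: int = 28) -> float:
    """Probability of at least one collision among n random `bits`-bit values,
    via the birthday approximation p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return 1 - math.exp(-n * n / (2 * 2**bits))

# 7 hex digits = 28 bits; at n = 2^14 the exponent is exactly -1/2.
print(f"{collision_probability(16384):.3f}")  # 0.393
```

So at 2^14 commits a collision is already more likely than not to be close: about 39%, crossing 50% near 19,000 commits.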
> The repository has so many commits that git push hangs and runs out of memory, presumably because it tries to regenerate a packfile on the fly.
Any guesses if this would also happen if you tried to push it bit-by-bit? (although you'd of course need reasonably large groups of commits still, to not end up with an impossible number of pushes)
Note that collisions in short hash are not actually a problem as such.
> Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous
I feel I have seen much smaller repositories than that with ambiguous commits. Case in point: I'm looking at one here with 56 commits in the reflog, already using 8-character shorthashes.
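You can check this on any repository. One way (assuming `awk` and the usual coreutils are available) is to tally the abbreviation lengths Git actually emits, since `%h` is the shortest unambiguous abbreviation per commit:

```shell
# Tally how long Git's unique abbreviations are across all commits.
git log --format=%h | awk '{ print length($0) }' | sort -n | uniq -c
```

Each output row is a count followed by a length; any row with a length above 7 means the 7-character prefix was ambiguous for those commits.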
AFAIK, Google's g4 is the successor to p4 (Perforce), and it uses a monotonically increasing number (the "CL") instead of a hash. From what I can remember, it was one way to specify a build - e.g. this binary was built with CL=12345 (baseline) + cherry-picked (e.g. locally checked out) CLs 123450, 123455, 123460 - hence one would always know where it sits in the timeline of things based on just this number.
Generously, there are about 40,000 people at Google who might commit to the monorepo. Exhausting 2^28 would take only about 6,700 commits per person, a fairly achievable number. Although since they're not purposely generating every shorthash, it would take significantly longer for the absolute last unique shorthash to appear.
> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence.
> Google's codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems.
Note that this was 5 years ago. If you look at the number of commits over time, it is growing exponentially, about sevenfold over the last 5 years:
> If you assume another sevenfold increase over the past 5 years, then they should have around 250 million commits now.
Decent guess. Rate's gone up; they're at about 382 million now. [1] You can see the current number in the PiperOrigin-RevId: on several public repositories that get changes synced from Google's internal Piper.
I'm not looking at that paper (as I'm about to walk out the door); one thing to check is if the number there includes just the actual commits or also the pending CL numbers (same concept as Perforce). The number I'm quoting includes both.
A bit off topic, but does this mean that each of 25k Google developers has a copy of the entire codebase on their laptops? Or is there a way to cherrypick parts of it?
No, because the repo was 90TB as of 2015. Also because that company’s policy forbids any source code on mobile devices, or at least did when I was there.
Note that perforce is not like git. One does not generally clone the whole repo. You create client views that contain the files you want to edit. Everything else stays on the server.
In Perforce you can do that. Indeed, often you have to, as you will have permissions to see only certain paths. One can assume g4 is similar.
Paths are also how one does tags and branches in old-school tools like Subversion (just copy your branches/master to tags/releaseX and then never change it).
If a commit touches many files, do the hashes for the blobs and trees contribute to the ambiguity? E.g. does Git choose the commit over the tree if you checkout a short hash that could be either?
You answered your own question. Around 2^14 = 16,384 commits the collision probability is already about 40% (it crosses 50% near 19,000 commits). So most large repos will have collisions.
This isn't a problem for Git. It accepts arbitrary prefixes so you'll just have to type a few more characters if you want to refer to a commit whose 28-bit prefix (7 hex digits) is not unique.
This number may be even lower if you take the birthday problem into account. I'm not a statistics guy, so I can't confirm that or do the proper calculations, but I believe it applies to this case as well, because the first few bits of a hash are like what a birthday is to an otherwise unique person.
An assignment in one of my university security courses was to mine "gitcoin".
This was a Git-based proof of work: the server would only accept a push if the new commit's hash had more leading zeros than the previous commit's on that branch.
Git is a Blocktree - a type of directed acyclic graph based proof of nothing crypto product that is invulnerable to fork based attacks by supporting it out of the box. /s
I commonly use Git as an example of the blockchain data structure being in use before Bitcoin, and of how that structure is really not the part that made Bitcoin innovative. It sure ended up as a powerful brand, though.
It cost 6,500 CPU-years to create the first collision containing a valid PDF document. Storing both versions in Git still works, because Git prepends a header before computing the SHA-1 of the blob. I believe that because of this, it is harder to create a Git blob collision than a PDF collision. (I admit I neither read very carefully nor tried to think it through very seriously.)
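You can see the header's effect directly. In this small sketch (the content string is my own choice), the blob hash matches what `git hash-object` prints for the same content, and it differs from the plain SHA-1 of the bytes:

```python
import hashlib

content = b"hello\n"

# Plain SHA-1 of the file content.
plain = hashlib.sha1(content).hexdigest()

# Git's blob hash: SHA-1 over a "blob <size>\0" header plus the content.
blob = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

print(plain)  # f572d396fae9206628714fb2ce00f72e94f2258f
print(blob)   # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the colliding PDFs' bytes get this header prepended before hashing, the carefully crafted SHA-1 internal state no longer lines up, and the two blobs hash differently.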
A=shasum; C=echo; I='($C "A$s$A$X C$s$C$X I$s$Q$I$Q$X Q$s$k$Q$k$X k$s$Q$k$Q$X X$s$Q$X$Q$X s$s$Q$s$Q$X $I" | $A)'; Q="'"; k='"'; X=';'; s='='; ($C "A$s$A$X C$s$C$X I$s$Q$I$Q$X Q$s$k$Q$k$X k$s$Q$k$Q$X X$s$Q$X$Q$X s$s$Q$s$Q$X $I" | $A)