Note that this is not in any way opening its own source code as a file, or through shell magic, or anything like that; in fact, it's not looking at its own code at all. It really generates its own hash computationally out of itself, and it would work the same way if this were a compiled C program computing (not containing!) the hash of its own source code, even if the source code had been thrown away.
This may seem impossible at first, but it's really just a funny variant of a quine (a program that prints its own source). David Madore came up with the idea and explains the concept at http://www.madore.org/~david/computers/quine.html.
[1] You could trivially change it to output its SHA-2 hash instead (SHA-2 just wasn't as common at the time I did this), but then you'd lose the particularly "nice" hash. That property is entirely unrelated to outputting its own hash, and is achieved the same way as the commit hashes linked here.
You can easily adjust quines to output any f(program_contents), including f=some_hash_function.
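To get a feel for how that works without shell obfuscation, here is a minimal Python sketch (my own illustration, not the one-liner elsewhere in this thread): the program rebuilds its complete source text from a template and hashes that text, never touching the filesystem. Saved exactly as shown, the printed digest equals the SHA-256 of the file.

```python
import hashlib
# A template with %r where its own repr will be substituted.
s = 'import hashlib\n# A template with %%r where its own repr will be substituted.\ns = %r\nsrc = s %% s  # src is now the full source text\nprint(hashlib.sha256(src.encode()).hexdigest())\n'
src = s % s  # src is now the full source text
print(hashlib.sha256(src.encode()).hexdigest())
```

Swapping `hashlib.sha256` for any other function of the source text gives you the general f(program_contents) construction.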
Also, quines are related to fixed-point combinators, the most famous of which is the Y combinator (in case you didn't know where the name of this fund/forum comes from).
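For the curious, a quick sketch of the connection: a fixed-point combinator lets an anonymous function refer to itself. In a strictly evaluated language like Python you need the eta-expanded variant of Y, usually called the Z combinator:

```python
# Z combinator: the strict-evaluation variant of the Y fixed-point combinator.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Anonymous recursion: factorial without the function ever naming itself.
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(fact(5))  # 120
```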
Cool site, but man, it's all just text. Semi-related: the closest concept I'm aware of to code that writes itself is hot loading (e.g. in Java), where code can be modified while it's running.
As the README.md acknowledges, the usefulness may be limited, except for the fun in experimentation. What may not be obvious to casual Git users is that, while it may take 2^28 commits to fill up the entire address space of the 7-character shorthash, shorthashes are not designed to be unique (they are just the first part of the longer, unique hash). As a result, even relatively small repositories often already have _some_ duplicate shorthashes, and people scripting around Git shorthashes must be prepared to deal with longer ones: 8 characters, 9, 10, 11, whatever it takes to disambiguate. A random Git repository of mine with a mere 16,865 commits (well, that's just "master"), nothing out of the ordinary, needs shorthashes of up to 11 characters to disambiguate all of them. (Not all the clashes may be on the same or main branch.)
Does anyone use short hashes for scripting? If the computer is dealing with it, you might as well use the whole thing. I only use a short hash for human-searching type activities, where I sanity-check the commit message (I'm probably reviewing code or doing some sort of blame exercise).
You can have two colliding full length hashes in a row, too. It's just not very likely.
7-8 character shorthashes are perfectly reasonable for tracking a few thousand objects without much chance of collision. There's a tradeoff between the uniqueness guarantee and the friendliness if humans ever need to see the identifier.
Of course, sequential identifiers can be even shorter and friendlier, but they are more troublesome in several ways than hash-derived ones.
Exactly this. Using a short hash to identify builds generally seems to be a pattern (and one that usually works fine in practice).
It gets a bit dicey once you set up CI to start generating builds/images off every commit to master, and then off every push to every branch. In practice, though, I haven't seen it bite on any of the projects I've been on. Usually something will break, and then you just update the scripts to use N+1 characters.
In theory, a short-sighted script without safeties could do something wacky like deploy an image that's several months/years old. Running the numbers suggests something so catastrophic is quite unlikely.
I'm not the parent commenter, but I'm sure they know that.
I think their point is that the image which a tag refers to can be changed in the repository, so if you're concerned someone might generate a Git commit with a colliding hash and produce another build with the same hash, there is in fact a much simpler attack scenario: replace the container image in the repository with any different image with the same tag.
The syntax they provided identifies the image by its own hash, so producing a different image with the same hash is _much_ more difficult.
According to the Birthday Problem rule of thumb, you would expect collisions starting around sqrt(2^28) = 2^14 = 16384 commits. About the size of your repo!
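That rule of thumb is easy to check numerically. A small sketch using the standard birthday approximation (the function name is mine):

```python
import math

def collision_probability(n: int, bits: int = 28) -> float:
    """Probability of at least one collision among n random `bits`-bit values,
    via the birthday approximation p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return 1 - math.exp(-n * n / (2 * 2**bits))

# 7 hex digits = 28 bits; at n = 2^14 the exponent is exactly -1/2.
print(f"{collision_probability(16384):.3f}")  # 0.393
```

So at 2^14 commits a collision is already more likely than not to be close: about 39%, crossing 50% near 19,000 commits.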
> The repository has so many commits that git push hangs and runs out of memory, presumably because it tries to regenerate a packfile on the fly.
Any guesses if this would also happen if you tried to push it bit-by-bit? (although you'd of course need reasonably large groups of commits still, to not end up with an impossible number of pushes)
Note that collisions in short hash are not actually a problem as such.
> Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous
I feel I have seen much smaller repositories than that with ambiguous commits. Case in point: I'm looking at one here with 56 commits in the reflog, already using 8-character shorthashes.
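You can check this on any repository. One way (assuming `awk` and the usual coreutils are available) is to tally the abbreviation lengths Git actually emits, since `%h` is the shortest unambiguous abbreviation per commit:

```shell
# Tally how long Git's unique abbreviations are across all commits.
git log --format=%h | awk '{ print length($0) }' | sort -n | uniq -c
```

Each output row is a count followed by a length; any row with a length above 7 means the 7-character prefix was ambiguous for those commits.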
AFAIK, Google's g4 is the successor to p4 (Perforce), and it uses a monotonically increasing number (the "CL") instead of a hash. From what I can remember, it was one way to specify a build - e.g. this binary was built with CL=12345 (baseline) + cherry-picked (e.g. locally checked out) CLs 123450, 123455, 123460 - hence one would always know where it sits in the timeline of things based on just this number.
Generously, there are about 40,000 people at Google who might commit to the monorepo. Exhausting 2^28 would take only about 6,700 commits per person, a fairly achievable number. Although since they're not purposely generating every shorthash, it would take significantly longer for the absolute last unique shorthash to appear.
> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence.
> Google's codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems.
Note that this was 5 years ago. If you look at the number of commits over time, it is growing exponentially, about sevenfold over the last 5 years:
> If you assume another sevenfold increase over the past 5 years, then they should have around 250 million commits now.
Decent guess. Rate's gone up; they're at about 382 million now. [1] You can see the current number in the PiperOrigin-RevId: on several public repositories that get changes synced from Google's internal Piper.
I'm not looking at that paper (as I'm about to walk out the door); one thing to check is if the number there includes just the actual commits or also the pending CL numbers (same concept as Perforce). The number I'm quoting includes both.
A bit off topic, but does this mean that each of 25k Google developers has a copy of the entire codebase on their laptops? Or is there a way to cherrypick parts of it?
No, because the repo was 90TB as of 2015. Also because that company’s policy forbids any source code on mobile devices, or at least did when I was there.
Note that perforce is not like git. One does not generally clone the whole repo. You create client views that contain the files you want to edit. Everything else stays on the server.
In Perforce you can do that. Indeed, often you have to, as you will have permissions to see only certain paths. One can assume g4 is similar.
Paths are also how one does tags and branches in old-school tools like Subversion (just copy your branches/master to tags/releaseX and then never change it).
If a commit touches many files, do the hashes for the blobs and trees contribute to the ambiguity? E.g. does Git choose the commit over the tree if you checkout a short hash that could be either?
You answered your own question. Around 2^14 = 16,384 commits the collision probability is already about 40% (it crosses 50% near 19,000 commits). So most large repos will have collisions.
This isn't a problem for Git. It accepts arbitrary prefixes so you'll just have to type a few more characters if you want to refer to a commit whose 28-bit prefix (7 hex digits) is not unique.
This number may be even lower if you take the birthday problem into account. I'm not a statistics guy, so I can't confirm that or do the proper calculations, but I believe it applies to this case as well, because the first few bits of a hash are like what a birthday is to an otherwise unique person.
An assignment in one of my university security courses was to mine "gitcoin".
This was a Git-based proof of work: the server would only accept a push if the new commit's hash had more leading zeros than the previous commit's on that branch.
Git is a Blocktree - a type of directed acyclic graph based proof of nothing crypto product that is invulnerable to fork based attacks by supporting it out of the box. /s
I commonly use Git as an example of the blockchain data structure being in use before Bitcoin, and of how that structure is really not the part that made Bitcoin innovative. It sure ended up as a powerful brand, though.
It cost 6,500 CPU-years to create the first collision containing a valid PDF document. Storing both versions in Git still works, because Git prepends a header before computing the SHA-1 of the blob. I believe that because of this, it is harder to create a Git blob collision than a PDF collision. (I admit I neither read very carefully nor tried to think it through very seriously.)
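You can see the header's effect directly. In this small sketch (the content string is my own choice), the blob hash matches what `git hash-object` prints for the same content, and it differs from the plain SHA-1 of the bytes:

```python
import hashlib

content = b"hello\n"

# Plain SHA-1 of the file content.
plain = hashlib.sha1(content).hexdigest()

# Git's blob hash: SHA-1 over a "blob <size>\0" header plus the content.
blob = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

print(plain)  # f572d396fae9206628714fb2ce00f72e94f2258f
print(blob)   # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the colliding PDFs' bytes get this header prepended before hashing, the carefully crafted SHA-1 internal state no longer lines up, and the two blobs hash differently.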
A=shasum; C=echo; I='($C "A$s$A$X C$s$C$X I$s$Q$I$Q$X Q$s$k$Q$k$X k$s$Q$k$Q$X X$s$Q$X$Q$X s$s$Q$s$Q$X $I" | $A)'; Q="'"; k='"'; X=';'; s='='; ($C "A$s$A$X C$s$C$X I$s$Q$I$Q$X Q$s$k$Q$k$X k$s$Q$k$Q$X X$s$Q$X$Q$X s$s$Q$s$Q$X $I" | $A)