Git fscked by SHA-1 collision? Not so fast, says Linus Torvalds

Attack is hard, discovery is easy, so fix it right rather than right now

About that SHA-1 collision: Linus Torvalds has taken to Google+ to emphasise that in Git, the hash's main role is error detection, so “the sky isn't falling.”

The weak hashing algorithm is used, among other things, to provide a digital signature for software, documents like PDFs, and encryption certificates. The mathematical operation should produce a unique 40-character hexadecimal summary for any given input data, but CWI and Google's work showed it is now practical to produce SHA-1 “collisions” – for example, two different PDFs gave the same SHA-1 hash.
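
To make that 40-character summary concrete, here is a minimal Python sketch using the standard library's hashlib. The input strings are arbitrary examples of ours, not anything from the CWI and Google research; the point is only that a SHA-1 digest is 160 bits, written as 40 hexadecimal characters, and that a collision means two different inputs sharing one digest.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


# Arbitrary example inputs (assumptions for illustration only).
digest_a = sha1_hex(b"abc")
digest_b = sha1_hex(b"abd")

print(digest_a)       # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(digest_a))  # 40

# Ordinary inputs give different digests; a collision is the
# engineered case where two different inputs give the same one.
print(digest_a == digest_b)  # False
```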

In the Git software repository system – authored by Torvalds – the SHA-1 hash of a repo's tree can be used to check you are fetching the code you think you're fetching.

An attacker could, using CWI and Google's findings, insert a backdoor in a program's source code in such a way that its repo's hash value collides with the hash value of a legit copy of the source – the hash value is the same for both despite the code being different. Thus, a victim fetching the poisoned source would believe it's an untampered “safe” repo, because its hash matches the official legit tree, and would end up building and running a backdoored version.

Not so fast, Torvalds writes. Here's what he said:

First off, the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a 'content identifier' for a content-addressable system like git.
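
As a rough illustration of what a “content identifier” means here: Git names a file's contents (a “blob”) by hashing a short header plus the bytes themselves. The sketch below reproduces that calculation in Python. The function name git_blob_id is ours, and it covers only blobs, not the trees and commits that point at them.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content.

    Git hashes the header "blob <size>" followed by a NUL byte,
    then the raw content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID in every SHA-1 Git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Changing a single byte of content yields a different name.
print(git_blob_id(b"int x = 1;\n") == git_blob_id(b"int x = 2;\n"))  # False
```

The same bytes always get the same name, which is what makes the store content-addressable; running git hash-object on a file should report the same value.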

While the algorithm does have a security role even in Git, error detection is more important, Torvalds says, because trust should be a function of the community.

If you fetch a Linux kernel from Linus's repo, it's because that's where you expect the authoritative kernel to be: the hash is there so if someone tampered with the source, the hash won't match, and you know something's wrong. Torvalds sees SHA-1 hashes of repos more as a checksum than proof the data you fetched is the data in the official repo, though.

“In contrast, in a project like git, the hash isn't used for 'trust'. I don't pull on people's trees because they have a hash of a4d442663580. Our trust is in people, and then we end up having lots of technology measures in place to secure the actual data,” he said.

“Think of [SHA-1] like 'parity on steroids': it's not able to correct for errors, but it's really really good at detecting corrupt data.” Torvalds continued:

Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there's already been two sets of patches posted for that mitigation.

The reason mitigation is “pretty easy”, he explained, is that to generate a collision, the attacker has to control both the “good” object and the “bad” object – and an attack is detectable on both sides of the collision.

For Git, this makes attacks hard to hide, unlike in the PDF-based proof of concept: “The PDF format acted as the 'black box', and what you see is the printout which has only a very indirect relationship to the PDF encoding.”

In other words, it's easy to stash within the PDF file magic data required to keep the SHA-1 value of the document the same after it has been tampered with. However, that binary data isn't easy to hide in source code without it being spotted by humans or upsetting the toolchain or compiler.

In Git, if “somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice,” Torvalds said, and even if someone found somewhere to hide a collision, git fsck “already warns about those kinds of shenanigans.”
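
The most basic consistency check on an object store is simply to recompute each object's hash and compare it with the name it is filed under. The Python sketch below does that for loose objects only. It is not git fsck's implementation and does not reproduce its warnings about suspicious content; the helper names are ours, and packfiles are ignored entirely.

```python
import hashlib
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, kind: bytes, payload: bytes) -> str:
    """Store a payload in Git's loose-object layout and return its ID."""
    raw = kind + b" " + str(len(payload)).encode("ascii") + b"\0" + payload
    object_id = hashlib.sha1(raw).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return object_id


def find_corrupt_objects(objects_dir: Path) -> list:
    """Return IDs of loose objects whose bytes no longer hash to their name."""
    corrupt = []
    # Loose objects live at <first two hex digits>/<remaining 38 digits>.
    for path in sorted(objects_dir.glob("[0-9a-f][0-9a-f]/*")):
        if not path.is_file():
            continue
        expected = path.parent.name + path.name
        try:
            raw = zlib.decompress(path.read_bytes())
        except zlib.error:
            # Data that will not even decompress counts as corrupt.
            corrupt.append(expected)
            continue
        if hashlib.sha1(raw).hexdigest() != expected:
            corrupt.append(expected)
    return corrupt


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        object_id = write_loose_object(store, b"blob", b"int main(void) { return 0; }\n")
        print(find_corrupt_objects(store))  # []

        # Simulate tampering: replace the stored bytes but keep the file name.
        tampered = b"blob 5\0evil\n"
        (store / object_id[:2] / object_id[2:]).write_bytes(zlib.compress(tampered))
        print(find_corrupt_objects(store) == [object_id])  # True
```

A check like this only catches content that no longer matches its name. A true collision would pass it by construction, which is why the extra warnings about odd-looking data matter.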

In a lengthy mailing list discussion, Torvalds elaborates: “You also need to make the non-opaque data of the bad object be something that actually encodes valid git data with interesting hashes in it (for the parent/tree/whatever pointers).”

And even then, discovery is trivial – kernel.org runs git fsck nightly to see if anyone's fooling around with code. He continued:

And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.

“There's a plan, it doesn't look all that nasty, and you don't even have to convert your repository”, he notes. On the mailing list, he says SHA3-256 looks like a sensible replacement hash function. ®

Updated to add

It's been brought to our attention that one reason for Linus's “settle down” message could be a complaint by GNU luminary John Gilmore that he tried to warn against SHA-1 in 2005.

“[Torvalds] wired assumptions about SHA-1 deeply into git. In the next few years, nasty people will teach him the threat model, with ungentle manipulations of his and many other peoples' source trees,” Gilmore said.


Biting the hand that feeds IT © 1998–2017