[ git list cc'd, just because maybe somebody was interested in seeing
  what looks very much like the resolution of this issue.

  For people who haven't followed (the private) email exchanges with
  Sergio about his git corruption, what was going on is that we
  initially were able to re-generate all but one object (and the objects
  that were dependent on it through deltas - there were three deltas
  against it, but two of the deltas happened to delete the corruption
  and thus only one of the dependent objects actually ended up having
  the wrong data, in addition to the primary corrupted object, of
  course).

  Junio pinpointed what the filesystem name of that primary object was,
  and Sergio actually had the original object in his working tree, so
  Junio could then generate a new pack with the one corrupted object
  fixed, which obviously meant that all the deltas now worked too.

  This is my (probably final) analysis of the resulting differences. ]

On Wed, 30 Aug 2006, Junio C Hamano wrote:
>
> Ok, I was going to attach the resurrected pack that should
> contain everything your corrupt pack had, but it is a bit too
> large, so I'll place it here [*1*].  Drop me a note when you
> retrieved it, so that I can remove it.

VERY interesting.

The pack you generated looks very different from the original corrupt
pack, but when I compare it to one that I generated _from_ that
corrupted original by forcing a full repack, they are actually very
very similar.

[ Side note: the explanation for the difference between my repacked
  version and the original corrupt pack in turn seems to be fairly
  straightforward: the older objects in the original pack were created
  with an older version of git that defaulted to maximum zlib
  compression, while the newer parts of the pack were done with the
  current default of "default compression".

  So when Junio re-created the packfile from scratch with a modern git
  version, it would now re-compress everything with the modern
  compression value, and thus the byte-stream of Junio's pack would look
  very different from Sergio's original one for the older objects.

  So what I did was to repack the _corrupt_ data with a modern git, to
  be able to compare the two sanely. ]

In fact, doing a hexdump with 'od', and diffing the results, here's the
differences:

--- od.unfixed 2006-08-30 10:11:36.000000000 -0700
+++ od.fixed 2006-08-30 09:33:02.000000000 -0700
@@ -3358,7 +3358,7 @@
 0150720 2c0c bc8b 2f07 3733 767d 35bb 6bb6 4b28
 0150740 0e3c 0ddc 955b eb2d 57e0 754b b1ec b9f7
 0150760 8ac8 87bf 9d44 fd11 a041 c4ea cd1d 26de
-0151000 94ea 9cf3 7dcd b596 13a3 61bb 48db e69e
+0151000 96ea 9cf3 7dcd b596 13a3 61bb 48db e69e
 0151020 21e9 1288 6294 8dd4 8b3c 6cc6 e4e7 518f
 0151040 6166 3e2d d635 b631 9ec6 0613 aee3 caab
 0151060 d4fd 3ab3 6e18 c43a 8dee 15b9 bc6d 0748
@@ -8455,7 +8455,7 @@
 0410140 9f30 2514 f447 0c0d 477e 27c7 a0e1 c9ec
 0410160 c0b1 f352 76dd 4ff6 24b1 03c2 0ed2 363a
 0410200 034f 637c 8e11 0c86 e2db 2625 75c4 508b
-0410220 6fdb bf01 af2d b23b fbb7 0128 49bd 2bd8
+0410220 6fdb bf01 54b6 b23d fbb7 0128 49bd 2bd8
 0410240 a76b bca3 7df2 a4c8 e8b9 9081 1d01 a778
 0410260 9cad 564d 6f1b 4518 7605 e563 c501 0995
 0410300 231a 0151 5255 f6ce ccee ec07 8a22 25eb
@@ -394626,7 +394626,7 @@
 30054020 4951 6272 665e 7a6e 6272 517e b1ad d1e4
 30054040 a55c 9a40 fd62 a2a9 c925 99f9 79b5 d5aa
 30054060 5c0a 4010 03b4 d6a1 a064 b323 f74a c6cd
-30054100 df78 a219 27c7 f205 0200 8970 3307 ae17
-30054120 5e11 8f8c e015 00f8 5d51 9152 0d07 79b2
-30054140 45dd
+30054100 df78 a219 27c7 f205 0200 8970 3307 4599
+30054120 3854 bacb 704e 7312 11e1 60e3 c14b 0a90
+30054140 e22b
 30054142

You can ignore the last hunk, that's just the SHA1 of the pack itself,
and that will obviously differ in all 20 bytes, so that difference is
not interesting.

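For anyone wondering why that last hunk has to differ: a pack file ends with a 20-byte SHA-1 computed over everything that precedes it. A minimal Python sketch of that check follows. The byte strings are stand-ins rather than real pack data, and the helper names (`with_trailer`, `trailer_ok`, `flip_bit`) are made up for illustration.

```python
import hashlib

TRAILER_LEN = 20  # a SHA-1 digest is 20 bytes


def with_trailer(body: bytes) -> bytes:
    """Append the SHA-1 of `body`, the way a pack ends with its own checksum."""
    return body + hashlib.sha1(body).digest()


def trailer_ok(pack: bytes) -> bool:
    """True if the last 20 bytes are the SHA-1 of everything before them."""
    body, trailer = pack[:-TRAILER_LEN], pack[-TRAILER_LEN:]
    return hashlib.sha1(body).digest() == trailer


def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of `data` with a single bit inverted."""
    out = bytearray(data)
    out[index] ^= 1 << bit
    return bytes(out)


# Stand-in payload, NOT a real pack: only the trailer logic is illustrated.
good_body = b"PACK" + bytes(range(64))
bad_body = flip_bit(good_body, 10, 1)

good_pack = with_trailer(good_body)
rewritten = with_trailer(bad_body)            # checksum recomputed over bad data
copied = bad_body + good_pack[-TRAILER_LEN:]  # bad data, stale checksum

print(trailer_ok(good_pack))   # True
print(trailer_ok(rewritten))   # True: the trailer vouches for corrupt bytes
print(trailer_ok(copied))      # False: a stale trailer exposes the flip
```

The middle case is the one that matters here: once corrupt bytes are copied into a new pack and a fresh trailer is written, the pack-level SHA-1 no longer says anything about the damaged object.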
So the _real_ difference is literally just the one byte at offset
0151000 (decimal 53760) which in the fixed pack is 0x96, and in the
corrupt pack it is 0x94. That's a single-bit difference (bit #1 has been
cleared).

The other difference is at 0410224 (decimal 135316), where the sequence
"af2d b23b" should have been "54b6 b23d", and that's just the final zlib
CRC of that one object (the corrupted object is fairly large: it's
127905 bytes uncompressed, and I think it was ~100kB compressed too,
because most of it is the data for the image inside of it).

The way I know that: doing "git-show-index" (and sorting by object
offset) on the index gives you:

    ...
    13830 2849bd2bd8a76bbca37df2a4c8e8b990811d01a7
    135320 42abdeecbf1b49c8354ca9639bd19c378be6d7d4
    ...

which means that the one-bit error was in the middle of that (known
corrupt) object 2849bd2bd8, and the four-byte difference is exactly the
four last bytes of that same object - which is obviously where you'd
expect to find the crc32 for the compressed data.

[ NOTE! The only reason the crc32 has changed is exactly the fact that I
  forced a full repack of the corrupt pack. In the _original_ pack the
  object 2849bd2bd8a76bbca37df2a4c8e8b990811d01a7 was at:

    ...
    13369 2849bd2bd8a76bbca37df2a4c8e8b990811d01a7
    134859 42abdeecbf1b49c8354ca9639bd19c378be6d7d4
    ...

  and thus the crc32 in the original is the four bytes at 134855, and
  indeed in the original corrupt pack we see:

    ...
    0407260 d236 3a03 4f63 7c8e 110c 86e2 db26 2575
    0407300 c450 8b6f dbbf 0154 b6b2 3dfb b701 2849
                             ^^ ^^^^ ^^
    0407320 bd2b d8a7 6bbc a37d f2a4 c8e8 b990 811d
    ...

  so note how that contains the _original_ CRC, because in the original
  corrupt pack, the CRC was the one that had been computed with the
  original (uncorrupted) zlib data, but obviously didn't actually
  _match_ the actual corrupted data itself - which is why we got a zlib
  error on unpacking it ]

So I think this is pretty damn conclusive.
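The arithmetic in this analysis is easy to re-check. The Python sketch below redoes the octal-to-decimal conversions and the bit comparison using only numbers quoted above, then shows on a made-up buffer how the checksum at the end of a zlib stream catches a one-bit flip. Two assumptions are worth flagging: compression level 0 is used so that the flipped bit lands in literal data and only the trailer can detect it, and the trailer of a zlib stream is, strictly speaking, an Adler-32 rather than a CRC-32.

```python
import zlib

# od prints offsets in octal.
bit_error_at = int("0151000", 8)
checksum_at = int("0410224", 8)
next_object_at = 135320              # from the git-show-index listing
print(bit_error_at, checksum_at)     # 53760 135316
print(next_object_at - checksum_at)  # 4: the object's last four bytes

# 0x94 vs 0x96: which bits differ?
diff = 0x94 ^ 0x96
print([bit for bit in range(8) if (diff >> bit) & 1])  # [1]

# Made-up payload standing in for the object data.
payload = bytes(range(256)) * 4
stream = zlib.compress(payload, 0)   # level 0: payload is stored verbatim

# The last four bytes of a zlib stream hold the big-endian Adler-32 of the
# uncompressed data.
assert int.from_bytes(stream[-4:], "big") == zlib.adler32(payload)

# Flip bit #1 of one payload byte inside the stream; the trailer is untouched.
start = stream.index(payload)
damaged = bytearray(stream)
damaged[start + 100] ^= 0x02
damaged = bytes(damaged)


def inflates_cleanly(data: bytes) -> bool:
    """True if zlib can inflate `data` and its trailing checksum matches."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        return False


print(inflates_cleanly(stream))    # True
print(inflates_cleanly(damaged))   # False
```

This mirrors the original corrupt pack: the stored checksum still describes the uncorrupted data, so inflating the damaged stream fails.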
Sergio had a single-bit error in a pack-file, and that error got
propagated because "git repack" didn't notice, and because he used
unison to synchronize between two different machines, and that obviously
happily transferred the corruption.

Now, that makes me feel happy on one level, because it's almost
certainly a hardware problem - subtle memory corruption, or disk
corruption that happened when either reading or writing the image.
Sergio may not be that happy about it, of course.

It _could_ be something else (hey, it could be the kernel or git itself
that has a wild pointer and corrupted a single bit), but I'd say that
the hardware is the primary suspect.

Now, what to take away from this:

 - git _did_ find the error, but it would have been easier for everybody
   if it had noticed it a bit earlier.

   Ergo:

    - we should make the repacking verify the object even when it just
      blindly copies it, so that we do _not_ end up in the situation
      that a pack-file has a "valid" SHA1, even though the contents are
      actually corrupt.

      This should be easy enough, since zlib already has the crc
      (actually, we should probably do a full unpack of that object,
      even if it's expensive: if it's a delta, we'd need that to verify
      that the SHA1 of the base object is valid).

 - git pack-files are extremely dense (we knew that already, and mostly
   consider it to be a really good thing), and a single-bit error can be
   absolutely devastating.

   For important data, always keep a copy on another machine (that's
   obviously true regardless of whether you use git or not ;), and
   _always_ create the copy with git itself, or at least verify it with
   "git-fsck-objects --full" before you overwrite the previous version.

   The point being, if you have even a single-bit error (and for all we
   know, it could have been introduced by the transfer itself, and then
   been re-introduced in the original place when transferring the data
   back), you absolutely do _not_ want to transfer that to your backup
   location too.

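To make the first takeaway concrete: verifying an object while copying it means inflating it and recomputing its name, rather than trusting the bytes. The Python sketch below illustrates the idea only; it is not git's repack code. It assumes the caller has already parsed the pack entry and can supply the object type, the expected name, and the deflated content (delta resolution is left out), and `verify_before_copy` is a hypothetical helper.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Name an object the way git does: SHA-1 of '<type> <size>' NUL content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify_before_copy(expected_id: str, obj_type: str, deflated: bytes) -> bool:
    """Inflate the data and check that it still hashes to the expected name."""
    try:
        content = zlib.decompress(deflated)
    except zlib.error:
        return False  # damaged stream, or zlib checksum mismatch
    return git_object_id(obj_type, content) == expected_id


content = b"hello\n"
oid = git_object_id("blob", content)
deflated = zlib.compress(content)

# Damage the zlib trailer by one bit: the inflate step itself fails.
bad_stream = deflated[:-1] + bytes([deflated[-1] ^ 0x02])

# A valid zlib stream with the wrong content: only the SHA-1 check catches it.
wrong_content = zlib.compress(b"hellp\n")

print(verify_before_copy(oid, "blob", deflated))        # True
print(verify_before_copy(oid, "blob", bad_stream))      # False
print(verify_before_copy(oid, "blob", wrong_content))   # False
```

The last case is why the zlib check alone is not quite enough: data that was already wrong before it was compressed inflates without complaint, and only the recomputed SHA-1 notices. In current git the whole-repository check mentioned above is spelled "git fsck --full".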
Finally, this also points out that the corrupted packs _can_ be fixed,
but I think Sergio was a bit lucky (to offset all the bad luck). Sergio
still had access to the original file that had had its object corrupted.
And it took a fair amount of work, and some git hacking by somebody who
really understood git (Junio).

Maybe we'll end up having some of that effort being useful and checked
in, and we'll eventually have more infrastructure for fixing these
things, but I suspect that in most cases, even a _single_ bit of
corruption will generally result in so much havoc that nobody should
depend on that.

It's a lot better to have backups.

		Linus