Thanks a ton for this, I'll incorporate all of this. On Fri, Sep 5, 2008 at 11:33 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote: > Scott Chacon <schacon@xxxxxxxxx> wrote: >> On Fri, Sep 5, 2008 at 12:41 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote: >> > "Scott Chacon" <schacon@xxxxxxxxx> writes: >> > >> >> Also, the last section of the book is on some of the plumbing - mostly >> >> stuff I've found difficult to pick up with the existing documentation >> >> while re-implementing stuff in Ruby. I would really appreciate it if >> >> someone could proofread some of these chapters for errors: >> >> >> >> http://book.git-scm.com/7_the_packfile.html > > OK, time for me to throw in comments. ;-) > > I do like this book, its organized and concise. Thanks for doing it. > > > http://book.git-scm.com/7_how_git_stores_objects.html: > > The loose object formatting of > > header = "#{type} #{size}#body" > store = header + content > > I can't read Ruby so I'm not sure what the header value computes > out to here. #body should be a \0. I'm also not sure that the > prior line setting size = content.length.to_s is very clear for > the non-Ruby people to understand how a size is formatted. > Sorry, the markdown thingy is translating all the '\0's to '#body' for some freaking reason unless I write it as '\\0'. I'll fix this - it's difficult for me to find these sometimes. As for the rest of the ruby stuff, I think I'll add some comments. > If the code shown here is the Ruby implementation I'm a little > concerned about it writing directly into the loose object. If the > write is partial then you have a partial object which is at the > right name, but is unusable. That can give you corruption that > is difficult to track down and fix. C Git and JGit both write > to temporary files then atomically move the temporary file into > position under its proper name only after it has been fully written. That is a good idea - I don't do it that way and I certainly will change the implementation to do so and modify these docs to reflect that advice. > "When objects are written to disk, it is often in the loose format, > since that format is less expensive to access." > > I'm not sure that statement is true. Access from packs tends > to scream compared to access from loose objects. The overheads > of opening and closing the file descriptors, even on Linux, is > what kills performance for data access. However Git writes to > loose objects first and packs later for _safety_ not efficiency. > Although it is a lot more efficient to write a 2 KB loose object > and avoid rewriting a 50 MB pack, but its also less likely to fail > and make you lose your work. Thanks for the clarification. I write to loose objects first largely because it's so much easier to do. But also because I don't mmap objects, so packfile access is not faster for implementations that can't do that very well. Also, I had originally meant "less expensive to write", but I can see that is not clear. > http://book.git-scm.com/7_the_git_index.html: > > I wouldn't say that the index stores permissions. More like it > stores the "class" or "type" of the thing located at that path. > There are 4 major classes: > > - regular file > - executable file > - symbolic link > - git submodule > > The 5th class is the subtree, but only appears in trees and not > in the index since the index file is actually flat. Interesting. This documentation is actually from the User Manual - I'll update this chapter first and if it looks better, I'll submit a patch to the UM, too. > http://book.git-scm.com/7_the_packfile.html: > > You should probably point out that the .idx file uses network byte > order for the numeric fields like the version number and the file > offsets. Will do. > > I'd also point out that the offsets in index v1 are unsigned and > from the start of the pack file. The offsets in index v2 are > also unsigned, but the 1<<31 is tested in the 32 bit offset to > see if a 64 bit offset is used. The algorithm there is: > > if offset32 & 1<<31: > offset = ofs64_table[offset32 & ~(1<<31)] > else > offset = offset32 > > Its also rather unclear how the fan out table can be used to limit > the binary search. What you are missing is describing that fanout[X] > holds the number of objects whose first byte of their SHA-1 is <= X. > Hence fanout[0] has the number of objects whose SHA-1 starts with > "00" and fanout[0x15] has the number of objects whose SHA-1 starts > with "15", "14", "13", ..., "00". Thus fanout[0xff] has the total > number of objects in the pack. > > In the pack file section I'd also point out the version and entry > count are unsigned network byte order. This is not clear from the > Ruby code, although one can guess at it if one knows the git.git > code very very well (like I do). > > "After that, you get a series of packed objects, in order of thier SHAs" > > Aside from s/thier/their/ this is not a correct statement _AT ALL_. > > The ordering of objects in the packfile is very carefully planned > by the packer to maximize data locality from most recent -> least > recent information, making the most recent revisions of a project > the fastest to access. This has _NOTHING_ to do with their SHA-1 > names. > > Technically a pack may store objects in any random order. Heck, > you can wire up an RNG to the packer to always produce a different > ordering each time you pack. Practically an implementation shouldn't > be that stupid and should instead try to order objects by recency, > like git.git and JGit both do. > > "At the end of the packfile is a 20-byte SHA1 sum of all the shas > (in sorted order) in that packfile." > > Also incorrect. The 20-byte checksum at the end of the pack file > is a checksum of all bytes preceeding the checksum itself. We use > it as an end-to-end data integrity check, especially on the network > transport to verify that every bit sent by the one side is received > correctly on the other side. > I'm an idiot. I say this because I actually implemented a bunch of this stuff (in Ruby) and ran into most of these issues when trying to implement it. So I knew these things not 3 weeks ago, but I still wrote it this way. Dur. Thanks for the corrections, I'll update everything accordingly. > BTW, can I just say, I love the graphics in this book. They are > quite well done. Very worthwhile. Thanks. > > > http://book.git-scm.com/7_transfer_protocols.html: > > You might as well explain that the stream returned by upload-pack > uses the same 4 byte line length framing to form "packets", with > the 5th byte (really first byte of the payload) indicating the > "stream": > > - stream 1 ('\001') is the PACK data > - stream 2 ('\002') is progress data/information > - stream 3 ('\003') is the OH S**T we are aborting, died, dead > > You may also want to explain that the way you know the end of the > pack is to read the header, get the entry count, and then read that > many objects from the stream, and then verify the pack checksum. > > -- > Shawn. > Thanks again for all the time it must have taken to review all of this - I'll make sure it gets into the book, and where appropriate, back into the UM or other internal git docs. Scott -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html