Hey,

On Sat, Jun 6, 2009 at 2:38 PM, Jakub Narebski<jnareb@xxxxxxxxx> wrote:
> There are beginnings of description of git pack protocol in section
> "Transfer Protocols"[1][2] of chapter "7. Internals and Plumbing"
> of "Git Community Book" (http://book.git-scm.com).
>
> [1] http://book.git-scm.com/7_transfer_protocols.html
> [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
>
> This is second round of my comments about this item. I'd like to have
> some more comments about git pack protocol before trying to come up
> with formulation which is good enough to send as patch against source
> of mentioned section.
>

I can certainly fix up this chapter with these comments - I understand
the protocol a bit better now than I did when I originally wrote this.

In addition to that, I started taking a shot at putting together an
RFC-formatted documentation of this protocol, as was requested. I may
have _way_ missed the mark on what you were looking for originally.
It's hard to say, not having read a lot of RFC documents - I probably
ended up writing in a more bookish format rather than a technical spec,
but whatever - maybe you'll find it helpful or can fix it up to be more
what you were expecting. I'm not done with it - some of it is still
basically unformatted comments from this previous thread, but at least
it's laid out roughly how I thought it might be useful and I have
fleshed out a lot of it.

You can find the RFC text output document here:

http://git-scm.com/gitserver.txt

And the xml doc I generated it from here:

http://github.com/schacon/gitserver-rfc

Perhaps if we're going to spend time getting this all correct, we
should get a standalone technical doc all agreed upon, then I can
relatively easily extract what's needed into that chapter of the
Community book.

Thoughts?

Scott

> The relevant parts of above source are quoted as if they were email
> I am replying to.
>
> I have CC-ed everybody who participated in this subthread (originally
> named "Re: Request for detailed documentation of git pack protocol").
>
> ....
>> ### Fetching Data with Upload Pack ###
>>
>> For the smarter protocols, fetching objects is much more efficient. A
>> socket is opened, either over ssh or over port 9418 (in the case of
>> the git:// protocol), and the git-fetch-pack(1) command on the client
>> begins communicating with a forked git-upload-pack(1) process on the
>> server.
>>
>> Then the server will tell the client which SHAs it has for each ref,
>> and the client figures out what it needs and responds with a list of
>> SHAs it wants and already has.
>
> It would be probably more clear here to state explicitly that there
> are two lists, i.e. "a list of SHAs it wants and a list of SHAs it
> already has".
>
>>
>> At this point, the server will generate a packfile with all the
>> objects that the client needs and begin streaming it down to the
>> client.
>
> This is a bit of an oversimplification. In the simplest case, like a
> client using git-clone to get all objects, it is true that the server
> can generate a packfile and stream it to the client after the client
> sends its list of wanted SHAs. In the more complicated case, however,
> there can be a series of exchanges between client and server, with the
> client sending sets of commits it has, and the server responding
> whether it is enough (or perhaps this line of commits is
> uninteresting)... and only then arriving at the list of objects to
> send in a packfile.
>
>>
>> Let's look at an example.
>
> I think that before the example we should have a short description
> (sketch) of the whole exchange; for example the one taken from
> 'Documentation/technical/pack-protocol.txt':
>
>   upload-pack (S) | fetch/clone-pack (C) protocol:
>
>       # Tell the puller what commits we have and what their names are
>       S: SHA1 name
>       S: ...
>       S: SHA1 name
>       S: # flush -- it's your turn
>       # Tell the pusher what commits we want, and what we have
>       C: want name
>       C: ..
>       C: want name
>       C: have SHA1
>       C: have SHA1
>       C: ...
>       C: # flush -- occasionally ask "had enough?"
>       S: NAK
>       C: have SHA1
>       C: ...
>       C: have SHA1
>       S: ACK
>       C: done
>       S: XXXXXXX -- packfile contents.
>
>
>>
>> The client connects and sends the request header. The clone command
>>
>>     $ git clone git://myserver.com/project.git
>>
>> produces the following request:
>>
>>     0033git-upload-pack /project.git\\000host=myserver.com\\000
>
> Although fetching via the SSH protocol is, I guess, much rarer than
> fetching via the anonymous unauthenticated git:// protocol, it _might_
> be a good idea to say here how fetching via SSH differs from the above
> sequence: instead of opening a TCP connection to port 9418, sending
> the above packet, and later reading from and writing to the socket,
> "git clone ssh://myserver.com/srv/git/project.git" calls
>
>     ssh myserver.com git-upload-pack /srv/git/project.git
>
> and later reads from the standard output of the above command, and
> writes to the standard input of the above command.
>
> The rest of the exchange is _identical_ for git:// and for ssh:// (and
> I guess also for the file:// pseudoprotocol).
>
>>
>> The first four bytes contain the hex length of the line (including 4
>> byte line length and trailing newline if present). Following are the
>> command and arguments. This is followed by a null byte and then the
>> host information. The request is terminated by a null byte.
>
> I think it would be better to describe the packet (chunk) format,
> called pkt-line in git, separately from describing the contents of the
> above packet; either first pkt-line then command, or first command
> then pkt-line. Otherwise we would be left with describing the pkt-line
> format many times, as is done in the current version of this chapter.
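
As an aside, the framing described in the quoted text above (four hex digits giving the length of the line, header included) is small enough to sketch in Python. This is only an illustration: the helper name `pkt_line` is made up here and is not part of git. The request string is the one from the clone example; counting bytes, that payload is 47 bytes long, so with the 4-byte header the prefix comes out as 0033.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length (header included), then the payload."""
    total = len(payload) + 4            # the 4-byte length header counts itself
    if total > 0xFFFF:
        raise ValueError("payload too large for a 4-digit hex length")
    return b"%04x" % total + payload


# The git:// request from the clone example: command, NUL, host info, NUL.
request = pkt_line(b"git-upload-pack /project.git\0host=myserver.com\0")
print(request)

# A later client line from the same exchange.
print(pkt_line(b"done\n"))
```

The same helper reproduces the other length prefixes quoted in this thread, for example `0032` for a bare `want <40-hex SHA>\n` line.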
>
>
> In git, clients communicate with the server using a packetized stream,
> where each line (packet, chunk) is preceded by its length (including
> the header) as a 4-byte hex number. A length of 'zero', i.e. packet
> "0000", has a special meaning: it means end of stream / flush
> connection. The "# flush ..." in the description of the client--server
> exchange above is done using exactly the "0000" packet.
>
> Footnote: this format somewhat resembles the 'chunked' transfer
> encoding used in HTTP[1], although there are differences.
> [1] http://en.wikipedia.org/wiki/Chunked_transfer_encoding
>
>>
>> The request is processed and turned into a call to git-upload-pack:
>>
>>     $ git-upload-pack /path/to/repos/project.git
>
> This is an alternate place where we could tell about fetching via
> ssh://
>
> We probably should say where the /path/to/repos that /project.git is
> prefixed with comes from; it is from the --base-path=/path/to/repos
> argument to git-daemon (a sort of "GIT root").
>
> BTW. (this is just a very minor nit) shouldn't we use an FHS compliant
> path, i.e. "/srv/git" instead of "/path/to/repos" (and follow RFC in
> using "example.com" in place of "myserver.com")?
>
>>
>> This immediately returns information of the repo:
>>
>>     008874730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag\\n
>>     003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\\n
>>     003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist\\n
>>     003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local\\n
>>     003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\\n
>>     0000
>
> I have added explicit LF terminators in the form of "\\n" (which would
> render as "\n"), mainly because the "0000" flush packet _doesn't_ have
> one. Also I have added "include-tag", as modern git installations
> provide this capability.
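
For what it's worth, here is a rough Python sketch of how a client might read the advertisement quoted above: pkt-lines up to the "0000" flush, with the capability list taken from behind the NUL on the first line. The function names (`read_pkt_lines`, `parse_advertisement`, `frame`) are invented for this example and the sample data is shortened; it is not how git itself is implemented.

```python
import io


def read_pkt_lines(stream):
    """Yield pkt-line payloads until a flush packet ("0000") or end of stream."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return                      # stream ended
        length = int(header, 16)        # length counts the 4 header bytes too
        if length == 0:
            return                      # "0000" flush packet
        if length < 4:
            raise ValueError("invalid pkt-line length")
        yield stream.read(length - 4)


def parse_advertisement(stream):
    """Return ({ref name: sha}, [capabilities]) from a ref advertisement."""
    refs, caps = {}, []
    for index, payload in enumerate(read_pkt_lines(stream)):
        line = payload.rstrip(b"\n")
        if index == 0 and b"\0" in line:
            # Capabilities ride behind a NUL on the first line only.
            line, cap_blob = line.split(b"\0", 1)
            caps = cap_blob.decode().split()
        sha, name = line.split(b" ", 1)
        refs[name.decode()] = sha.decode()
    return refs, caps


def frame(payload):
    """Hypothetical helper: prefix a payload with its pkt-line length."""
    return b"%04x" % (len(payload) + 4) + payload


# Shortened sample built from the SHAs in the example above.
sample = (
    frame(b"74730d410fcb6603ace96f1dc55ea6196122532d HEAD\0multi_ack thin-pack ofs-delta\n")
    + frame(b"7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n")
    + b"0000"
)
refs, caps = parse_advertisement(io.BytesIO(sample))
print(refs["refs/heads/debug"])
print(caps)
```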
>
> Here is a dilemma: currently example output is provided almost exactly
> as-is, only indented and with some quoting/escaping (\\000 or \\0 for
> NUL character, \\n for LF, later \\001 and \\002 for 0x01 and 0x02
> bytes). To know if a given example output is what the client sends or
> what the server outputs, you have to read the narrative. An alternate
> solution would be to use "C: " and "S: " prefixing (perhaps with some
> extra format to make it more clear that it is not part of the data),
> used in the pack-protocol.txt technical documentation, and proposed
> for describing network protocols by some RFC (I don't remember which,
> unfortunately). Which one to choose?
>
>
> We would want, at some point, to describe that the first line of the
> first response from the server contains, 'stuffed' behind "\0" (NUL),
> a space-separated list of capabilities our server supports. Those
> capabilities would have to be described somewhere: as a sidebar,
> or in a separate subsection, or in an appendix.
>
> Below there is (for completeness) a list of git-upload-pack
> capabilities, with a short description of each:
>
>  * multi_ack (for historical reasons not multi-ack)
>
>    It allows the server to return "ACK $SHA1 continue" as soon as it
>    finds a commit that it can use as a common base, between the
>    client's wants and the client's have set.
>
>    By sending this early, the server can potentially head off the
>    client from walking any further down that particular branch of the
>    client's repository history.
>
>    See the thread for more details (posts by Shawn O. Pearce and by
>    Junio C Hamano).
>
>  * thin-pack
>
>    Server can send thin packs, i.e. packs which do not contain base
>    elements for some delta chains, if those base elements are
>    available on the client side. Client has thin-pack capability when
>    it understands how to "thicken" them by adding the required delta
>    bases, making those packfiles independent.
>
>    Of course it doesn't make sense for client to use (request) this
>    capability for git-clone...
>    But if the client does request it (and I think modern clients
>    actually do request it, even in the initial clone case) the server
>    won't produce a thin pack. Why? There is no common base, so there
>    is no uninteresting set to omit from the pack. :-)
>
>  * side-band
>  * side-band-64k
>
>    This means that the server can send, and the client understands,
>    multiplexed (muxed) progress reports and error info interleaved
>    with the packfile itself.
>
>    These two options are mutually exclusive. A client should ask for
>    only one of them, and a modern client always favors side-band-64k.
>    If the client asks for both, the server uses side-band-64k.
>
>    Older side-band allows only up to 1000 bytes per packet.
>
>  * ofs-delta
>
>    Server can send, and client understands, PACKv2 with deltas
>    referring to their base by position in the pack rather than by
>    SHA-1. Both can send/read OBJ_OFS_DELTA, aka type 6 in a pack file.
>
>  * shallow
>
>    Server can send shallow clone (git clone --depth ...).
>
>  * no-progress
>
>    Client should use it if it was started with "git clone -q" or
>    something, and doesn't want that sideband 2. We still want
>    sideband 1 with actual data (packfile), and sideband 3 with error
>    messages.
>
>  * include-tag
>
>    If we pack an object to the client, and a tag points exactly at
>    that object, we pack the tag too. In general this allows a client
>    to get all new tags when it fetches a branch, in a single network
>    connection, instead of two (separate connection for tags).
>
>    This capability is not to be used when the client was called with
>    '--no-tags'.
>
>>
>> Each line starts with a four byte line length declaration in hex. The
>> section is terminated by a line length declaration of 0000.
>
> This repetition would not be necessary if the pkt-line format had its
> own description somewhere before. We would probably still want to
> remind the reader that the "0000" line length declaration means
> 'flush'.
>
>>
>> This is sent back to the client verbatim.
>
> Hmmm... "sent back ... verbatim"?
> I wonder what you wanted to say here...
>
>> The client responds with another request:
>>
>>     0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\\n
>>     0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\\n
>>     0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\\n
>>     0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\\n
>>     0032want 74730d410fcb6603ace96f1dc55ea6196122532d\\n
>>     0000
>>     0009done\\n
>
> Here again I added an explicit LF terminator, and split off the "0000"
> flush packet onto a separate line, to make this request (well, two
> requests) more clear.
>
> The first line of this request contains the capabilities the client
> wants to use. It should be some subset of the capabilities the server
> supports.
>
>>
>> The is sent to the open git-upload-pack process which then streams out
>> the final response:
>
> "_The_ is sent"?
>
> I would remove quotes around lines of server response below, but would
> leave explicit \n for LF, and \\001 and \\002 for bytes 0x01 and 0x02
> denoting channel.
>
>>
>>     "0008NAK\n"
>
> This NAK means that the server did not find a [closed] set of common
> ancestors. It is the response to the "0000" flush line ("had enough?"
> line) from the client. As the example is about git-clone, and the
> client doesn't _have_ any commits to show the server as candidates for
> common ancestors (calculation), it replies with "done" to get the
> pack.
>
>>     "0023\\002Counting objects: 2797, done.\n"
>
> This is a bit of an untypical example, as for larger repositories like
> the Linux kernel or even the git repository, usually you would have
> many more objects, and actually object enumeration would take more
> time.
> You would see many
>
>     "0020\\002Counting objects: 10662 \r"
>     "0020\\002Counting objects: 22318 \r"
>     "0020\\002Counting objects: 29506 \r"
>
> packets before
>
>     "0023\\002Counting objects: 65058, done.\n"
>
>>     "002b\\002Compressing objects: 0% (1/1177) \r"
>>     "002c\\002Compressing objects: 1% (12/1177) \r"
>>     "002c\\002Compressing objects: 2% (24/1177) \r"
>>     "002c\\002Compressing objects: 3% (36/1177) \r"
>>     "002c\\002Compressing objects: 4% (48/1177) \r"
>>     "002c\\002Compressing objects: 5% (59/1177) \r"
>>     "002c\\002Compressing objects: 6% (71/1177) \r"
>>     "0053\\002Compressing objects: 7% (83/1177) \rCompressing objects: 8% (95/1177) \r"
>>     ...
>>     "005b\\002Compressing objects: 100% (1177/1177) \rCompressing objects: 100% (1177/1177), done.\n"
>
> Sidenote: the reason why there is sometimes more than one line sent in
> a single packet / single pkt-line is buffering between git-pack-objects,
> which produces those messages to a pipe, and git-upload-pack, which
> reads them and sends them to the client. If pack-objects can write two
> messages into the pipe buffer before upload-pack is woken to read them
> out, upload-pack might find two (or more) messages ready to read
> without blocking. These get bundled into a single packet, because, why
> not, it's easier to code it that way.
>
> Here or a little later we probably should explain (even though it is
> fairly obvious) that the final response from the server is (here) in
> pkt-line with sideband format, where the first byte of data denotes
> the channel (stream) number: 1 for data, 2 for progress info, 3 for
> fatal errors.
>
>>     "2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225\\017x\\234\\235\\216K\n\\302"...
>>     "2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...
>
> Here I think it would be enough to show only the fragment which is
> the packfile signature...
>
>>     ...
>>     "0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"
>>     ...
>>     "<\\276\\255L\\273s\\005\\001w0006\\001[0000"
>
> This line, I think, is broken in the wrong place. It is the tail
> end of some packet (each packet begins with a 4-character, 0-padded
> length of the chunk as a hex number; "<\\276\\255L" does not match
> 4HEXDIG), followed by the "0000" 'flush' packet (here it signals end
> of stream).
>
>>
>> See the Packfile chapter previously for the actual format of the
>> packfile data in the response.
>>
>>
> ....
> --
> Jakub Narebski
> Poland
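
P.S. To make the sideband description quoted above a bit more concrete (first byte of each packet is the channel: 1 for data, 2 for progress, 3 for fatal errors), here is a rough Python sketch of a demultiplexer. The name `demux_sideband` and the sample bytes are made up for illustration, and the sketch assumes the leading "0008NAK\n" line has already been consumed by the caller; it is not git's actual implementation.

```python
import io


def demux_sideband(stream):
    """Split a sideband pkt-line stream into (pack bytes, progress messages)."""
    pack = bytearray()
    progress = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break                       # stream ended
        length = int(header, 16)        # length counts the 4 header bytes too
        if length == 0:
            break                       # "0000" flush: end of stream
        payload = stream.read(length - 4)
        band, data = payload[0], payload[1:]
        if band == 1:
            pack += data                # packfile data
        elif band == 2:
            progress.append(data.decode(errors="replace"))
        elif band == 3:
            raise RuntimeError(data.decode(errors="replace"))
        else:
            raise ValueError("unknown sideband channel %d" % band)
    return bytes(pack), progress


def frame(payload):
    """Hypothetical helper: prefix a payload with its pkt-line length."""
    return b"%04x" % (len(payload) + 4) + payload


# Tiny made-up stream: one progress packet, the start of a pack, then flush.
sample = (
    frame(b"\x02Counting objects: 2797, done.\n")
    + frame(b"\x01PACK\x00\x00\x00\x02")
    + b"0000"
)
pack, progress = demux_sideband(io.BytesIO(sample))
print(progress)
print(pack[:4])
```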