On Thu, Mar 03, 2016 at 02:57:08PM -0800, Junio C Hamano wrote: > Teach a new option "--clone-bundle" to "git index-pack" to create a > split bundle file that uses an existing packfile as its data part. > > The expected "typical" preparation for helping initial clone would > start by preparing a packfile that contains most of the history and > add another packfile that contains the remainder (e.g. the objects > that are only reachable from reflog entries). The first pack can > then be used as the data part of a split bundle and these two files > can be served as static files to bootstrap the clients without > incurring any more CPU cycles to the server side. Just FYI, because I know our setup is anything but typical, but this probably _won't_ be what we do at GitHub. We try to keep everything in a single packfile that contains many "islands", which are not allowed to delta between each other[1]. Individual forks of a repo get their own islands, unreachable objects are in another one, and so forth. So we'd probably never want to provide a whole packfile, as it has way too much data. But we _would_ be able to take a bitmap over the set of packed objects such that if you blindly spew out every object with its bit set, the resulting pack has the desired objects. That's not as turn-key for serving with a dumb http server, obviously. And I don't expect you to write that part. I just wanted to let you know where I foresee having to take this for GitHub to get it deployed. [1] It's a little more complicated than "don't delta with each other". Each object has its own bitmap of which islands it is in, and the rule is that you can use a delta base iff it is in a subset of your own islands. The point is that a clone of a particular island should never have to throw away on-disk deltas. Cleaning up and sharing that code upstream has been on my todo list for about 2 years, but somehow there's always new code to be written. :-/ I'm happy to share if people want to look at the messy state. > Among the objects in the packfile, the ones that are not referenced > by no other objects are identified and recorded as the "references" s/no/any/ (double negative) > in the resulting bundle. As the packfile does not record any ref > information, however, the names of the "references" recorded in the > bundle need to be synthesized; we arbitrarily choose to record the > object whose name is $SHA1 as refs/objects/$SHA1. Makes sense. In an "island" world I'd want to write one bitmap per island, and then reconstruct that list on the fly by referencing the island bitmap with the sha1 names in the pack revidx. > +static void write_bundle_file(const char *filename, unsigned char *sha1, > + const char *packname) > +{ > + int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0600); > + struct strbuf buf = STRBUF_INIT; > + struct stat st; > + int i; > + We should probably bail here if "fd < 0", though I guess technically we will notice in write_in_full(). :) > + if (stat(packname, &st)) > + die(_("cannot stat %s"), packname); > + > + strbuf_addstr(&buf, "# v3 git bundle\n"); > + strbuf_addf(&buf, "size: %lu\n", (unsigned long) st.st_size); > + strbuf_addf(&buf, "sha1: %s\n", sha1_to_hex(sha1)); > + strbuf_addf(&buf, "data: %s\n\n", packname); This "sha1" field is the sha1 of the packfile, right? If so, I think it's redundant with the pack trailer found in the pack at "packname". I'd prefer not to include that here, because it makes it harder to generate these v3 bundle headers dynamically (you have to actually checksum the pack subset to come up with sha1 up front, as opposed to checksumming as you stream the pack out). It _could_ be used as an http etag, but I think it would make more sense to push implementations toward using a unique value for the "packname" (i.e., so that fetching "https://example.com/foo/1234abcd.pack" is basically immutable). And I think that comes basically for free, because the packname has that same hash in it already. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html