Re: [PATCH] index-pack: --clone-bundle option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 03, 2016 at 02:57:08PM -0800, Junio C Hamano wrote:

> Teach a new option "--clone-bundle" to "git index-pack" to create a
> split bundle file that uses an existing packfile as its data part.
> 
> The expected "typical" preparation for helping initial clone would
> start by preparing a packfile that contains most of the history and
> add another packfile that contains the remainder (e.g. the objects
> that are only reachable from reflog entries).  The first pack can
> then be used as the data part of a split bundle and these two files
> can be served as static files to bootstrap the clients without
> incurring any more CPU cycles to the server side.

Just FYI, because I know our setup is anything but typical, but this
probably _won't_ be what we do at GitHub. We try to keep everything in a
single packfile that contains many "islands", which are not allowed to
delta between each other[1]. Individual forks of a repo get their own
islands, unreachable objects are in another one, and so forth. So we'd
probably never want to provide a whole packfile, as it has way too much
data. But we _would_ be able to take a bitmap over the set of packed
objects such that if you blindly spew out every object with its bit set,
the resulting pack has the desired objects.

That's not as turn-key for serving with a dumb http server, obviously.
And I don't expect you to write that part. I just wanted to let you know
where I foresee having to take this for GitHub to get it deployed.

[1] It's a little more complicated than "don't delta with each other".
    Each object has its own bitmap of which islands it is in, and the
    rule is that you can use a delta base iff it is in a subset of your
    own islands. The point is that a clone of a particular island should
    never have to throw away on-disk deltas.

    Cleaning up and sharing that code upstream has been on my todo list
    for about 2 years, but somehow there's always new code to be
    written. :-/ I'm happy to share if people want to look at the messy
    state.

> Among the objects in the packfile, the ones that are not referenced
> by no other objects are identified and recorded as the "references"

s/no/any/ (double negative)

> in the resulting bundle.  As the packfile does not record any ref
> information, however, the names of the "references" recorded in the
> bundle need to be synthesized; we arbitrarily choose to record the
> object whose name is $SHA1 as refs/objects/$SHA1.

Makes sense. In an "island" world I'd want to write one bitmap per
island, and then reconstruct that list on the fly by referencing the
island bitmap with the sha1 names in the pack revidx.

> +static void write_bundle_file(const char *filename, unsigned char *sha1,
> +			      const char *packname)
> +{
> +	int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +	struct strbuf buf = STRBUF_INIT;
> +	struct stat st;
> +	int i;
> +

We should probably bail here if "fd < 0", though I guess technically we
will notice in write_in_full(). :)

> +	if (stat(packname, &st))
> +		die(_("cannot stat %s"), packname);
> +
> +	strbuf_addstr(&buf, "# v3 git bundle\n");
> +	strbuf_addf(&buf, "size: %lu\n", (unsigned long) st.st_size);
> +	strbuf_addf(&buf, "sha1: %s\n", sha1_to_hex(sha1));
> +	strbuf_addf(&buf, "data: %s\n\n", packname);

This "sha1" field is the sha1 of the packfile, right? If so, I think
it's redundant with the pack trailer found in the pack at "packname".
I'd prefer not to include that here, because it makes it harder to
generate these v3 bundle headers dynamically (you have to actually
checksum the pack subset to come up with sha1 up front, as opposed to
checksumming as you stream the pack out).

It _could_ be used as an http etag, but I think it would make more sense
to push implementations toward using a unique value for the "packname"
(i.e., so that fetching "https://example.com/foo/1234abcd.pack"; is
basically immutable). And I think that comes basically for free, because
the packname has that same hash in it already.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]