On Mon, Apr 6, 2009 at 11:14 AM, Nicolas Pitre <nico@xxxxxxx> wrote: > On Mon, 6 Apr 2009, Jon Smirl wrote: > >> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@xxxxxxx> wrote: >> > On Mon, 6 Apr 2009, Jon Smirl wrote: >> > >> >> First thing an initial clone does is copy all of the pack files from >> >> the server to the client without even looking at them. >> > >> > This is a no go for reasons already stated many times. There are >> > security implications (those packs might contain stuff that you didn't >> > intend to be publically accessible) and there might be efficiency >> > reasons as well (you might have a shared object store with lots of stuff >> > unrelated to the particular clone). >> >> How do you deal with dense history packs? These packs take many hours >> to make (on a server class machine) and can be half the size of a >> regular pack. Shouldn't there be a way to copy these packs intact on >> an initial clone? It's ok if these packs are specially marked as being >> ok to copy. > > [sigh] > > Let me explain it all again. > > There is basically two ways to create a new pack: the intelligent way, > and the bruteforce way. > > When creating a new pack the intelligent way, what we do is to enumerate > all the needed object and look them up in the object store. When a > particular object is found, we create a record for that object and note > in which pack it is located, at what offset in that pack, how much space > it occupies in its compressed form within that pack, , and if whether it > is a delta or not. When that object is indeed a delta (the majority of > objects usually are) then we also keep a pointer on the record for the > base object for that delta. > > Next, for all objects in delta form which base object is also part of > the object enumeration and obviously part of the same pack, we simply > flag those objects as directly reusable without any further processing. > This means that, when those objects are about to be stored in the new > pack, their raw data is simply copied straight from the original pack > using the offset and size noted above. In other words, those objects > are simply never redeltified nor redeflated at all, and all the work > that was previously done to find the best delta match is preserved with > no extra cost. Does this process cause random reads all over a 2GB pack file? Busy servers can't keep a 2GB pack in memory. sendfile() the 2GB pack to client is way more efficient. (assuming the pack is marked as being ok to send). > > Of course, when your repository is tightly packed into a single pack, > then all enumerated objects fall into the reusable category and > therefore a copy of the original pack is indeed sent over the wire. > One exception is with older git clients which don't support the delta > base offset encoding, in which case the delta reference encoding is > substituted on the fly with almost no cost (this is btw another reason > why a dumb copy of existing pack may not work universally either). But > in the common case, you might see the above as just the same as if git > did copy the pack file because it really only reads some data from a > pack and immediately writes that data out. > > The bruteforce repacking is different because it simply doesn't concern > itself with existing deltas at all. It instead start everything from > scratch and perform the whole delta search all over for all objects. > This is what takes lots of resources and CPU cycles, and as you may > guess, is never used for fetch/clone requests. > > > Nicolas > -- Jon Smirl jonsmirl@xxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html