Re: git pack/unpack over bittorrent - works!

On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
> > On 09/02/2010 09:37 AM, Luke Kenneth Casson Leighton wrote:
> >>
> >> On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy<pclouds@xxxxxxxxx>
> >>  wrote:
> >
> > [...]
> >>>
> >>> There were discussions whether a pack is stable enough to
> >>> be shared like this,
> >>
> >>  it seems to be.  as long as each version of git produces the exact
> >> same pack object, off of the command "git pack-objects --all --stdout
> >> --thin {ref}<  {objref}"
> >
> > This is not guaranteed.
> 
>  ok.  greeeat.
> 
>  so, some sensible questions:
> 
>  * what _can_ be guaranteed?

You can guarantee that if different packs have the same SHA1 name then 
they contain the same set of objects.  Their packed encoding may well 
differ, though, and even the pack sizes might be quite different.
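
To illustrate why the name says nothing about the encoding, here is a 
rough Python sketch, assuming the pack name is the SHA1 of the sorted 
binary object names (as I understand it, that is how Git derives the 
name at the time of writing).  The function name pack_name is made up 
for this example and is not a Git API.

```python
import hashlib

def pack_name(object_ids):
    """Illustrative pack name: SHA1 over the sorted binary object names.

    Only the *set* of objects feeds the hash, never their packed encoding.
    """
    h = hashlib.sha1()
    for oid in sorted(set(object_ids)):
        h.update(bytes.fromhex(oid))
    return h.hexdigest()

# Two hypothetical packs listing the same objects in a different order.
a = ["e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
     "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"]
b = list(reversed(a))

print(pack_name(a) == pack_name(b))  # same object set, same name
```

Nothing about delta choices or compression enters that computation, so 
two packs with the same name can still differ byte for byte.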

>  * diffs?

Again that depends.  Over the evolution of Git, its diff library was 
modified, resulting in slightly different but equally valid diff 
outputs.

>  * git-format-patches? (which i am aware can do binary files and also
> rms)?

Same as above.

> * individual files in the .git/objects directory?

Well, even then you can't guarantee they will be identical from one 
system to another.  That may depend on the zlib library version used for 
example.
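
A small Python sketch of the point, using different compression levels 
as a stand-in for different zlib versions or settings (that substitution 
is my assumption for the sake of a reproducible example).  The helper 
loose_object is hypothetical; the "blob <size>\0<data>" framing is the 
loose object format.

```python
import hashlib
import zlib

def loose_object(data, kind=b"blob"):
    """Return (object id, uncompressed bytes) for a loose object."""
    store = kind + b" " + str(len(data)).encode() + b"\0" + data
    return hashlib.sha1(store).hexdigest(), store

oid, store = loose_object(b"hello world\n" * 100)

# Same object, two deflate settings: the bytes on disk are not the same.
fast = zlib.compress(store, 1)
best = zlib.compress(store, 9)

print(fast != best)                                    # encodings differ
print(zlib.decompress(fast) == zlib.decompress(best))  # content does not
```

The object name is computed over the uncompressed content, so it is the 
same whichever compressed form ends up in .git/objects.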

>  and, asking perhaps some silly questions:
> 
> * why is it not guaranteed?

Because it doesn't need to be.

> * under what circumstances is it not guaranteed?  and, crucially, is
> it necessary to care?   i.e. if someone does a shallow git clone, i
> couldn't give a stuff.

Even repeating the same repack on the same machine with the same input 
is likely to produce slightly different packs because of threading.  
The work set is divided between threads, and since thread scheduling is 
not deterministic, some threads may get more or fewer CPU cycles 
relative to the others.  When a thread is done with its work set, it 
steals half of the remaining work from the thread that has the most 
work still left.  This changes the delta pairing outcome at the work 
set edges.
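
A deliberately simplified Python model of the edge effect, not Git's 
actual code: each object may only pick a delta base among the previous 
"window" objects in the same work set.  The function name, the window 
size and the explicit boundaries argument are all assumptions of this 
sketch; in Git the boundaries move at run time because of work stealing.

```python
def delta_candidates(objects, boundaries, window=2):
    """Return the (target, base) pairs considered when each work set
    is searched independently of the others.

    boundaries: sorted split indices, e.g. [4] splits 8 objects
    into objects[0:4] and objects[4:8].
    """
    pairs = set()
    edges = [0] + list(boundaries) + [len(objects)]
    for start, end in zip(edges, edges[1:]):
        chunk = objects[start:end]
        for i, target in enumerate(chunk):
            # Only earlier objects in the same work set are candidates.
            for base in chunk[max(0, i - window):i]:
                pairs.add((target, base))
    return pairs

objs = list("abcdefgh")
run1 = delta_candidates(objs, [4])  # one run splits after 'd'
run2 = delta_candidates(objs, [5])  # another run splits after 'e'

# Pairs considered in one run but not in the other.
print(sorted(run1 ^ run2))
# [('e', 'c'), ('e', 'd'), ('f', 'e'), ('g', 'e')]
```

Only the objects next to the split see a different set of candidate 
bases, which is why the resulting packs differ slightly rather than 
completely.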

> * is it possible to _make_ the repository guaranteed to produce
> identical pack objects?

Sure, but performance will suck.

> * does for example "git gc" change the object store in such a way such
> that one git repo will produce a different pack-object from the same
> ref?  if so, can running "git gc" prior to producing the pack-objects
> guarantee that the pack-objects will be the same?

No.  The gc operation will combine multiple small packs into one and try 
to reuse as much data from those existing packs as possible without 
recomputing it.  So you'll end up reusing whatever delta pairing you 
were given from your peer the last time you cloned a repo or fetched an 
update.  And of course that clone/fetch was the result of a pack 
combining operation on the sending end which itself tried to reuse as 
much of the existing data from different packs without recomputing it 
too.  Only the edges between different packs will be delta compressed in 
those cases, using the particular heuristics that happen to be 
implemented in the involved Git versions. So you may end up with a 
totally different pack content containing data segments that originated 
from wildly random places on the net.

The only way to get a bit-for-bit reproducible pack on one specific 
system is to use 'git repack' with the -f switch, and limit it to only 
one thread.
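
For example, something along these lines, wrapped in Python.  The 
pack.threads configuration variable and the -f switch are real; adding 
-a and -d (repack everything into a single pack and drop the old ones) 
is my own choice for this sketch, and the helper names are made up.

```python
import subprocess

def repack_command(threads=1):
    """Build a 'git repack' invocation that recomputes all deltas (-f)
    with the delta search limited to `threads` threads."""
    return ["git", "-c", f"pack.threads={threads}",
            "repack", "-a", "-d", "-f"]

def repack(repo_dir):
    """Run the repack inside repo_dir; raises CalledProcessError on failure."""
    subprocess.run(repack_command(), cwd=repo_dir, check=True)

print(" ".join(repack_command()))
# git -c pack.threads=1 repack -a -d -f
```

Even then the result is only reproducible on that one system with that 
one Git and zlib build.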

> * is it a versioning issue?  is it because there are different
> versions (2 and 3)?  if so, that's ok, you just force people to use
> the same pack-object versions.

Not at all.  FYI version 3 was never actually deployed so there is 
effectively only version 2 in play.  There are "features" such as 
OFS_DELTA that are negotiated when a pack is transferred over the git 
protocol, and if the receiver doesn't advertise them then the sender 
will convert them on the fly into a compatible form.
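
Roughly, in Python: the pack stream starts with a 12-byte header 
("PACK", version, object count, all big-endian), and the sender picks 
the delta representation based on whether the receiver advertised the 
"ofs-delta" capability.  The function names are hypothetical, and 
rejecting every version other than 2 is a simplification of this sketch.

```python
import struct

OBJ_OFS_DELTA = 6  # base referenced by its offset within the same pack
OBJ_REF_DELTA = 7  # base referenced by its 20-byte SHA1 name

def parse_pack_header(data):
    """Return (version, object count) from the first 12 bytes of a pack."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    if version != 2:  # simplification: only the deployed version
        raise ValueError(f"unexpected pack version {version}")
    return version, count

def delta_type_for(receiver_capabilities):
    """Pick the delta object type the receiver is able to handle."""
    if "ofs-delta" in receiver_capabilities:
        return OBJ_OFS_DELTA
    return OBJ_REF_DELTA

header = b"PACK" + struct.pack(">II", 2, 3)
print(parse_pack_header(header))                   # (2, 3)
print(delta_type_for({"ofs-delta", "side-band"}))  # 6
print(delta_type_for(set()))                       # 7
```

So two receivers fetching the same objects from the same server can 
already get different bytes on the wire.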

But as far as the actual pack bitstream goes, it is totally unstable for 
all the reasons I've stated so far.  Of course, Git being distributed, 
it must rely on some stable and universal representation of object 
content, hence the SHA1 references.  But the encoding doesn't have to 
be stable as long as all peers can cope with all the variations.

I'm sorry, as this unfortunately isn't going to help you much.

> 
> etc. etc.
> 
> l.