Re: git pack/unpack over bittorrent - works!

Luke Kenneth Casson Leighton <luke.leighton@xxxxxxxxx> · Mon, 6 Sep 2010 14:23:48 +0100

On Mon, Sep 6, 2010 at 12:52 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:

>> another branch (which is the situation that, i believe, happens with
>> "git pull" over http:// or git://); ignoring the fact that i actually
>> implemented using the .idx file yesterday ... :)
>
> Please, let's get it slow.

 ack :)

> There are 2 concepts you really need to master in order to come up with
> a solution.  And those concepts are completely independent from
> each other, but at the moment you are blending them up together and
> that's not good.

 i kinda get it - but i realise that's not good enough: i need to be
able to _say_ i get it, in a way that satisfies you.

> The first one is all about object enumeration.  And object enumeration
> is all about 'git rev-list'.  This is important when offering objects to
> the outside world that you actually do offer _all_ the needed objects,
> but _only_ the needed objects.  If some objects are missing you get a
> broken repository.  But more objects can also be a security problem as
> those extra objects may contain confidential data that you never
> intended to publish.

 ack.

> And object enumeration has absolutely nothing to do with packs, nor .idx
> files for that matter.

 mmm, that packs have nothing to do with object enumeration, i get.  i
understand that .idx files contain "lists of objects", which isn't the
same thing (they also happen to contain pointers/offsets to the objects
in their associated .pack).
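
 to make the "sorted list of objects plus offsets" idea concrete, here
is a minimal sketch, assuming the version-2 .idx layout (4-byte
signature, version, 256-entry fanout, sorted 20-byte names, CRC32
table, 32-bit offset table).  the function names are made up for
illustration, make_idx_v2 is only a test helper (it omits the trailing
checksums), and offsets that spill into the 64-bit table are ignored.

```python
import bisect
import struct

IDX_MAGIC = b"\xfftOc"  # signature of a version-2 .idx file


def make_idx_v2(pairs):
    """Helper for experiments: build the tables parsed below from
    (sha_bytes, offset) pairs.  Not git's own writer; no checksums."""
    pairs = sorted(pairs)
    fanout = [0] * 256
    for sha, _ in pairs:
        fanout[sha[0]] += 1
    for i in range(1, 256):          # make the counts cumulative
        fanout[i] += fanout[i - 1]
    out = IDX_MAGIC + struct.pack(">I", 2) + struct.pack(">256I", *fanout)
    out += b"".join(sha for sha, _ in pairs)
    out += b"\0" * (4 * len(pairs))  # dummy CRC32 entries
    out += struct.pack(">%dI" % len(pairs), *(off for _, off in pairs))
    return out


def parse_idx_v2(data):
    """Return (fanout, sorted object names, pack offsets)."""
    if data[:4] != IDX_MAGIC or struct.unpack(">I", data[4:8])[0] != 2:
        raise ValueError("not a version-2 .idx")
    fanout = struct.unpack(">256I", data[8:8 + 1024])
    count = fanout[255]              # last fanout entry = total objects
    pos = 8 + 1024
    shas = [data[pos + 20 * i:pos + 20 * (i + 1)] for i in range(count)]
    pos += 20 * count                # past the name table
    pos += 4 * count                 # skip the CRC32 table
    offsets = struct.unpack(">%dI" % count, data[pos:pos + 4 * count])
    return fanout, shas, offsets


def find_offset(fanout, shas, offsets, sha):
    """Use the fanout to narrow the range, then binary-search it."""
    first = sha[0]
    lo = fanout[first - 1] if first else 0
    hi = fanout[first]
    i = bisect.bisect_left(shas, sha, lo, hi)
    if i < hi and shas[i] == sha:
        return offsets[i]
    return None
```

 note that nothing in these tables says which commit an object belongs
to: it is purely "name -> offset in this one pack".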

 at some point i'd really like to know what the object list is (not
the objects themselves) that comes out of "git pack-objects --thin"
but my curiosity can wait.

> As I said, the objects you want might be split
> across multiple packs, and also in loose form, and also in some
> alternate location that is shared amongst many repositories on the same
> filesystem.

 ok - this tells me (and it's confirmed, below) that you're describing
the situation based on what can be found in .git - _not_ what comes
out of "git pack-objects".  i wouldn't _dream_ of digging around in a
.git/ location looking for packs or idx files, but because i have
mentioned them _without_ prefixing every mention with "the
custom-generated .idx and/or .pack as generated by git pack-objects",
you may have got the wrong impression, for which i apologise.

>  But a single pack may also contain more than what you want
> to offer, and it is extremely important that you do _not_ offer those
> objects that are not reachable from the branch you want to publish.
>
> Following me so far?

 yep :)

> The second concept is all about object _representation_ or _encoding_.
> That's where the deltas come into play.  So the idea is to grab the list
> of objects you want to publish, and then look into existing packs to see
> if you could find them in delta form.  So, for each object, if you do
> find them in delta form, and the object the delta is made against is 1)
> also part of the list of objects you want to send, or 2) is already
> available at the remote end, then you may simply reuse that delta data
> as is from the pack.  Finding if a particular pack has the wanted object
> is easy: you just need to look it up in the .idx file.  Then, in the
> corresponding pack file you parse the object header to find out if it is
> a delta, and what its base object is.

 ok.  all of this makes sense - but it's enough for me to be able to
ask questions, rather than "do", if you know what i mean.
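
 as a note to self on "parse the object header to find out if it is a
delta": a minimal sketch, assuming the usual pack entry header
encoding (3-bit type and variable-length size, followed by either a
20-byte base name for ref-deltas or an encoded backwards distance for
ofs-deltas).  parse_entry_header is a made-up name, and it only reads
the header, not the zlib data that follows.

```python
OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
             6: "ofs_delta", 7: "ref_delta"}


def parse_entry_header(buf, pos=0):
    """Return (type_name, size, base, next_pos) for the entry at pos.

    base is None for non-delta objects, a hex object name for
    ref_delta, and a backwards byte distance for ofs_delta.
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x7
    size = byte & 0x0F               # low 4 bits of the size
    shift = 4
    while byte & 0x80:               # MSB set: more size bytes follow
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7

    base = None
    if obj_type == 7:                # ref_delta: base named by SHA-1
        base = buf[pos:pos + 20].hex()
        pos += 20
    elif obj_type == 6:              # ofs_delta: base is earlier in this pack
        byte = buf[pos]
        pos += 1
        dist = byte & 0x7F
        while byte & 0x80:
            byte = buf[pos]
            pos += 1
            dist = ((dist + 1) << 7) | (byte & 0x7F)
        base = dist
    return OBJ_TYPES.get(obj_type, "unknown"), size, base, pos
```

 for a delta entry the size is that of the delta data, not of the
object it reconstructs.  whether the delta can be reused as-is then
depends on the base being in the list to send, or already at the
remote end.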

>>  ... there is a bit of a disadvantage to using pack index files that
>> it goes all the way down (if i am reading things correctly) and cannot
>> be told "give me just the objects related to a particular commit"....
>
> Exact.  The .idx file gives you a list of objects that exist in the
> corresponding pack.  That list of objects might belong to a totally
> random number of random commits.  You may also have a random number of
> packs across which some or all objects are distributed.  Because, of
> course, not all the objects you need are always packed.
>
> So... I hope you understand now that there is no relation between
> commits and .idx files.  The only exception is when you do create a
> custom pack with 'git pack-objects'.

 yes.  ahh... that's what i've been doing: using "git pack-objects
--thin".  and the reason for that is because i've seen it used in the
http implementation of "git fetch".

 so, my questions up until now regarding .pack and .idx have all been
targeted at that, and based on that context, _not_ the packs+idx
files that are in .git/

>> > Try this instead:
>> >
>> >    git rev-list --objects HEAD | cut -c -40 | sort
>> >
>> > That will give you a sorted list of all objects reachable from the
>> > current branch.  With the Linux repo, you may replace "HEAD" with
>> > "v2.6.34..v2.6.35" if you wish, and that would give you the list of the
>> > new objects that were introduced between v2.6.34 and v2.6.35.
>>
>>  ... unlike this, which is in fact much more along the lines of what i
>> was looking for (minus the loveliness of the delta compression oh
>> well)
>
> Again, delta compression is a _separate_ issue.
>
>> > This will
>> > provide you with 84642 objects instead of the 1.7 million objects that
>> > the Linux repo contains (easier when testing stuff).
>>
>>  hurrah! :)  [but, then if you actually want to go back and get alll
>> commits, that's ... well, we'll not worry about that too much, given
>> the benefits of being able to get smaller chunks.]
>
> If you want all commits then you just need --all instead of HEAD.

 no, i want commits separated and individual and "compoundable".  the plan is:

* to get the ref associated with refs/heads/master
* to get the list of all commits associated with that master ref
* to work out how far local deviates from remote along that list of commits
* to get the objects which will make up the missing commits (if they
aren't already in the local store)
* to apply those commits in the correct order

in other words, the plan is to follow what git http fetch and/or git
git:// fetch do as much as possible (ok, perhaps not).
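
the "how far does local deviate from remote" step could look roughly
like this.  a minimal sketch, assuming the remote commit list was
produced newest-first with children before parents (e.g. from
git rev-list --topo-order) and that local_have is the set of commit
names already in the local store; missing_commits is a made-up name.

```python
def missing_commits(remote_commits, local_have):
    """Return the commits to fetch, oldest first.

    remote_commits: commit names, children before parents.
    local_have:     set of commit names already stored locally.
    """
    # Filter rather than stop at the first known commit: with merges,
    # unknown commits from another line of history can appear after
    # a known one in the listing.
    missing = [sha for sha in remote_commits if sha not in local_have]
    missing.reverse()                # parents first, i.e. apply order
    return missing
```

so with a remote list of d, c, b, a and a local store holding a and b,
the result is c then d.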

the reason for getting the objects individually (blobs etc.) should be
clear: prior commits _could_ have resulted in that exact object having
been obtained already.

so far i have implemented:

* get the master ref using git for-each-ref
* get the list of all commits using git rev-list
* enumerate the list of objects associated with an individual commit by:
    i) creating a CUSTOM pack+idx using git pack-objects {ref}
    ii) *parsing* the idx file using gitdb's FileIndex to get the list
of objects
    iii) transferring that list to the local machine
* requesting *individual* objects from the enumerated list out of the idx file
   by using a CUSTOM "git pack-objects --thin {ref} < {ref}" command

that's as far as i've got, before you mentioned that it would be
better to use "git rev-list --objects commit1..commit2" and to use
"git cat-file" to obtain the actual object [what's not clear in this
plan is how to store that cat'ed file at the local end, hence the
continued use of git pack-objects --thin {ref} < {ref}]
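
for the "git rev-list --objects commit1..commit2" route, a minimal
sketch of the enumeration side.  the function names are made up; the
only git invocation assumed is "git rev-list --objects <range>", whose
lines are an object name optionally followed by a space and a path.

```python
import subprocess


def parse_rev_list_objects(text):
    """Turn rev-list --objects output into (sha, path) pairs.
    Commits carry no path, so their path comes back empty."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, _, path = line.partition(" ")
        pairs.append((sha, path))
    return pairs


def objects_to_fetch(listing, already_have):
    """Drop objects the local store already holds, keeping order."""
    seen = set()
    wanted = []
    for sha, _ in listing:
        if sha not in already_have and sha not in seen:
            seen.add(sha)
            wanted.append(sha)
    return wanted


def rev_list_objects(repo, rev_range):
    """Run git in repo and parse its output (needs a real repository)."""
    out = subprocess.run(
        ["git", "rev-list", "--objects", rev_range],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return parse_rev_list_objects(out)
```

objects_to_fetch is where "an earlier commit already brought this
blob in" pays off: anything in already_have is never requested.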

the prior implementation was to treat the custom pack-object as if it
was "the atomic leaf-node operation" instead of individual objects
(blobs, trees).

>> > That sorted list of objects is more or less what the pack index file
>> > contains, plus an offset in the pack for each entry.  It is used to
>> > quickly find the offset for a given object in the corresponding pack
>> > file, and the fanout is only a way to cut 3 iterations in the binary
>> > search.
>> >
>> > But anyway, what you want is really to select the precise set of objects
>> > you wish to share, and not blindly using the pack index file.  If you
>> > have a public branch and a private branch in your repository, then
>> > objects from both branches may end up in the same pack
>>
>>  slightly confused: are you of the belief that i intend to ignore
>> refs/heads/* starting points?
>
> I don't know what your exact understanding of Git is, and although I
> know one or two things about the Git storage model, I get confused
> myself by some of your comments, such as this one above.

 soorree.  i believe the source of the confusion is that you believed
that i intend to "blindly use a pack index file" as in "blindly go
rummaging around in .git/ at the remote end" when i have absolutely no
intention of doing so.

 what i _have_ been doing however is custom-generating pack-objects
and associated pack-indexes (just like git http fetch) _including_
using the --thin option because that's what git http fetch does.

 i believe that this results in the concerns that you raised (about
having access to unauthorised data) being dealt with.

>> > don't want to publish those objects from the private branch.
>>
>>  ahh, i wondered where i'd seen the bit about "confusing" two
>> branches, i thought it was in another message.  so many flying back &
>> forth :)  from what i can gather, this is exactly what happens with
>> git fetch from http:// or git:// so what's the big deal about that?
>> why stop gitp2p from benefitting from the extra compression that could
>> result from "borrowing" bits of another branch's objects, neh?
>
> No.  git:// will _never_ ever transfer any object that is not part of
> the published branch(es).

 ... because it uses, from what i can gather, git pack-objects --thin

> If an object that does get transmitted is
> actually a delta against an object that is only part of a branch that is
> not published, then the delta will be expanded and redone against
> another suitable object before transmission.

 and that's handled by git pack-objects --thin (am i right?)

 ok.

 so.  we have a hierarchical plan: get the commit list, get a
per-commit object-list, get the objects (if needed), store the
objects.

 problem: despite looking through virtually every single builtin/*.c
file which uses write_sha1_file (which i believe i have correctly
identified, from examining git unpack-objects, as being the function
which stores actual objects, including their type), i do not see a git
command (yet) which performs the reverse operation of "git cat-file".

builtin/apply.c - that's for patches
builtin/checkout.c - that's for the merge result.
builtin/notes.c - creating a note
builtin/tag.c - creating a tag
builtin/mktree.c - creating a tree object but *only* from a text listing

ok - maybe this is one of the ones that i need, but only if i use "git
cat-file -p" to pretty-print the output of tree objects but i don't
think that's a good idea.

what else...

cache-tree.c - nope.
commit.c - nope.
read-cache.c - beh? nope.  has blank args "", 0

so... um... unless i actually manually create a pack object (perhaps
using python-gitdb to construct it) out of the data obtained by "git
cat-file" i don't see how this would work.
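
for reference, storing a single object without building a pack is not
much code.  a minimal sketch, assuming the loose object format (SHA-1
over "<type> <size>\0" plus the raw content, zlib-compressed under
objects/xx/yyyy...) and assuming data is the raw content as printed by
"git cat-file <type> <sha>", not the -p pretty-printed form.  the
function names are made up; the plumbing command "git hash-object -w
-t <type> --stdin" does the same job.

```python
import hashlib
import os
import zlib


def object_id(obj_type, data):
    """Return (hex name, raw bytes to store) for an object."""
    header = ("%s %d" % (obj_type, len(data))).encode("ascii") + b"\0"
    raw = header + data
    return hashlib.sha1(raw).hexdigest(), raw


def write_loose_object(git_dir, obj_type, data):
    """Store data as a loose object under git_dir and return its name."""
    sha, raw = object_id(obj_type, data)
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    if not os.path.exists(path):     # same name means same content
        os.makedirs(os.path.dirname(path), exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(zlib.compress(raw))
        os.replace(tmp, path)        # rename into place once complete
    return sha
```

because the name is derived from type and content, the receiving end
can check every object it is handed against the name it asked for.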

l.