Re: Achieving efficient storage of weirdly structured repos

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Linus,

On Apr 4, 2008, at 4:57 PM, Linus Torvalds wrote:
On Fri, 4 Apr 2008, Roman Shaposhnik wrote:

That turned out to be a perfect suggestion. Thank you. I'm now the
happiest camper ever. And I'm also also pretty dumbfounded ;-)

Ok, the dumbfounded we can help with.

Great!

Here's what happened.

I started with a a repository filled with "loose" (one object per file)
objects (the reason I needed it was for the ease of sleuthing through
individual objects and it was created by git-unpack-objects from that
initial 1.1Gb pack). And I tried to pack it exactly like you
suggested:
$ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects

Well, that's not exactly like I suggested, that's a *really* old- fashioned
way.

It sure is old-fashioned. My only excuse is that since git repack is a shell
script around git-pack-objects I've always felt comfortable just using
the lower level utility.

But the part you left off was what the "objects" file contained?

It was pretty much the result of (cd .git/objects ; find . -type f | tr -d './')
Omitting that was quite silly on my part, but I was really convinced
that because of the internal sorting the original ordering of objects
wouldn't matter at all.

In particular, "git pack-objects" can take just a raw list of objects, and
it will *work*, but without the naming information and without the
ordering information on the objects, the end result will generally suck.

So how did you generate the "objects" file? You can do it by just listing every single loose object you have, and it will work, but the end result
won't be very pretty.

So it seems that my list of objects was different from what git-rev- list/setup_revisions()
would have provided in two ways:
    1. the order of objects was arbitrary
    2. the naming of blobs and trees was missing
#2 was, indeed, a huge oversight on my part. At the same time, I've always thought
that #1 shouldn't matter, because, as you pointed out, the objects get
sorted by <type, namehash, length> anyway. However, it seems that because of the lack of naming there was much less sorting done by type_size_sort() and the original order persevered (at least within type-size partitions).
It did matter after all!

Now, in my particular case, the ordering was braindead and it resulted in
a highly visible inefficiency. On the other hand, it seems that creative
ordering could very well be used to exploit localities which go beyond
how default name hashing in builtin-pack-objects.c works.

In fact, it starts to look awfully like custom MPEG profiles where if you know
your footage you can achieve a much higher compression ratio
compared to what the default might give you. It also makes it very
clear that Git's approach of doing away with per-file content tracking
is quite superior to the in-file deltas. Cool!

Here's my final question on that issue: wouldn't it be great to give users a direct control over specifying the list of objects in exactly the order they would like them to be tried for deltifying? Something like -- preserve-order
option available for git-pack-objects?

    Total 1159628 (delta 614516), reused 0 (delta 0)

Ie you got a much better 60% delta ratio, and obviously a much smaller
pack. But it's not just smaller, because the resulting pack will also have
the objects in topological order, so that objects that are "new"
(ie more closely reachable from the top-of-tree) will be at the front of the pack, so you'll also generally have better IO patterns in the packfile
itself.

Got it! This makes total sense.

Is there any documentation that describes the heuristics involved in
creating a pack?

It's been explained a few times on the mailing list, but I don't know if
there is some write-up.

The git pack format is a bit more relaxed than any other SCM format I know about, since it literally is just a "bunch of objects with deltas randomly between them". So a pack can look just about any way it wants - there are no real ordering requirements (--delta-base-offset does require that the delta follows the base, but even that is really just a trivial practical "because otherwise you could never know what the offset in the pack- file
was" issue rather than anything else).

So you can order the objects and make deltas just about any way you want. What "git repack" will do is to sort objects by <type, namehash, length>, and then walk the list with the window of the specified size to see the
best pairing it can do.

The "namehash" is just designed so that files that have the same name sort
together (and it's also not a real "hash" - it's really a meaningless
number, but one that is designed so that even if the names aren't exactly
the same, they sort closer together if they end in similar character
sequences - so a file that moves from one directory to another but keeps
the same basename will sort source and destination together).

See "type_size_sort()" in builtin-pack-objects.c to see what's up.

(The preferred_base thing is just to keep objects we actually want to
*keep* in the pack separated from the objects we may be using for
references - but that is only used for transferring objects over the wire,
never for standalone packs).

Oh. And now I notice that there is *some* documentation on this in

	Documentation/technical/pack-heuristics.txt

but that's pretty old (not that I think it has really changed) and not
very deep. Just taken from some #irc discussion.

P.S. Oh, and here's one extra tiny question that I also have: what
does the output:
   Total 1159628 (delta 614516), reused 0 (delta 0)
really mean?

It just means that there were a million objects total, the pack was
created with 614k of them as deltas against other objects, and none of
them were re-used from previous packs.

You'll see that when you do incremental repacks, 99% of all the objects
and deltas get re-used from the old pack, which is how we can make
repacking fairly efficient. That's imporant because the native git
protocol format itself is just a pack-file (plus the handshake to figure out what both sides have, of course). So a "git clone" implies a repack.

Very nice summary! Many, many thanks for providing it!

Thanks,
Roman.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux