Re: Using bitmaps to accelerate fetch and clone

On Thu, Sep 27, 2012 at 11:22 AM, Jeff King <peff@xxxxxxxx> wrote:
>
> I think clients will also want it. If we can make "git rev-list
> --objects --all" faster (which this should be able to do), we can speed
> up "git prune", which in turn is by far the slowest part of "git gc
> --auto", since in the typical case we are only incrementally packing.

Yes, the bitmap can also accelerate prune. We didn't implement this
but it is a trivial use of the existing bitmap.
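
As a rough illustration of why this is a small step once the bitmaps
exist, here is a toy sketch. It assumes each ref already has a
reachability bitmap, modeled as a Python integer whose bit i stands for
the i-th object in the pack. The function names are hypothetical and
this is not how git stores or loads bitmaps.

```python
def reachable_bitmap(ref_bitmaps):
    """OR together the per-ref reachability bitmaps."""
    combined = 0
    for bitmap in ref_bitmaps:
        combined |= bitmap
    return combined


def prunable_positions(object_count, ref_bitmaps):
    """Return the pack positions of objects that no ref can reach."""
    all_objects = (1 << object_count) - 1
    unreachable = all_objects & ~reachable_bitmap(ref_bitmaps)
    return [i for i in range(object_count) if (unreachable >> i) & 1]


# Two refs covering positions 0-5 between them; positions 6 and 7 are left over.
print(prunable_positions(8, [0b00001111, 0b00110011]))  # prints [6, 7]
```

The point is only that "everything not reachable" becomes one AND-NOT
over bitmaps instead of a full object graph walk.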

>> > The sha1 in the filename makes sure that the reachability file is always
>> > in sync with the actual pack data and index.
>>
>> Depending on the extension dependencies, you may need to also use the
>> trailer SHA-1 from the pack file itself, like the index does. E.g. the
>> bitmap data depends heavily on object order in the pack and is invalid
>> if you repack with a different ordering algorithm, or if delta
>> compression produces a different set of deltas.
>
> Interesting. I would have assumed it depended on order in the index.

No. We tried that. Assigning bits by order in the index (i.e. sorted
SHA-1 order) results in horrible compression of the bitmap itself,
because SHA-1 values are uniformly distributed. Encoding instead by pack
order gets us really good bitmap compression, because object graph
traversal order tends to take reachability into account. So we see
long contiguous runs of 1s and get good compression. Sorting by SHA-1
just makes the space into swiss cheese.
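
The effect can be seen in a small simulation. This is a sketch under
loud assumptions: the object IDs are made up (SHA-1 of a counter), the
first 600 of 1000 objects in pack order are assumed to be exactly the
reachable set, and "compression" is approximated by counting runs of
equal bits, which is what a run-length style encoding benefits from.

```python
import hashlib


def count_runs(bits):
    """Count maximal runs of equal bits; fewer runs suit run-length encoding."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])


def compare_orderings(total=1000, reachable=600):
    # Hypothetical object IDs, listed here in pack (traversal) order.
    ids = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(total)]
    # Assumption: the reachable objects sit at the front of the pack.
    reachable_ids = set(ids[:reachable])

    pack_order_bits = [int(oid in reachable_ids) for oid in ids]
    sha1_order_bits = [int(oid in reachable_ids) for oid in sorted(ids)]
    return count_runs(pack_order_bits), count_runs(sha1_order_bits)


pack_runs, sha1_runs = compare_orderings()
print("runs in pack order: ", pack_runs)
print("runs in SHA-1 order:", sha1_runs)
```

In pack order the bitmap is one run of 1s followed by one run of 0s.
In SHA-1 order the same bits are scattered, so the run count grows
with the number of objects.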

> I think you are still OK, though, because
> the filename comes from the sha1 over the index file, which in turn
> includes the sha1 over the packfile. Thus any change in the packfile
> would give you a new pack and index name.

No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
in the pack. Any change in compression settings or delta windows or
even just random scheduling variations when repacking can cause
offsets to slide, even if the set of objects being repacked has not
differed. The resulting pack and index will have the same file names
(as it's the same set of objects), but the offset information and
ordering are now different.
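
A minimal sketch of the collision, assuming the name is the SHA-1 over
the sorted binary object IDs as described above. The "pack stream"
here is a stand-in (IDs concatenated in write order plus a SHA-1
trailer), not the real pack format, and the helper names are
hypothetical.

```python
import hashlib


def pack_name(object_ids):
    """SHA-1 over the sorted object IDs (simplified model of the pack name)."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return h.hexdigest()


def fake_pack_stream(object_ids):
    """Stand-in pack: IDs in write order, followed by a SHA-1 trailer."""
    body = b"".join(bytes.fromhex(oid) for oid in object_ids)
    return body + hashlib.sha1(body).digest()


def trailer(pack_bytes):
    """The last 20 bytes of the stream, as hex."""
    return pack_bytes[-20:].hex()


ids = [hashlib.sha1(name).hexdigest() for name in (b"a", b"b", b"c")]
first = fake_pack_stream(ids)
second = fake_pack_stream(list(reversed(ids)))  # same objects, other order

print(pack_name(ids) == pack_name(reversed(ids)))  # True: same file name
print(trailer(first) == trailer(second))           # False: different content
```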

Naming a pack after a SHA-1 is a fun feature. Naming it after the
SHA-1 of the object list was a mistake. It should have been named
after the SHA-1 in the trailer of the file, so that any single bit
modified within the pack stream itself would have caused a different
name to be used on the filesystem. But alas this is water under the
bridge and not likely to change anytime soon.
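
Given that the name cannot be relied on, the check suggested earlier
in the thread (recording the pack's trailer SHA-1 in the extension,
as the index does) could look roughly like this. It is a sketch only:
it relies on the pack ending with a 20-byte SHA-1 trailer, and
`recorded_trailer` is a hypothetical field that the extension file
would have to store.

```python
import os

TRAILER_LEN = 20  # the pack ends with a SHA-1 over everything before it


def read_pack_trailer(pack_path):
    """Return the final 20 bytes of the pack file."""
    with open(pack_path, "rb") as f:
        f.seek(-TRAILER_LEN, os.SEEK_END)
        return f.read(TRAILER_LEN)


def extension_is_current(pack_path, recorded_trailer):
    """True if the extension was built against this exact pack stream."""
    return read_pack_trailer(pack_path) == recorded_trailer
```

A reader of the extension would call `extension_is_current()` before
trusting any bit positions, and ignore the extension if it returns
False.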

>> Yes. One downside is these separate streams aren't removed when you
>> run git repack. But this could be fixed by a modification to git
>> repack to clean up additional extensions with the same pack base name.
>
> I don't think that's a big deal. We already do it with ".keep" files. If
> you repack with an older version of git, you may have a stale
> supplementary file wasting space. But that's OK. The next time you gc
> with a newer version of git, we could detect and clean up such stale
> files (we already do so for tmp_pack_* files).

Yes, obviously.
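
For illustration, a cleanup pass of that kind might be sketched as
below. The ".bitmap" extension name and the function name are
assumptions made for the example; nothing here is git's actual gc
code.

```python
from pathlib import Path

# Hypothetical supplementary extensions that only make sense next to a pack.
SUPPLEMENTARY_EXTENSIONS = {".bitmap"}


def stale_supplementary_files(pack_dir):
    """List supplementary files whose pack-<sha1>.pack no longer exists."""
    pack_dir = Path(pack_dir)
    live_bases = {p.stem for p in pack_dir.glob("pack-*.pack")}
    return sorted(
        p
        for p in pack_dir.glob("pack-*")
        if p.suffix in SUPPLEMENTARY_EXTENSIONS and p.stem not in live_bases
    )
```

Name matching alone does not catch a repack that produces the same
pack name with a different object order; that case still needs the
trailer comparison.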