Re: [PATCH 0/9] Prefix-compress on-disk index entries

David Barr <davidbarr@xxxxxxxxxx> · Wed, 2 May 2012 14:26:15 +1000

On Wed, May 2, 2012 at 11:58 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Fri, Apr 6, 2012 at 3:41 PM, David Barr <davidbarr@xxxxxxxxxx> wrote:
>> On Thu, Apr 5, 2012 at 4:44 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>> Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:
>>>
>>>> On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>>> ...
>>>> I wonder what causes user time drop from .29s to .13s here. I think
>>>> the main patch should increase computation, even only slightly, not
>>>> less.
>>>
>>> The main patch reduced the amount of data that needs to be sent to the
>>> machinery to checksum and write to disk by about 45%, saving both I/O
>>> and computation.
>>
>> I hacked together a quick patch to try predictive coding the other
>> fields of the index. I got a further 34% improvement in size over
>> this series. Patches to come. I just used the previous cache entry as
>> the predictor and reused varint.h together with zigzag encoding[1].
>>
>> That's a total improvement in size over v2 of 62%.
>
> Have you posted (and I missed) the patches? I'm interested in seeing
> what changes you made.

I haven't posted anything - my proof of concept was write-only and slow.

I added a prelude with a bitmask that describes which fields differ
from the previous entry.
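
As a rough sketch of such a prelude (not the actual proof of concept;
the field set, the names entry_fields and changed_mask, and the bit
order are all made up for illustration), the mask could be computed
like this:

```c
#include <stdint.h>

/* Hypothetical subset of the numeric fields of an index entry. */
enum { F_MTIME, F_SIZE, F_MODE, F_INO, NUM_FIELDS };

struct entry_fields {
	uint32_t v[NUM_FIELDS];
};

/* Bit i of the result is set when field i differs from the previous entry. */
static unsigned changed_mask(const struct entry_fields *cur,
			     const struct entry_fields *prev)
{
	unsigned mask = 0;
	int i;

	for (i = 0; i < NUM_FIELDS; i++)
		if (cur->v[i] != prev->v[i])
			mask |= 1u << i;
	return mask;
}
```

Only the fields whose bit is set would then need to be written after
the mask; unchanged fields are recovered from the previous entry.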

For each differing field, I encoded something like:
    diff   := this - prev
    zigzag := (diff << 1) ^ (diff >> 31)
    raw    := zigzag - 1   /* zero impossible because of mask */
    write_varint(raw)
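
A minimal C sketch of the steps above, under some assumptions: fields
are 32 bits wide, the difference wraps modulo 2^32, and the varint is
the protocol-buffers style (7 bits per byte, low bits first) described
in [1], which is not necessarily the byte layout git's varint.h
produces. The helper names zigzag_delta, write_varint32 and
encode_changed_field are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Zigzag-encode the wrapped 32-bit difference cur - prev, so that
 * small negative and small positive deltas both map to small values:
 * 0 -> 0, -1 -> 1, +1 -> 2, -2 -> 3, ...
 */
static uint32_t zigzag_delta(uint32_t cur, uint32_t prev)
{
	uint32_t diff = cur - prev;       /* wraps modulo 2^32 */
	uint32_t sign = 0 - (diff >> 31); /* all ones if the delta is negative */

	return (diff << 1) ^ sign;
}

/* Write value 7 bits at a time, low bits first; return bytes written. */
static size_t write_varint32(uint32_t value, unsigned char *out)
{
	size_t n = 0;

	while (value >= 0x80) {
		out[n++] = (unsigned char)(value | 0x80); /* more bytes follow */
		value >>= 7;
	}
	out[n++] = (unsigned char)value;
	return n;
}

/*
 * Encode one field that the bitmask marks as changed. Because
 * cur != prev, the zigzag value is never zero, so 1 is subtracted
 * to avoid wasting that code point. out needs room for 5 bytes.
 */
static size_t encode_changed_field(uint32_t cur, uint32_t prev,
				   unsigned char *out)
{
	return write_varint32(zigzag_delta(cur, prev) - 1, out);
}
```

With this layout, a field that moves by +1 or -1 relative to the
previous entry costs a single byte, and any delta in the range -64..64
(excluding zero) still fits in one byte.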

I also experimented with using unique SHA-1 prefixes, but that was slow
and would probably introduce race conditions.

>> [1] https://developers.google.com/protocol-buffers/docs/encoding#types
--
David Barr