Re: What is the status of GSoC 2022 work on making Git use roaring bitmaps?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 31, 2023 at 07:46:18PM +0200, Han-Wen Nienhuys wrote:
> On Sat, Mar 25, 2023 at 6:40 PM Jakub Narębski <jnareb@xxxxxxxxx> wrote:
> > >>> Abhradeep promised[1] that he'd include some performance work in his
> > >>> next version of that series. I think the main things we'd be interested
> > >>> in are:
> > >>>
> > >>>   - Does using Roaring provide a file-size advantage over
> > >>>     EWAH-compressed bitmaps?
>
> I modified JGit to write Roaring bitmaps instead of EWAH bitmaps. The
> resulting difference in file sizes are small, and actually favor EWAH:
>
> $ ls -l {ewah-repos,roaring-repos}/*.git/objects/pack/*.bitmap
> -r--r----- 1 hanwen primarygroup 26257386 Jul 31 15:04
> ewah-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap
> -r--r----- 1 hanwen primarygroup 27621579 Jul 31 15:20
> roaring-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap
>
> -r--r----- 1 hanwen primarygroup  1037356 Jul 31 14:46
> ewah-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap
> -r--r----- 1 hanwen primarygroup  1242608 Jul 31 14:45
> roaring-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap

This was one of my hopes with Roaring+Run, too, but I think that it's a
non-starter with our current object ordering for reachability bitmaps.

I did the same experiment a few months ago in my fork of git.git, and
consistently was able to produce smaller bitmaps when EWAH compressed as
compared to Roaring+Run. I think the reason is that our bitmaps are
pretty sparse, and so often have a lot of 0's, interspersed with a few
1's.

Depending on container alignment, a single 1's bit surrounded by zeros
will often get encoded as a length 0 "run" container. This means that
instead of storing that information in a single bit like we would with
EWAH, we use two 16-bit unsigned values to store (a) the position[^1],
and (b) length of the run.

The length is entirely wasted space, since the bytes are zeros, and
storing the position is significantly less efficient than storing a
sparse bit position in EWAH.

I haven't proved conclusively one way or the other where Roaring+Run is
significantly faster than EWAH or vice-versa. There are some cases where
the former is a clear winner, and other cases where it's the latter.

In any event, my extremely WIP patches to make this mostly work are
available here:

  https://github.com/ttaylorr/git/compare/tb/roaring-bitmaps

One thing that I was able to do to produce slightly smaller Roaring+Run
encoded .bitmap files is to store some of the bitmaps as XORs against
earlier bitmaps, similar to what we do with EWAH. Often XORing the raw
bits, and then compressing that with Roaring+Run can be significantly
smaller than storing another full Roaring+Run bitset.

That's the only part that I've had trouble getting to make it
consistently work, and I haven't had time to get back to it, so it's
been collecting dust since the end of May.

Thanks,
Taylor

[^1]: Not the bit position, exactly, but rather the 16 least-significant
  bits of the bit position, since all values in the same container share
  the 16 most-significant bits being a multiple of 2^16.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux