On Mon, Jul 31, 2023 at 07:46:18PM +0200, Han-Wen Nienhuys wrote: > On Sat, Mar 25, 2023 at 6:40 PM Jakub Narębski <jnareb@xxxxxxxxx> wrote: > > >>> Abhradeep promised[1] that he'd include some performance work in his > > >>> next version of that series. I think the main things we'd be interested > > >>> in are: > > >>> > > >>> - Does using Roaring provide a file-size advantage over > > >>> EWAH-compressed bitmaps? > > I modified JGit to write Roaring bitmaps instead of EWAH bitmaps. The > resulting difference in file sizes are small, and actually favor EWAH: > > $ ls -l {ewah-repos,roaring-repos}/*.git/objects/pack/*.bitmap > -r--r----- 1 hanwen primarygroup 26257386 Jul 31 15:04 > ewah-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap > -r--r----- 1 hanwen primarygroup 27621579 Jul 31 15:20 > roaring-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap > > -r--r----- 1 hanwen primarygroup 1037356 Jul 31 14:46 > ewah-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap > -r--r----- 1 hanwen primarygroup 1242608 Jul 31 14:45 > roaring-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap This was one of my hopes with Roaring+Run, too, but I think that it's a non-starter with our current object ordering for reachability bitmaps. I did the same experiment a few months ago in my fork of git.git, and consistently was able to produce smaller bitmaps when EWAH compressed as compared to Roaring+Run. I think the reason is that our bitmaps are pretty sparse, and so often have a lot of 0's, interspersed with a few 1's. Depending on container alignment, a single 1's bit surrounded by zeros will often get encoded as a length 0 "run" container. This means that instead of storing that information in a single bit like we would with EWAH, we use two 16-bit unsigned values to store (a) the position[^1], and (b) length of the run. The length is entirely wasted space, since the bytes are zeros, and storing the position is significantly less efficient than storing a sparse bit position in EWAH. I haven't proved conclusively one way or the other where Roaring+Run is significantly faster than EWAH or vice-versa. There are some cases where the former is a clear winner, and other cases where it's the latter. In any event, my extremely WIP patches to make this mostly work are available here: https://github.com/ttaylorr/git/compare/tb/roaring-bitmaps One thing that I was able to do to produce slightly smaller Roaring+Run encoded .bitmap files is to store some of the bitmaps as XORs against earlier bitmaps, similar to what we do with EWAH. Often XORing the raw bits, and then compressing that with Roaring+Run can be significantly smaller than storing another full Roaring+Run bitset. That's the only part that I've had trouble getting to make it consistently work, and I haven't had time to get back to it, so it's been collecting dust since the end of May. Thanks, Taylor [^1]: Not the bit position, exactly, but rather the 16 least-significant bits of the bit position, since all values in the same container share the 16 most-significant bits being a multiple of 2^16.