On 9/19/2022 1:47 PM, Abhradeep Chakraborty via GitGitGadget wrote: > Git currently uses ewah bitmaps ( which are based on run-length encoding) to > compress bitmaps. Ewah bitmaps stores bitmaps in the form of run-length > words i.e. instead of storing each and every bit, it tries to find > consecutive bits (having same value) and replace them with the value bit and > the range upto which the bit is present. It is simple and efficient. But one > downside of this approach is that we have to decompress the whole bitmap in > order to find the bit of a certain position. > > For small (or medium sized) bitmaps, this is not an issue. But it can be an > issue for large (or extra large) bitmaps. In that case roaring bitmaps are > generally more efficient[1] than ewah itself. Some benchmarks suggests that > roaring bitmaps give more performance benefits than ewah or any other > similar compression technique. > > This patch series is currently in RFC state and it aims to let Git use > roaring bitmaps. As this is an RFC patch series (for now), the code are not > fully accurate (i.e. some tests are failing). But it is backward-compatible > (tests related to ewah bitmaps are passing). Some commit messages might need > more explanation and some commits may need a split (specially the one that > implement writing roaring bitmaps). Overall, the structure and code are near > to ready to make the series a formal patch series. > > I am submitting it as an RFC (after discussions with mentors) because the > GSoC coding period is about to end. I will continue to work on the patch > series. I look forward to your next version. I hope to see some information about the performance characteristics across the two versions. Specifically: 1. How do various test in t/perf/ change between the two formats? 2. For certain test repos (git/git, torvalds/linux, etc.) how much does the .bitmap file change in size across the formats? > Makefile | 3 + > bitmap.c | 225 + > bitmap.h | 33 + ... > ewah/bitmap.c | 61 +- > ewah/ewok.h | 37 +- ... > roaring/roaring.c | 20047 +++++++++++++++++++++++++++++++++++ > roaring/roaring.h | 1028 ++ I wonder if there is value in modifying the structure of these files into a bitmap/ directory and then perhaps ewah/ and roaring/ within each? Just a thought. Thanks, -Stolee