"Abhradeep Chakraborty via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx> > > Roaring bitmaps are said to be more efficient (most of the time) than > ewah bitmaps. So Git might gain some optimization if it support roaring > bitmaps. As Roaring library has all the changes it needed to implement > roaring bitmaps in Git, Git can learn to write roaring bitmaps. However, > all the changes are backward-compatible. > > Teach Git to write roaring bitmaps. That is way underexplained. At least cover what the plans are, so that readers do not have to ask these questions: * When is the choice of bitmap type is made? Is it fixed at repository initialization time and once chosen other kinds cannot be used? * Is the bitmap file self describing? How does a reader know between ewah and roaring codepaths to use to read a given bitmap file? Is there enough room for extending the set of bitmap formats, or we cannot add other formats easily? > Mentored-by: Taylor Blau <me@xxxxxxxxxxxx> > Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@xxxxxxxxx> > Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx> > --- > Makefile | 1 + > bitmap.c | 225 +++++++++++++++++++++++++++ > bitmap.h | 33 ++++ > builtin/diff.c | 10 +- > ewah/bitmap.c | 61 +++++--- > ewah/ewok.h | 37 ++--- > pack-bitmap-write.c | 326 ++++++++++++++++++++++++++++++---------- > pack-bitmap.c | 114 +++++++------- > pack-bitmap.h | 22 ++- > t/t5310-pack-bitmaps.sh | 17 +++ > 10 files changed, 664 insertions(+), 182 deletions(-) > create mode 100644 bitmap.c > create mode 100644 bitmap.h > > diff --git a/Makefile b/Makefile > index e9537951105..9ca19b3ca8d 100644 > --- a/Makefile > +++ b/Makefile > @@ -900,6 +900,7 @@ LIB_OBJS += archive.o > LIB_OBJS += attr.o > LIB_OBJS += base85.o > LIB_OBJS += bisect.o > +LIB_OBJS += bitmap.o > LIB_OBJS += blame.o > LIB_OBJS += blob.o > LIB_OBJS += bloom.o > diff --git a/bitmap.c b/bitmap.c > new file mode 100644 > index 00000000000..7d547eb9f53 > --- /dev/null > +++ b/bitmap.c > @@ -0,0 +1,225 @@ > +#include "bitmap.h" > +#include "cache.h" > + > +static enum bitmap_type bitmap_type = INIT_BITMAP_TYPE; "INIT" is a strange name for "UNINITIALIZED". Especially ... > +void *roaring_or_ewah_bitmap_init(void) > +{ > + switch (bitmap_type) > + { (Style) > + case EWAH: > + return ewah_new(); > + case ROARING: > + return roaring_bitmap_create(); > + default: ... here, you use it to mean exactly that. > + error(_("bitmap type not initialized\n")); > + return NULL; Do you really need the global variable that holds the bitmap type? Wouldn't it be easier to write code that needs to deal with both types (e.g. in a repository with existing ewah bitmap, you want to do a repack and index the result using the roaring bitmap) if you passed the type through the callchain as a parameter? It may be that the codepath that reads from an existing bitmap file says "ah, the file given to us seems to be in format X (either EWAH or ROARING or perhaps something else), so let's call bitmap_init(X) to obtain the in-core data structure to deal with that file". When that happens, you may probably need to have two cases in the default: arm of this switch statement, i.e. one to diagnose a BUG() to pass an uninitialized bitmap type to the codepath, and the other to diagnose a runtime error() to have read a bitmap file whose format this version of Git does not understand. > +void *roaring_or_raw_bitmap_copy(void *bitmap) > +{ > + switch (bitmap_type) > + { > + case EWAH: > ... > +int roaring_or_ewah_bitmap_set(void *bitmap, uint32_t i) > +{ > + switch (bitmap_type) { > + case EWAH: > +... > +void roaring_or_raw_bitmap_set(void *bitmap, uint32_t i) > +{ > + switch (bitmap_type) > + { > + case EWAH: > +... > +void roaring_or_raw_bitmap_unset(void *bitmap, uint32_t i) > +{ > + switch (bitmap_type) > + { > + case EWAH: > +... These repetitive patterns makes me wonder if void *bitmap is a good type to be passing around. Shouldn't it be a struct with its first member being a bitmap_type, and another member being what these functions are passing to the underlying bitmap format specific functions as "bitmap"? E.g. void bitmap_unset(struct bitmap *bm, uint32_t i) { switch (bm->type) { case EWAH: ewah_bitmap_remove(bm->u.ewah, i); break; ... > \ No newline at end of file Careful. > diff --git a/bitmap.h b/bitmap.h > new file mode 100644 > index 00000000000..d75400922cc > --- /dev/null > +++ b/bitmap.h > @@ -0,0 +1,33 @@ > +#ifndef __BITMAP_H__ > +#define __BITMAP_H__ > + > + > +#include "git-compat-util.h" > +#include "ewah/ewok.h" > +#include "roaring/roaring.h" > + > +enum bitmap_type { > + INIT_BITMAP_TYPE = 0, "UNINITIALIZED_BITMAP_TYPE", probably. > +void *roaring_or_ewah_bitmap_init(void); I would strongly suggest reconsider these names. What if you later want to add the third variant? roaring_or_ewah_or_xyzzy_bitmap_init()? Instead just use the most generic name, like "bitmap_init", perhaps something along the lines of ... struct bitmap { enum bitmap_type type; union { struct ewah_bitmap *ewah; struct roaring_bitmap *roaring; } u; }; struct bitmap *bitmap_new(enum bitmap_type type) { struct bitmap *bm = xmalloc(sizeof(*bm)); bm->type = type; switch (bm->type) { case EWAH: bm->u.ewah = ewah_new(); break; case ROARING: bm->u.roaring = roaring_bitmap_create(); break; default: die(_("unknown bitmap type %d"), (int)type); } return bm; } I dunno.