On Wed, Oct 10, 2018 at 09:58:51AM +0900, Junio C Hamano wrote: > > +static void bitmap_to_rle(struct strbuf *out, struct bitmap *bitmap) > > +{ > > + int curval = 0; /* count zeroes, then ones, then zeroes, etc */ > > + size_t run = 0; > > + size_t word; > > + size_t orig_len = out->len; > > + > > + for (word = 0; word < bitmap->word_alloc; word++) { > > + int bit; > > + > > + for (bit = 0; bit < BITS_IN_EWORD; bit++) { > > + int val = !!(bitmap->words[word] & (((eword_t)1) << bit)); > > + if (val == curval) > > + run++; > > + else { > > + strbuf_add_varint(out, run); > > + curval = 1 - curval; /* flip 0/1 */ > > + run = 1; > > + } > > + } > > OK. I find it a bit disturbing to see that the loop knows a bit too > much about how "struct bitmap" is implemented, but that is a complaint > against the bitmap API, not this new user of the API. Heh, again, this is not really meant to be production code. I'm not at all happy about inventing a new compressed bitmap format here, and I'd want to investigate the state of the art a bit more. In particular, the worst case here is quite bad, and I wonder if there are formats that can select the best encoding when writing a bitmap (naive RLE when it's good, something else other times). I also suspect part of why this does better is that other formats are optimized less for our case. We really don't care about setting or looking at a few bits part way through a bitmap. Our bitmaps are small enough that we don't mind streaming through a whole one. It's just that we have so _many_ of them that we want to be meticulous about wasted bytes. Whatever format we choose, I think it would become part of the bitmap.c file, and internal details would be OK to access there. I just put it here to keep the patch simple. > We do not try to handle the case where bitmap has bits that is not > multiple of BITS_IN_EWORD and instead pretend that size of such a > bitmap can be rounded up, because we ignore trailing 0-bit anyway, > and we know the "struct bitmap" would pad with 0-bit at the tail? Right. We do not know the "real" number of zero bits at all. It's just assumed that there are infinite zeroes trailing off the end (and this is how "struct bitmap" works, since it is the one that does not bother to keep a separate size pointer). > > + /* > > + * ugh, varint does not seem to have a way to prevent reading past > > + * the end of the buffer. We'll do a length check after each one, > > + * so the worst case is bounded. > > + */ > > Sorry about that :-). :) We may want to address that. I know we did some hardening about reading off the end of .pack and .idx files. But it seems like any user of decode_varint() may read up to 16 bytes past the end of a buffer. We seem to only use them for the $GIT_DIR/index, though. Anybody with a "struct hashfile" result at least has a 20-byte trailer we can accidentally read from. But I wouldn't be surprised if there's a way to trick it in practice. -Peff