On 6/15/2018 1:31 PM, Jeff King wrote:
On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:
On 6/14/2018 11:44 PM, Jeff King wrote:
The return value of ewah_read_mmap() is now an ssize_t,
since we could (in theory) process up to 32GB of data. This
would never happen in practice, but a corrupt or malicious
.bitmap or index file could convince us to do so.
Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
(signed) or 4 GB (unsigned). Is there something specifically in the
bitmap format that stops at this size?
The proposed log message for 1/3 has this bit
- check the size for integer overflow (this should be
impossible on 64-bit, as the size is given as a 32-bit
count of 8-byte words, but is possible on a 32-bit
system)
4 Giga 8-byte words makes 32 Giga bytes, I'd guess.
Yes, exactly. It's definitely an odd size. I think using the same
varints we use in the packfile would be more efficient (and they're
already variable-length records), though we tend to have few enough of
these that it probably doesn't matter.
I think a more useful change for the bitmap format would be some kind of
index. IIRC, right now readers have to linearly scan the whole file in
order to use a single bitmap.
With Stolee's commit-metadata files, we could potentially move to
storing reachability bitmaps as a data element there. But there are two
complications:
- the bitmaps themselves are dependent on having an ordered list of
objects (one per bit) to talk about. And that's why they're
integrated with packfiles, since that provides a constrained universe
with a set ordering.
- the existing storage isn't entirely independent between bitmaps. Some
of them are basically "deltas" off of nearby bitmaps.
The VSTS reachability bitmap is different from the Git bitmap in a few ways.
1. It uses Roaring+Run bitmaps [1] instead of EWAH bitmap. This format
is simpler (in my opinion) in several ways, especially in how it splits
the bitmap into 16-bit address ranges and compresses each on its own.
This makes set containment queries much faster, as we can navigate
directly to the region that is important (and then binary search if that
chunk is an "array" or "run" chunk). I say this is simpler because I can
explain the entire compression format to you in five minutes and a
whiteboard. The paper [2] is a simple enough read (start at the "Roaring
Bitmap" section at the end of page 4).
2. Our file uses a chunk-based approach similar to the commit-graph
file: one chunk is simply a list of the commits that have bitmaps.
Another chunk is the raw binary data of all bitmaps concatenated
together. The last chunk connects the other two chunks by a) providing
an offset into the binary data for the bitmap at that commit, and b) the
commit that provides a delta base for the bitmap.
If we are considering changing the reachability bitmap, then I'm very
intrigued. I think the number one thing to consider is to use the
multi-pack-index as a reference point (with a stable object order) so
the objects can be repacked independently from the reachability bitmap
computation. If we are changing the model at that level, then it is
worth thinking about other questions, like how we index the file or how
we compress the bitmaps.
Thanks,
-Stolee
[1] https://roaringbitmap.org/about/
[2] https://arxiv.org/abs/1603.06549
Consistently faster and smaller compressed bitmaps with Roaring