On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote: > Derrick Stolee <stolee@xxxxxxxxx> writes: > > > On 6/14/2018 11:44 PM, Jeff King wrote: > >> The return value of ewah_read_mmap() is now an ssize_t, > >> since we could (in theory) process up to 32GB of data. This > >> would never happen in practice, but a corrupt or malicious > >> .bitmap or index file could convince us to do so. > > > > Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2 > > (signed) or 4 GB (unsigned). Is there something specifically in the > > bitmap format that stops at this size? > > The proposed log message for 1/3 has this bit > > - check the size for integer overflow (this should be > impossible on 64-bit, as the size is given as a 32-bit > count of 8-byte words, but is possible on a 32-bit > system) > > 4 Giga 8-byte words makes 32 Giga bytes, I'd guess. Yes, exactly. It's definitely an odd size. I think using the same varints we use in the packfile would be more efficient (and they're already variable-length records), though we tend to have few enough of these that it probably doesn't matter. I think a more useful change for the bitmap format would be some kind of index. IIRC, right now readers have to linearly scan the whole file in order to use a single bitmap. With Stolee's commit-metadata files, we could potentially move to storing reachability bitmaps as a data element there. But there are two complications: - the bitmaps themselves are dependent on having an ordered list of objects (one per bit) to talk about. And that's why they're integrated with packfiles, since that provides a constrained universe with a set ordering. - the existing storage isn't entirely independent between bitmaps. Some of them are basically "deltas" off of nearby bitmaps. -Peff