On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote: > > I wonder if Roaring does better here. > > In these sparse cases, usually Roaring will organize the data as "array > chunks" which are simply lists of the values. The thing that makes this > still compressible is that we store two bytes per entry, as the entries are > grouped by a common most-significant two bytes. SInce you say ~120k unique > paths, the Roaring bitmap would have two or three chunks per bitmap (and > those chunks could be empty). The overhead to store the chunk positions, > types, and lengths does come at a cost, but it's more like 32 bytes _per > commit_. Hmph. It really sounds like we could do better with a custom RLE solution. But that makes me feel like I'm missing something, because surely I can't invent something better than the state of the art in a simple thought experiment, right? I know what I'm proposing would be quite bad for random access, but my impression is that EWAH is the same. For the scale of bitmaps we're talking about, I think linear/streaming access through the bitmap would be OK. > > So at any rate, I do think it would not be out of the question to store > > bitmaps like this. I'm much more worried about the maintenance cost of > > adding new entries incrementally. I think it's only feasible if we give > > up sorting, and then I wonder what other problems that might cause. > The patch below gives me a starting point to try the Bloom filter approach > and see what the numbers are like. You did all the "git" stuff like > computing the changed paths, so thanks! Great, I hope it can be useful. I almost wrote it as perl consuming the output of "log --format=%h --name-only", but realized I didn't have a perl ewah implementation handy. You'll probably want to tweak this part: > > + prepare_revision_walk(&revs); > > + while ((commit = get_revision(&revs))) { > > + data.commit = commit; > > + diff_tree_combined_merge(commit, 0, &revs); > > + } ...to handle merges in a particular way. This will actually ignore merges totally. You could add "-m" to the revision arguments to get a per-parent diff, but of course you'd see those in your callback individually. If you want to do _just_ the first parent diff, I think you'll have to pick it apart manually, like: while ((commit = get_revision(&revs))) { struct object_id *parent_oid; /* ignore non-first parents, but handle root commits like --root */ if (commit->parents) parent = &commit->parents->item->object.oid; else parent = the_hash_algo->empty_tree; diff_tree_oid(parent, &commit->oid, ...); } -Peff