Re: Bloom Filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/9/2018 7:12 PM, Jeff King wrote:
On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.
Thinking on it more, what I was missing is that for truly dense random
bitmaps, this will perform much worse. Because it will use a byte to say
"there's one 1", rather than a bit.

But I think it does OK in practice for the very sparse bitmaps we tend
to see in this application.  I was able to generate a complete output
that can reproduce "log --name-status -t" for linux.git in 32MB. But:

   - 15MB of that is commit sha1s, which will be stored elsewhere in a
     "real" system

   - 5MB of that is path list (which should shrink by a factor of 10 with
     prefix compression, and is really a function of a tree size less
     than history depth)

So the per-commit cost is not too bad. That's still not counting merges,
though, which would add another 10-15% (or maybe more; their bitmaps are
less sparse).

I don't know if this is a fruitful path at all or not. I was mostly just
satisfying my own curiosity on the bitmap encoding question. But I'll
post the patches, just to show my work. The first one is the same
initial proof of concept I showed earlier.

   [1/3]: initial tree-bitmap proof of concept
   [2/3]: test-tree-bitmap: add "dump" mode
   [3/3]: test-tree-bitmap: replace ewah with custom rle encoding

  Makefile                    |   1 +
  t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
  2 files changed, 345 insertions(+)
  create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse engineering how I'm supposed to test it.

Looks like running "t/helper/test-tree-bitmap gen" will output a lot of binary data. Where should I store that? Does any file work?

Is this series just for the storage costs, assuming that we would replace all TREESAME checks with a query into this database? Or do you have a way to test how much this would improve a "git log -- <path>" query?

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux