On Thu, Mar 07, 2024 at 04:26:38PM +0100, Patrick Steinhardt wrote:
> Records store their keys prefix-compressed. As many records will share a
> common prefix (e.g. "refs/heads/"), this can end up saving quite a bit
> of disk space. The downside of this is that it is not possible to just
> seek into the middle of a block and consume the corresponding record
> because it may depend on prefixes read from preceding records.
>
> To help with this use case, the reftable format writes every n'th record
> without using prefix compression, which is called a "restart". The list
> of restarts is stored at the end of each block so that a reader can
> figure out entry points at which to read a full record without having to
> read all preceding records.
>
> This allows us to do a binary search over the records in a block when
> searching for a particular key by iterating through the restarts until
> we have found the section in which our record must be located. From
> there on we perform a linear search to locate the desired record.
>
> This mechanism is broken though. In `block_reader_seek()` we call
> `binsearch()` over the count of restarts in the current block. The
> function we pass to compare records with each other computes the key at
> the current index and then compares it to our search key by calling
> `strbuf_cmp()`, returning its result directly. But `binsearch()` expects
> us to return a truish value that indicates whether the key at the
> current index is smaller than the searched-for key. And unless our key
> exactly matches the value at the restart counter we always end up
> returning a truish value.
>
> The consequence is that `binsearch()` essentially always returns 0,
> indicating to us that we must start searching right at the beginning of
> the block. This works by chance because we now always do a linear scan
> from the start of the block, and thus we would still end up finding the
> desired record. But needless to say, this makes the optimization quite
> useless.
>
> Fix this bug by returning whether the current key is smaller than the
> searched key. As the current behaviour was correct, even if inefficient,
> it is not possible to write a test for this. Furthermore, it is also not
> really possible to demonstrate in a benchmark that this fix speeds up
> seeking records.
>
> This may cause the reader to question whether this binary search makes
> sense in the first place if it doesn't even help with performance. But
> it would end up helping if we were to read a reftable with a much larger
> block size. Blocks can be up to 16MB in size, in which case it will
> become much more important to avoid the linear scan. We are not yet
> ready to read or write such larger blocks though, so we have to live
> without a benchmark demonstrating this.
>
> Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
> ---
>  reftable/block.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/reftable/block.c b/reftable/block.c
> index 72eb73b380..3d7a7022e7 100644
> --- a/reftable/block.c
> +++ b/reftable/block.c
> @@ -302,7 +302,7 @@ static int restart_key_less(size_t idx, void *args)
>
>  	result = strbuf_cmp(&a->key, &rkey);
>  	strbuf_release(&rkey);
> -	return result;
> +	return result <= 0;
>  }
>
>  void block_iter_copy_from(struct block_iter *dest, struct block_iter *src)
> --
> 2.44.0
>

Hum. I noticed that this causes a memory leak in our CI. I'll need to
investigate.

Patrick
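
For readers following along, below is a rough, self-contained sketch of the
seek strategy the commit message describes. It is not the reftable code
itself: binsearch(), restart_key_less() and struct search_args as written
here are simplified stand-ins (plain C strings instead of strbufs, no key
decoding), meant only to illustrate why the predicate handed to the binary
search must be monotonic rather than the raw comparison result.

	#include <stddef.h>
	#include <stdio.h>
	#include <string.h>

	/*
	 * Simplified stand-in for a lower-bound style binary search: returns
	 * the smallest index in [0, sz) for which f() is true, or sz if there
	 * is none. This only works if f() is monotonic, i.e. false for a
	 * prefix of the indices and true for the rest.
	 */
	static size_t binsearch(size_t sz, int (*f)(size_t idx, void *args),
				void *args)
	{
		size_t lo = 0, hi = sz;

		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;
			if (f(mid, args))
				hi = mid;
			else
				lo = mid + 1;
		}
		return lo;
	}

	struct search_args {
		const char **restart_keys; /* keys stored verbatim at restarts */
		const char *wanted;        /* key we are seeking */
	};

	/*
	 * Predicate in the spirit of the fixed restart_key_less(): true once
	 * the wanted key sorts at or before the restart key. Returning the
	 * raw strcmp() result instead would be "true" for every non-matching
	 * key, so the predicate would not be monotonic and the search would
	 * always land on index 0.
	 */
	static int restart_key_less(size_t idx, void *args)
	{
		struct search_args *a = args;
		return strcmp(a->wanted, a->restart_keys[idx]) <= 0;
	}

	int main(void)
	{
		/* Every n'th record is written without prefix compression. */
		const char *restarts[] = {
			"refs/heads/a", "refs/heads/m",
			"refs/heads/t", "refs/tags/v1",
		};
		struct search_args args = {
			.restart_keys = restarts,
			.wanted = "refs/heads/p",
		};
		size_t i = binsearch(4, restart_key_less, &args);

		/*
		 * Step back one restart so the linear scan cannot start past
		 * the wanted key.
		 */
		if (i > 0)
			i--;

		printf("linear scan starts at restart %zu (key %s)\n",
		       i, restarts[i]);
		return 0;
	}

Compiled as C99, this prints restart 1 ("refs/heads/m") as the starting
point for "refs/heads/p". If the predicate returned the raw strcmp() result,
the search would always report index 0, which is the degradation to a
full-block linear scan described in the commit message above.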