Re: reftable [v6]: new ref storage format

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Fri, 18 Aug 2017 11:24:18 +0200

On Wed, Aug 16, 2017 at 12:47 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> On Mon, Aug 14, 2017 at 5:13 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>> On 08/07/2017 03:47 AM, Shawn Pearce wrote:
>>> 6th iteration of the reftable storage format.
> [...]
>>> #### index record
>>>
>>> An index record describes the last entry in another block.
>>> Index records are written as:
>>>
>>>     varint( prefix_length )
>>>     varint( (suffix_length << 3) | 0 )
>>>     suffix
>>>     varint( block_position )
>>>
>>> Index records use prefix compression exactly like `ref_record`.
>>>
>>> Index records store `block_position` after the suffix, specifying the
>>> absolute position in bytes (from the start of the file) of the block
>>> that ends with this reference.
>>
>> Is there a reason that the index lists the *last* refname that is
>> contained in a block rather than the *first* refname? I can't think of a
>> reason to choose one vs. the other, but your choice was initially
>> surprising. I don't think it matters either way; I was just curious.
>
> Yes, there is a reason. When a reader is searching the index block and
> discovers a key that is greater than their search needle, they are now
> sitting on a record with the block_position for that greater key. By
> using the *last* refname the current block_position is the one to seek
> to.
>
> If instead we used *first* refname, the reader would now have to
> backtrack to the prior index record to get the block_position out of
> that record. Or it has to keep a running "prior_position" local
> variable.
>
> Using last simplifies the reader's code.

Ah, OK. I was thinking of this as being a binary search, in which case
you *must* see both bracketing records before you are done, and the
chances are 50-50 which one you see first. But this search is a little
bit different, because the index records within a restart block have
to be scanned linearly. So it is much more likely that you see the
"before" record followed by the "after" record.

Thanks for the explanation.

Michael