Re: reftable [v2]: new ref storage format

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 17 Jul 2017 12:04:34 -0700

On Mon, Jul 17, 2017 at 11:53 AM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
> On Mon, Jul 17, 2017 at 8:01 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>
>> A ref block is written as:
>>
>>     'r'
>>     uint24 ( block_len )
>>     ref_record*
>>     uint32( restart_offset )*
>>     uint16( number_of_restarts )
>>     padding?
>>
> ...
>>
>> The 2-byte `number_of_restarts + 1` stores the number of entries in
>> the `restart_offset` list.
>
> This means uint16( number_of_restarts ) cannot be 0, but has to be 1
> to indicate no restarts?

A block must have at least one restart in it, the first ref_record
must be a restart. So number_of_restarts in the tail of the block can
be 0, which implies 1 restart (number_of_restarts + 1), and the first
restart is required at the first ref_record. :)

> When starting to write a block, we need to know exactly how large
> the ref_records* and restart offsets need to be to put the
> number_of_restarts at the position as promised via block_len.
> This sounds complicated unless I missed the obvious.

Correct. The writer needs to compute the block size before it writes
the block. It does so by buffering the block contents until its
approximately full, then fixes block_len, and flushes the block.

> Going by this, would it rather make sense to omit the block_len
> and then scan backwards from *block_size-1 to find the first non-NUL
> and that will be the number_of_restarts?

Not quite. On small reftable files the "physical" block may be shared
with a log block ('g'). We need to be able to reliably find the of the
ref block ('r'), without padding between the two blocks.

> Or we could allow additional padding between ref_record and
> restart_offsets, such that the writer implementation has wiggle room
> for the restarting logic.

I had that in an older format description, and removed it. Placing the
padding at the end of the block was simpler for the reader and writer
implementation to handle.

>> Log record keys are structured as:
>>
>>     ref_name '\0' reverse_int32( time_sec )
>>
>> where `time_sec` is the update time in seconds since the epoch.
>
> The epoch ends 2038, which is in 21 years. (Did you just volunteer
> to fixup the time issues in 20 years?)
> If possible I'd prefer a date to be encoded with more range available.

Good point. However, I think in 20 years we'll see 2 more hash
functions for Git, and we can bump reftable to v2 and expand the field
to 8 bytes.

> The ref names itself are compressed, would we also want to compress
> the timing information?

The time field is also prefix compressed as part of the ref name's
prefix compression. So there is no need to move to the complexity of a
varint or anything else.

>> ### Update transactions
>>
>> Although reftables are immutable, mutations are supported by writing a
>> new reftable and atomically appending it to the stack:
>>
>> 1. Atomically create `stack.lock`
>> 2. Copy current stack to `stack.lock`.
>> 3. Prepare new reftable in temp file `.refXXXXXXX`.
>>    Include log entries.
>> 4. Rename (3) to `${sha1}.ref`.
>> 5. Append `${sha1}.ref` to `stack.lock`
>> 6. Atomically rename `stack.lock` to `stack`.
>
> In case 3.+4. becomes time consuming, it can be prepared outside
> the lock, such that inside the lock we'd only need to check
> for contention of refs? For example if I'd want to update one ref and
> another write wants to update another refs, we'd both be preparing
> the a new refstable containing one ref and log, then one of us obtains
> the lock and writes. The second writer would then need to inspect
> the delta of the stack and see if any ref that they want to change
> was touched.

Excellent point, it reduces the contention window for non-conflicting
writes. I will update this section with your input, thank you Stefan.