Re: reftable [v2]: new ref storage format

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 17 Jul 2017 11:53:23 -0700

On Mon, Jul 17, 2017 at 8:01 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:

> A ref block is written as:
>
>     'r'
>     uint24 ( block_len )
>     ref_record*
>     uint32( restart_offset )*
>     uint16( number_of_restarts )
>     padding?
>
...
>
> The 2-byte `number_of_restarts + 1` stores the number of entries in
> the `restart_offset` list.

This means uint16( number_of_restarts ) cannot be 0, but has to be 1
to indicate no restarts?

Why do we need to be non-NUL here, but the padding is all NUL?

When starting to write a block, we need to know exactly how large
the ref_records* and restart offsets need to be to put the
number_of_restarts at the position as promised via block_len.
This sounds complicated unless I missed the obvious.

Going by this, would it rather make sense to omit the block_len
and then scan backwards from *block_size-1 to find the first non-NUL
and that will be the number_of_restarts?

Or we could allow additional padding between ref_record and
restart_offsets, such that the writer implementation has wiggle room
for the restarting logic.

>
> #### log record
>
> Log record keys are structured as:
>
>     ref_name '\0' reverse_int32( time_sec )
>
> where `time_sec` is the update time in seconds since the epoch.

The epoch ends 2038, which is in 21 years. (Did you just volunteer
to fixup the time issues in 20 years?)
If possible I'd prefer a date to be encoded with more range available.

>  The
> `reverse_int32` function inverses the value so lexographical ordering
> the network byte order time sorts more recent records first:
>
>     reverse_int(int32 t) {
>       return 0xffffffff - t;
>     }
>
> Log records have a similar starting structure to ref and index
> records, utilizing the same prefix compression scheme applied to the
> key described above.

The ref names itself are compressed, would we also want to compress
the timing information? The time_sec could be a varint indicating a delta
to the previous change of a ref, or fixed to the epoch if it is the first change
of that ref.

Just to be clear, the ordering here would be

  refs/heads/maint <number>
  refs/heads/maint <smaller number>
  ...
  refs/heads/master <number>
  refs/heads/master <smaller number>

such that refs that have more than one entry in its reflog in a given
refstable file, would have perfect prefix compression for the ref?

> ### Update transactions
>
> Although reftables are immutable, mutations are supported by writing a
> new reftable and atomically appending it to the stack:
>
> 1. Atomically create `stack.lock`
> 2. Copy current stack to `stack.lock`.
> 3. Prepare new reftable in temp file `.refXXXXXXX`.
>    Include log entries.
> 4. Rename (3) to `${sha1}.ref`.
> 5. Append `${sha1}.ref` to `stack.lock`
> 6. Atomically rename `stack.lock` to `stack`.

In case 3.+4. becomes time consuming, it can be prepared outside
the lock, such that inside the lock we'd only need to check
for contention of refs? For example if I'd want to update one ref and
another write wants to update another refs, we'd both be preparing
the a new refstable containing one ref and log, then one of us obtains
the lock and writes. The second writer would then need to inspect
the delta of the stack and see if any ref that they want to change
was touched.

> ### Compaction
>
> A partial stack of reftables can be compacted by merging references
> using a straightforward merge join across reftables, selecting the
> most recent value for output, and omitting deleted references that do
> not appear in remaining, lower reftables.
>
> For sake of illustration, assume the stack currently consists of
> reftable files (from oldest to newest): A, B, C, and D. The compactor
> is going to compact B and C, leaving A and D alone.
>
> 1.  Obtain lock `stack.lock` and read the `stack` file.
> 2.  Obtain locks `B.lock` and `C.lock`.
>     Ownership of these locks prevents other processes from trying
> to compact these files.
> 3.  Release `stack.lock`.
> 4.  Compact `B` and `C` in temp file `.refXXXXXXX`.
> 5.  Reacquire lock `stack.lock`.
> 6.  Verify that `B` and `C` are still in the stack, in that order. This
>     should always be the case, assuming that other processes are adhering
>     to the locking protocol.
> 7.  Rename `.refXXXXXXX` to `X`.
> 8.  Write the new stack to `stack.lock`, replacing `B` and `C` with `X`.
> 9.  Atomically rename `stack.lock` to `stack`.
> 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
>     readers to backtrack.
>
> This strategy permits compactions to proceed independently of updates.

10. could be deferred to gc as well. auto gc would need to learn about
the number
of loose refstables in that case.

Thanks,
Stefan