Re: reftable [v3]: new ref storage format

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 27 Jul 2017 07:28:04 -0700

Thanks for your continued work on this.

I have some comments about v3 (inline below).

On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> 3rd iteration of the reftable storage format.
>
> [...]
> ### Objectives
>
> - Near constant time lookup for any single reference, even when the
>   repository is cold and not in process or kernel cache.
> - Near constant time verification a SHA-1 is referred to by at least
>   one reference (for allow-tip-sha1-in-want).
> - Efficient lookup of an entire namespace, such as `refs/tags/`.
> - Support atomic push `O(size_of_update)` operations.
> - Combine reflog storage with ref storage.

I think that the optimization enabled by obj_records is only
interesting for server-side repositories that will frequently be
fetched from, and which additionally have allow-tip-sha1-in-want
turned on, right? So could we make it explicitly optional?

> [...]
> #### obj record
>
> An `obj_record` describes a single object abbreviation, and the blocks
> containing references using that unique abbreviation:
>
>     varint( prefix_length )
>     varint( (suffix_length << 3) | cnt_3 )
>     suffix
>     varint( cnt_rest )?
>     varint( block_delta )+
>
> Like in reference blocks, abbreviations are prefix compressed within
> an obj block.  On large reftable files with many unique objects,
> higher block sizes (64k), and higher restart interval (128), a
> `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
> obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).
>
> Each record contains `block_count` number of block identifiers for ref
> blocks.  The `block_count` is determined by:
>
>     block_count = cnt_3
>     if (cnt_3 == 0x7) {
>       block_count += cnt_rest
>     }
>
> The `cnt_rest` field is only present when `block_count >= 0x7` and
> could overflow the `cnt_3` field available in the record start.  This
> encoding scheme is used as the vast majority of abbreviations are
> only one reference (or at most a few), and unlikely to exceed 6 blocks.

I agree that this `cnt_3` handling is overly cute. There will never be
records with `block_count == 0`, will there? Then I propose that
`cnt_3` be set to zero if `block_count` is greater than 7, in which
case the full block count is stored as a varint. It's simpler, and I
think the only `block_count`s for which this scheme costs a byte more
than your scheme are implausible: 128-134, 16383-16390, etc.

> [...]
> ### Log block format

I'm still not happy about how you store reflogs. Here are my objections:

* You store reflogs by appending a high-resolution time to the
refname. This is awkward:
  * There is always some uncertainty about uniqueness, even with a
high-resolution clock. If you want to eliminate that uncertainty, you
have to read the last entry in the reflog for the reference *for every
update* to be sure that your timestamp is not already "in use".
  * There is a need to invent microsecond values on systems with
low-resolution clocks, again to ensure uniqueness.
  * If there is clock skew, then either you lose information about the
*order* of reference updates (which I would consider unacceptable), or
you might have to invent fictional times for new updates to retain the
order relative to previous updates. But inventing fictional times is
very problematic: suppose somebody's clock is incorrectly set to the
year 2020 when they make a commit. As far as Git is concerned, that is
plausible, so the reflog entry gets that timestamp. Then the user
corrects the clock and makes another commit. We would either be forced
to rewrite the old reflog entry(ies) or to give a timestamp in 2020 to
the new update.
* With reftable storage, multiple-reference updates can be done
atomically. But the reflogs don't retain the information about which
references were updated together.
* Even if all reflogs are retained, it is not possible to deduce the
global state of all references at a moment in the past.

I propose to solve all of the above problems, plus some others, as follows:

Let's keep an integer `update_index`, which starts at zero and
increases by one each atomic reference update (i.e., it increases by
one even if the update touches multiple references).

Let's store reflog entries under keys `(refname, update_index)`, and
store the timestamp of the entry *within* the record rather than as
part of the key. The tuple `(refname, update_index)` has to be encoded
somehow that ensures the desired order; see below. This encoding
ensures that the order of reference updates is well-defined regardless
of clock skew, and that one can use the reflog to determine which
references were updated together and what the complete state of
references was after any update (assuming that the user retains all
reflogs).

Let's store in the header of each reftable file the `min_update_index`
and `max_update_index` that are covered by the file. Do this such that
the indexes chain together properly for the reftables in the stack;
i.e. `reftable[n].min_update_index == reftable[n-1].max_update_index +
1`. This has two purposes:
* The current `update_index` can be read quickly from the header of
the top reftable in the stack. When you want to create a new reference
update, you can set its `update_index` to that value plus one and know
that it is unique.
* If the table of contents should somehow get lost, the order of the
reftable files can be reconstructed from their `update_index` ranges.

If a range of reftable files is compacted together, the new file gets
its `min_update_index` from the oldest file being compacted and its
`max_update_index` from the newest file. If reflogs are expired, that
doesn't affect the `min_update_index` or `max_update_index` for the
reftables even if some of the `update_index`es in that range no longer
appear in the file.

How to encode the `(refname, update_index)` tuple so that it sorts the
way we want, namely with entries grouped by refname and backwards in
time? The simplest way would be `refname + \0 + uint32(0xffffffff -
update_index)`, and that should prefix-compress relatively well. One
could squeeze that down a bit by using a custom sort function, but
that would complicate the code, and anyway, zlib hides all sins,
doesn't it?

I'm also still unhappy that completely packing references down to one
reftable will require all reflogs to be fully rewritten, even though
the reflogs will mostly be append-only. And I'm worried that your
comforting numbers about how well the reflogs compress won't apply to
GitHub, because we store lots of information in our reflog messages. I
think that some small tweaks would improve the situation.

Suppose one could create `reflog-only` reftables. Then a `pack-refs`
implementation could choose to create one or more `reflog-only`
reftables at the bottom of the stack for accumulating reflogs (such
files would rarely have to be compacted), and on top of that a big
reftable holding most of the reference values, and on top of that
multiple small reftables containing both reference values and reflogs
(for recent updates). Perhaps one would even record in the table of
contents which files in the stack contain what kinds of data, to avoid
(for example) having to open reflog-only files at all if you are only
interested in reference values.

This would require a small amendment to my `update_index` suggestion
above, namely that reflog-only and ref-only reftables might have
overlapping `update_index` ranges, but the end result should be that
the files together cover the whole range of `update_index`es for refs
and also for reflogs.

Michael