Re: reftable [v3]: new ref storage format

Shawn Pearce <spearce@xxxxxxxxxxx> · Fri, 28 Jul 2017 15:12:23 -0700

On Thu, Jul 27, 2017 at 7:28 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
> On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>> 3rd iteration of the reftable storage format.
>>
>> [...]
>> ### Objectives
>>
>> - Near constant time lookup for any single reference, even when the
>>   repository is cold and not in process or kernel cache.
>> - Near constant time verification a SHA-1 is referred to by at least
>>   one reference (for allow-tip-sha1-in-want).
>> - Efficient lookup of an entire namespace, such as `refs/tags/`.
>> - Support atomic push `O(size_of_update)` operations.
>> - Combine reflog storage with ref storage.
>
> I think that the optimization enabled by obj_records is only
> interesting for server-side repositories that will frequently be
> fetched from, and which additionally have allow-tip-sha1-in-want
> turned on, right? So could we make it explicitly optional?

Correct. Its already optional. Maybe I'm just not clear enough about
that in the document. I agree its not required in the file, and
writers can choose to omit obj_record.

>> #### obj record
>>
>> An `obj_record` describes a single object abbreviation, and the blocks
>> containing references using that unique abbreviation:
>>
>>     varint( prefix_length )
>>     varint( (suffix_length << 3) | cnt_3 )
>>     suffix
>>     varint( cnt_rest )?
>>     varint( block_delta )+
>>
> [...]
>> Each record contains `block_count` number of block identifiers for ref
>> blocks.  The `block_count` is determined by:
>>
>>     block_count = cnt_3
>>     if (cnt_3 == 0x7) {
>>       block_count += cnt_rest
>>     }
>>
>> The `cnt_rest` field is only present when `block_count >= 0x7` and
>> could overflow the `cnt_3` field available in the record start.  This
>> encoding scheme is used as the vast majority of abbreviations are
>> only one reference (or at most a few), and unlikely to exceed 6 blocks.
>
> I agree that this `cnt_3` handling is overly cute. There will never be
> records with `block_count == 0`, will there? Then I propose that
> `cnt_3` be set to zero if `block_count` is greater than 7, in which
> case the full block count is stored as a varint. It's simpler, and I
> think the only `block_count`s for which this scheme costs a byte more
> than your scheme are implausible: 128-134, 16383-16390, etc.

That is a great idea, thanks. I've updated the draft to reflect your
approach of cnt_3 = 0 indicates the actual count follows in a varint.

>> [...]
>> ### Log block format
>
> I'm still not happy about how you store reflogs. Here are my objections:
>
> * You store reflogs by appending a high-resolution time to the
> refname. This is awkward:
>   * There is always some uncertainty about uniqueness, even with a
> high-resolution clock. If you want to eliminate that uncertainty, you
> have to read the last entry in the reflog for the reference *for every
> update* to be sure that your timestamp is not already "in use".
>   * There is a need to invent microsecond values on systems with
> low-resolution clocks, again to ensure uniqueness.
>   * If there is clock skew, then either you lose information about the
> *order* of reference updates (which I would consider unacceptable), or
> you might have to invent fictional times for new updates to retain the
> order relative to previous updates. But inventing fictional times is
> very problematic: suppose somebody's clock is incorrectly set to the
> year 2020 when they make a commit. As far as Git is concerned, that is
> plausible, so the reflog entry gets that timestamp. Then the user
> corrects the clock and makes another commit. We would either be forced
> to rewrite the old reflog entry(ies) or to give a timestamp in 2020 to
> the new update.
> * With reftable storage, multiple-reference updates can be done
> atomically. But the reflogs don't retain the information about which
> references were updated together.
> * Even if all reflogs are retained, it is not possible to deduce the
> global state of all references at a moment in the past.
>
> I propose to solve all of the above problems, plus some others, as follows:
>
> Let's keep an integer `update_index`, which starts at zero and
> increases by one each atomic reference update (i.e., it increases by
> one even if the update touches multiple references).
>
> Let's store reflog entries under keys `(refname, update_index)`, and
> store the timestamp of the entry *within* the record rather than as
> part of the key. The tuple `(refname, update_index)` has to be encoded
> somehow that ensures the desired order; see below. This encoding
> ensures that the order of reference updates is well-defined regardless
> of clock skew, and that one can use the reflog to determine which
> references were updated together and what the complete state of
> references was after any update (assuming that the user retains all
> reflogs).
>
> Let's store in the header of each reftable file the `min_update_index`
> and `max_update_index` that are covered by the file. Do this such that
> the indexes chain together properly for the reftables in the stack;
> i.e. `reftable[n].min_update_index == reftable[n-1].max_update_index +
> 1`. This has two purposes:
> * The current `update_index` can be read quickly from the header of
> the top reftable in the stack. When you want to create a new reference
> update, you can set its `update_index` to that value plus one and know
> that it is unique.
> * If the table of contents should somehow get lost, the order of the
> reftable files can be reconstructed from their `update_index` ranges.
>
> If a range of reftable files is compacted together, the new file gets
> its `min_update_index` from the oldest file being compacted and its
> `max_update_index` from the newest file. If reflogs are expired, that
> doesn't affect the `min_update_index` or `max_update_index` for the
> reftables even if some of the `update_index`es in that range no longer
> appear in the file.
>
> How to encode the `(refname, update_index)` tuple so that it sorts the
> way we want, namely with entries grouped by refname and backwards in
> time? The simplest way would be `refname + \0 + uint32(0xffffffff -
> update_index)`, and that should prefix-compress relatively well. One
> could squeeze that down a bit by using a custom sort function, but
> that would complicate the code, and anyway, zlib hides all sins,
> doesn't it?

I'm with you this far, and like the {min,max}_update_index in the
header. I'm concerned about update_index in 32 bits. At some point you
need to reset the counter, or the repository is broken. 4b updates is
enough for anyone? I'd feel better about this being a 64 bit field.

> I'm also still unhappy that completely packing references down to one
> reftable will require all reflogs to be fully rewritten, even though
> the reflogs will mostly be append-only. And I'm worried that your
> comforting numbers about how well the reflogs compress won't apply to
> GitHub, because we store lots of information in our reflog messages. I
> think that some small tweaks would improve the situation.
>
> Suppose one could create `reflog-only` reftables. Then a `pack-refs`
> implementation could choose to create one or more `reflog-only`
> reftables at the bottom of the stack for accumulating reflogs (such
> files would rarely have to be compacted), and on top of that a big
> reftable holding most of the reference values, and on top of that
> multiple small reftables containing both reference values and reflogs
> (for recent updates). Perhaps one would even record in the table of
> contents which files in the stack contain what kinds of data, to avoid
> (for example) having to open reflog-only files at all if you are only
> interested in reference values.
>
> This would require a small amendment to my `update_index` suggestion
> above, namely that reflog-only and ref-only reftables might have
> overlapping `update_index` ranges, but the end result should be that
> the files together cover the whole range of `update_index`es for refs
> and also for reflogs.

Ack, I see the value in your suggestion for a reflog-only table.

I'll rework the document to incorporate your feedback above, and post
another iteration to the list.