Re: reftable [v4]: new ref storage format

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 02 Aug 2017 12:50:39 -0700

Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> ### Layout
>
> The `$GIT_DIR/refs` path is a file when reftable is configured, not a
> directory.  This prevents loose references from being stored.
>
> A collection of reftable files are stored in the `$GIT_DIR/reftable/`
> directory:
>
>     00000001_UF4paF
>     00000002_bUVgy4
>
> where reftable files are named by a unique name such as produced by
> the function:
>
>     mktemp "${update_index}_XXXXXX"
>
> The stack ordering file is `$GIT_DIR/refs` and lists the current
> files, one per line, in order, from oldest (base) to newest (most
> recent):
>
>     $ cat .git/refs
>     00000001_UF4paF
>     00000002_bUVgy4
>
> Readers must read `$GIT_DIR/refs` to determine which files are
> relevant right now, and search through the stack in reverse order
> (last reftable is examined first).
>
> Reftable files not listed in `refs` may be new (and about to be added
> to the stack by the active writer), or ancient and ready to be pruned.

I like the general idea, what the file format can represent and how
it does so, but I am a bit uneasy about how well this "stacked" part
would work for desktop clients.  The structure presented here is for
optimizing the "we want to learn about many (or all) refs" access
pattern, which probably matters a lot for server implementations,
but I do not feel comfortable without knowing how much it penalizes
"I want the current value of this single ref" access pattern.

With the traditional "packed-refs plus loose" layout, no matter how
many times a handful of selected busy refs are updated during the
day, you'd need to open at most two files to find out the current
value of a single ref (admittedly, accessing the second file,
packed-refs, after we realize that there is no loose one, is very
costly).
If you make a few commits on a topic branch A, then build a 100
commit series on top of another topic branch B, finding the current
value of A is still one open and read of refs/heads/A.
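
To make the "at most two files" bound concrete, here is a rough sketch
of the traditional lookup.  It is not Git's actual code: resolve_ref()
is a hypothetical helper, symbolic refs ("ref: ...") are not followed,
and packed-refs is scanned linearly.  It returns the value together
with the number of files it tried to open.

```python
import os
from typing import Optional, Tuple


def resolve_ref(git_dir: str, refname: str) -> Tuple[Optional[str], int]:
    """Return (value, files_tried) for refname; hypothetical helper."""
    tried = 0

    # 1. The loose ref, e.g. $GIT_DIR/refs/heads/A, wins if it exists.
    tried += 1
    loose = os.path.join(git_dir, *refname.split("/"))
    try:
        with open(loose) as f:
            return f.read().strip(), tried
    except (FileNotFoundError, NotADirectoryError, IsADirectoryError):
        pass

    # 2. Otherwise fall back to $GIT_DIR/packed-refs.  This is the
    #    costly part: the whole file may have to be read.
    tried += 1
    try:
        with open(os.path.join(git_dir, "packed-refs")) as f:
            for line in f:
                line = line.rstrip("\n")
                # Skip the header comment and peeled-tag ("^...") lines.
                if not line or line[0] in "#^":
                    continue
                value, name = line.split(" ", 1)
                if name == refname:
                    return value, tried
    except FileNotFoundError:
        pass

    return None, tried
```

However many times B has been updated, resolve_ref(git_dir,
"refs/heads/A") stops after the first open as long as A is loose.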

With the reftable format, we'd need to open and read all 100
incremental transactions that touch branch B before realizing that
none of them talk about A, and read the next transaction file to
find the current value of A.  To keep this number low, we'd need
to compact quite frequently.
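
A toy model of the scenario, to show where the count comes from.  This
is only a sketch: each reftable is stood in for by a Python dict, the
names lookup(), build_example() and compact() are made up for
illustration, and deletions (tombstones) are ignored.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for one reftable file: refname -> value.
Reftable = Dict[str, str]


def lookup(stack: List[Reftable], refname: str) -> Tuple[Optional[str], int]:
    """Search the stack newest-first; return (value, tables examined)."""
    examined = 0
    for table in reversed(stack):
        examined += 1
        if refname in table:
            return table[refname], examined
    return None, examined


def build_example(updates_to_b: int) -> List[Reftable]:
    """Base table holding A and B, then one table per update to B."""
    stack: List[Reftable] = [{"refs/heads/A": "a1", "refs/heads/B": "b0"}]
    for i in range(1, updates_to_b + 1):
        stack.append({"refs/heads/B": f"b{i}"})
    return stack


def compact(stack: List[Reftable]) -> List[Reftable]:
    """Merge all tables into one; newer entries override older ones."""
    merged: Reftable = {}
    for table in stack:  # oldest to newest
        merged.update(table)
    return [merged]


stack = build_example(100)
print(lookup(stack, "refs/heads/A"))           # ('a1', 101)
print(lookup(stack, "refs/heads/B"))           # ('b100', 1)
print(lookup(compact(stack), "refs/heads/A"))  # ('a1', 1)
```

The busy ref B is always found in the newest table, while the quiet
ref A pays for every transaction that happened after it was last
written, until a compaction folds the stack back down.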

We could just declare that the reftable format is not for desktop
clients but for server implementations, where frequent compaction
would not annoy users, but I wish we did not have to.