Re: reftable [v4]: new ref storage format

On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>> 4th iteration of the reftable storage format.
>> [...]
>
> Before we commit to Shawn's reftable proposal, I wanted to explore
> what a contrasting design that is not block based would look like. So
> I threw together a sketch of such a design. It is not as refined as
> Shawn's, and I haven't had time to program an implementation or
> collect statistics, but maybe it is interesting anyway.

Thanks for taking the time to write this out. It's constructive to see
other possible approaches.

I roughly implemented your proposed design in JGit for references
only. I skipped the object lookup and log handling for the sake of
studying the basics of this approach.

       | size   | seek_cold | seek_hot  |
mh     | 28.3 M | 24.5 usec | 14.5 usec |
sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

All tests were run with a 4K chunk/block size on an 866k ref example.
The mh approach does not use padding between chunks, so chunks are
variable sized, but never exceed 4K. The difference between cold and
hot is whether the file was held open and the root of the ref index
was kept in memory.

In the mh approach with a 4K chunk size and 866k refs, the index is 2
levels deep. A cold seek needs 3 disk reads to navigate the index; a
hot seek needs 2.
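
As a rough sanity check on that depth (assuming ~119 refs per leaf
chunk, per the figure further down, and guessing ~120 entries per 4k
index chunk; the latter is not measured):

    866,000 refs / 119 refs per leaf chunk   ~= 7,300 leaf chunks
    7,300 leaves / 120 entries per idx chunk ~= 61 mid-level chunks
    61 mid-level chunks < 120                 -> a single root chunk

So a lookup walks root -> mid-level -> leaf: 3 reads cold, or 2 once
the root chunk is cached.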


> It seems to me that the block-orientation of Shawn's spec adds a lot
> of complication to the design (e.g., padding, optional blocks, varying
> block sizes for different parts of the file). It also seems that the
> format has been heavily optimized to be read in 64k blocks, whereas
> most users will be storing their references to local hard disks or
> (increasingly) to local SSDs. So I tried to come up with a design that
> is friendlier to the latter, hopefully without being much worse for
> the former.

Yes, my design is biased to run well on a 64k block size, where the
underlying disk is slow. Ideally a lookup requires only one or two
block reads. Torn reads (where a record spans two different blocks)
are expensive, as both 64k blocks have to be loaded to acquire the
data that lies across the boundary. Given how fast SSDs are, I don't
think this causes any problems for SSD users.
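
To make the cost concrete, here is a minimal sketch (not JGit code;
the class, BLOCK_SIZE, and the helper are illustrative only) of how a
reader can tell whether a record is torn across aligned 64k blocks:

    class TornReads {
        static final int BLOCK_SIZE = 64 * 1024; // assumed reader block size

        // Number of aligned blocks that must be loaded to read `len`
        // bytes starting at file offset `pos`; 2 means a torn record.
        static int blocksNeeded(long pos, int len) {
            long first = pos / BLOCK_SIZE;
            long last = (pos + len - 1) / BLOCK_SIZE;
            return (int) (last - first + 1);
        }
    }

Padding records out to block boundaries, as reftable does, keeps this
returning 1 for every record, at the price of some wasted space.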


> One thing that I realized is that, assuming that reference values and
> reflogs are usually compacted separately, there will basically be
> three types of reftable files:
>
> 1. Accumulated references and their values (potentially very large)
>
> 2. Accumulated reflogs (potentially gigantic)
>
> 3. Recent transaction(s), which include both refs and reflogs (usually
>    quite small).
>
> The first two types of file don't benefit from separating reference
> data from reflog data in the file (because they only include one or
> the other). The third kind of file doesn't care because such files
> will be small.

In principle, I agree with your simplification. The important thing is
keeping the reflog away from the accumulated references, so that scans
of references (e.g. the current ls-remote advertisement) are
efficient. I think I was also aiming to do this even for the
transaction log files, as ls-remote operations don't need the log
record data.


> So let's store all of the data related to a single reference in one
> spot. This simplifies the file format and removes the need for two
> separate indexes and/or variants that special-case the omission of an
> index. A disadvantage is that, it prevents compression of reflog
> entries across references. I think that is a reasonable tradeoff,
> because most reflog entries (in the repositories I checked; that might
> differ in Gerrit-managed repos) correspond to references that have
> been updated many times. To partly compensate for the resulting
> reduction in compression, I added some features to reduce the storage
> of redundant information in reflog entries.

I found the encoding of reflog entries below somewhat complicated, but
I haven't been able to implement it to compare storage size vs. the
deflated block I proposed.


> On the other hand, this design does not support binary searching for
> refnames. Instead, the refnames are distributed into a multiple-level
> tree. When looking for a reference, each tree node on the way to the
> reference needs to be parsed serially. But the nodes are small enough
> (due to the multilevel indexing scheme) that I hope that isn't a
> problem. I'd rather read less data at the expense of parsing a little
> bit more.

Looking at the numbers above, the binary search within a 4k block does
reduce lookup time. For example, reftable has ~7 binary search restart
points within each 4k block, whereas your approach needs to parse up
to 119 refs in a 4k chunk.
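
For readers unfamiliar with restart points, here is a minimal sketch
of the idea (a simplified model, not the actual JGit code: records are
shown as an already-decoded sorted array, whereas the real block stores
prefix-compressed records and the restart table holds byte offsets):

    class RestartSearch {
        // Find `target` in one block: binary search the restart table
        // for the last restart whose ref name sorts <= target, then
        // scan forward. With ~7 restarts per block the scan covers
        // only ~1/7 of the block's records instead of all of them.
        static int find(String[] refs, int[] restarts, String target) {
            int lo = 0, hi = restarts.length - 1, start = 0;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (refs[restarts[mid]].compareTo(target) <= 0) {
                    start = restarts[mid];
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            for (int i = start; i < refs.length
                    && refs[i].compareTo(target) <= 0; i++) {
                if (refs[i].equals(target))
                    return i;
            }
            return -1; // not present in this block
        }
    }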


> * Multiple related chunks can be stored in a 64 kiB block (interesting
>   for people who prefer to read files in 64k segments). By storing
>   related chunks near each other, the number of seeks will probably
>   typically be smaller than the number of chunks that have to be
>   accessed.

This is difficult. The most related chunks are actually index chunks
in the multi-level index. You want the next step(s) near the current
step to avoid an actual disk seek. But that is hard to do when the
index is several levels deep.


> * The file can be written in two possible ways: with cross-references
>   between chunks always pointing to later parts of the file (probably
>   preferable if the data will be read from a local hard disk) or
>   always pointing to earlier parts of the file (to accommodate writers
>   who need to write their data in order). See the `offsets_negative`
>   field in the header. (I'm not sure that the former variant is
>   actually needed.)

This seems like unnecessary complexity. I'd prefer to just say the
offsets are negative and always reference backwards.
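
The always-backward variant is also the natural fit for a streaming
writer: a chunk's offset is only known once it has been written, so if
index chunks are emitted after the chunks they point to, every delta
is already known at encoding time. A minimal sketch of resolving such
a reference (my reading of the proposal, not its spec; the names are
illustrative):

    class BackwardRefs {
        // `storedDelta` is the positive distance, in bytes, from the
        // referring chunk back to its target, which always precedes it.
        static long resolve(long referringChunkPos, long storedDelta) {
            long target = referringChunkPos - storedDelta;
            if (storedDelta == 0 || target < 0)
                throw new IllegalArgumentException("bad backward reference");
            return target; // absolute file offset of the earlier chunk
        }
    }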


