Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format

Han-Wen Nienhuys <hanwen@xxxxxxxxxx> · Mon, 28 Nov 2022 19:56:55 +0100

On Mon, Nov 7, 2022 at 7:36 PM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
>
> Introduction
> ============
>
> I became interested in our packed-ref format based on the asymmetry between
> ref updates and ref deletions: if we delete a packed ref, then the
> packed-refs file needs to be rewritten. Compared to writing a loose ref,
> this is an O(N) cost instead of O(1).
>
> In this way, I set out with some goals:
>
>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>    updates.
>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>    refs and creating a clear way to snapshot all refs at a given point in
>    time.
>
> I also had one major non-goal to keep things focused:
>
>  * (Non-goal) Update the reflog format.
>
> After carefully considering several options, it seemed that there are two
> solutions that can solve this effectively:
>
>  1. Wait for reftable to be integrated into Git.
>  2. Update the packed-refs backend to have a stacked version.
>
> The reftable work seems currently dormant. The format is pretty complicated
> and I have a difficult time seeing a way forward for it to be fully
> integrated into Git.

The format is somewhat complicated, and I think it would have been
possible to design a block-oriented sorted-table approach that is
simpler, but the JGit implementation has set it in stone. But, to put
this in perspective, the amount of work for getting the format to
read/write correctly has been completely dwarfed by the effort needed
to make the refs API in git represent a true abstraction boundary.
Also, if you're introducing a new format, one might as well try to
optimize it a bit.

Here are some of the hard problems that I encountered:

* Worktrees and the main repository have a separate view of the ref
namespace. This is not explicit in the ref backend API, and there is a
technical limitation that the packed-refs file cannot be in a
worktree. This means that worktrees will always continue to use
loose-ref storage if you only extend the packed-refs backend.

* Symrefs are refs too, but for some reason the packed-refs file
doesn't support them. Does packed-refs v2 support symrefs too?  If you
want to snapshot the state of refs, do you want to snapshot the value
of HEAD too?

* By not changing reflogs, you are making things simpler. (If a
transaction updates the branch that HEAD points to, the reflog for
HEAD has to be updated too. Because reftable updates the reflog
transactionally, this was some extra work.)
Then again, I feel the current way that reflogs work is a bit messy,
because directory/file conflicts force reflogs to be deleted at times
that don't make sense from a user perspective.

* There are a lot of commands that store SHA1s in files under .git/,
and access them as if they were refs (for example: rebase-apply/,
CHERRY_PICK_HEAD etc.).
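
As a rough illustration of the directory/file conflicts mentioned in
the reflog item above: with one file per ref, a name like
refs/heads/topic is a file, so refs/heads/topic/fix cannot exist next
to it. Reflogs under logs/ follow the same path layout, which is why a
leftover reflog can get in the way of a new branch. The helper below is
a hypothetical sketch in Python, not git's actual check:

```python
def df_conflicts(existing, new_ref):
    """Return the existing refs that cannot coexist with new_ref in a
    one-file-per-ref layout, because one name is a path prefix of the other."""
    conflicts = []
    for ref in existing:
        if ref == new_ref:
            continue  # same name: an update, not a conflict
        # "a" vs "a/b": "a" would have to be a file and a directory at once.
        if new_ref.startswith(ref + "/") or ref.startswith(new_ref + "/"):
            conflicts.append(ref)
    return sorted(conflicts)


refs = {"refs/heads/topic", "refs/heads/main"}
print(df_conflicts(refs, "refs/heads/topic/fix"))  # ['refs/heads/topic']
print(df_conflicts(refs, "refs/heads/feature"))    # []
```

A table-based format stores names as plain keys, so this restriction
does not arise from the storage itself.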

> In this RFC, I propose a different model that allows for more customization
> and incremental updates. The extensions.refFormat config key is multi-valued
> and defaults to the list of files and packed. In the context of this RFC,
> the intention is to be able to add packed-v2 so the list of all three values
> would allow Git to write and read either file format version (v1 or v2). In
> the larger scheme, the extension could allow restricting to only loose refs
> (just files) or only packed-refs (just packed) or even later when reftable
> is complete, files and reftable could mean that loose refs are the primary
> ref storage, but the reftable format serves as a drop-in replacement for the
> packed-refs file. Not all combinations need to be understood by Git, but

I'm not sure how feasible this is. reftable also holds reflog data. A
setting {files,reftable} would either not work, or necessitate hairy
merging of data to get the reflogs working correctly.

> In order to optimize the write speed of the packed-refs v2 file format, we
> want to write immediately to the file as we stream existing refs from the
> current refs. The current chunk-format API requires computing the chunk
> lengths in advance, which can slow down the write and take more memory than

Yes, this sounds sensible. reftable has the secondary indexes trailing the data.
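
To make the "index trails the data" idea concrete, here is a minimal
Python sketch. The record layout, footer, and function names are
assumptions made up for illustration; this is neither the reftable nor
the packed-refs v2 on-disk format. The point is only that records can
be streamed out immediately, because the index and its location are
written last:

```python
import io
import struct


def write_with_trailing_index(out, refs):
    """Stream (name, hex_oid) records, then append an offset index and footer.

    Nothing about the index has to be known before the first record is
    written, so no chunk lengths need to be computed in advance.
    """
    offsets = []
    for name, oid in refs:
        offsets.append(out.tell())
        encoded = name.encode()
        # Record: 2-byte name length, name bytes, 20-byte raw OID.
        out.write(struct.pack(">H", len(encoded)) + encoded + bytes.fromhex(oid))
    index_start = out.tell()
    for off in offsets:
        out.write(struct.pack(">Q", off))
    # Footer: where the index starts and how many entries it holds.
    out.write(struct.pack(">QQ", index_start, len(offsets)))


def read_nth(buf, n):
    """Look up the n-th record by reading the footer first, then the index."""
    index_start, count = struct.unpack(">QQ", buf[-16:])
    if not 0 <= n < count:
        raise IndexError(n)
    (off,) = struct.unpack_from(">Q", buf, index_start + 8 * n)
    (name_len,) = struct.unpack_from(">H", buf, off)
    name_end = off + 2 + name_len
    return buf[off + 2:name_end].decode(), buf[name_end:name_end + 20].hex()


out = io.BytesIO()
write_with_trailing_index(
    out, [("refs/heads/main", "11" * 20), ("refs/tags/v1", "22" * 20)]
)
print(read_nth(out.getvalue(), 1))
```

A reader starts from the fixed-size footer at the end of the file, so
the writer never has to seek back and patch a header.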

> Between using raw OIDs and storing the depth-2 prefixes only once, this
> format compresses the file to ~60% of its v1 size. (The format allows not
> writing the prefix chunks, and the prefix chunks are implemented after the
> basics of the ref chunks are complete.)
>
> The write times are reduced in a similar fraction to the size difference.
> Reads are sped up somewhat, and we have the potential to do a ref count by

Do you mean 'enumerate refs'? Why would you want to count refs by prefix?

> I mentioned earlier that I had considered using reftable as a way to achieve
> the stated goals. With the current state of that work, I'm not confident
> that it is the right approach here.
>
> My main worry is that the reftable is more complicated than we need for a
> typical Git repository that is based on a typical filesystem. This makes
> testing the format very critical, and we seem to not be near reaching that
> approach.

I think the base code for reading and writing the reftable format is
tested quite exhaustively in unit tests. You say 'seem', but
do you have anything concrete to say?

> As mentioned, the current extension plan [6] only allows reftable or files
> and does not allow for a mix of both. This RFC introduces the possibility
> that both could co-exist. Using that multi-valued approach means that I'm
> able to test the v2 packed-refs file format almost as well as the v1 file
> format within this RFC. (More tests need to be added that are specific to
> this format, but I'm waiting for confirmation that this is an acceptable
> direction.) At the very least, this multi-valued approach could be used as a
> way to allow using the reftable format as a drop-in replacement for the
> packed-refs file, as well as upgrading an existing repo to use reftable.

The multi-value approach creates more combinations in which
different pieces of code can interact, so I think it actually makes
things more error-prone.

> That might even help the integration process to allow the reftable format to
> be tested at least by some subset of tests instead of waiting for a full
> test suite update.

I don't understand this comment. In the current state,
https://github.com/git/git/pull/1215 already passes 922 of the 968
test files if you set GIT_TEST_REFTABLE=1.

See https://github.com/git/git/pull/1215#issuecomment-1329579459 for
details. As you can see, for most test files, it's just a few
individual test cases that fail.

> I'm interested to hear from people more involved in the reftable work to see
> the status of that project and how it matches or differs from my
> perspective.

Overall, I found the loose/packed ref code hard to understand and
full of arbitrary limitations (dir/file conflicts, deleting reflogs
when branches are deleted, locking across loose/packed refs etc.).
The way reftable stacks are set up (with both reflog and ref data,
including symrefs, in the same file) makes it much easier to verify that
it behaves transactionally.

For deleting refs quickly, it seems that you only need to support
$ZEROID in packed-refs and then implement a ref database as a stack of
packed-ref files? If you're going for minimal effort and minimal
disruption, wouldn't that be the place to start?
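
A minimal in-memory sketch of that idea in Python, with dicts standing
in for packed-ref files. The class and method names are hypothetical
and none of this is existing git code. A deletion is recorded as a
small new layer containing a zero-OID tombstone, so it costs
O(changes) rather than a rewrite of all N refs:

```python
ZERO_OID = "0" * 40  # tombstone value: "this ref is deleted"


class RefStack:
    """Stack of ref tables; newer layers shadow older ones."""

    def __init__(self, base=None):
        self.layers = [dict(base or {})]  # oldest layer first

    def update(self, changes):
        """Record a transaction as a new layer without touching old ones."""
        self.layers.append(dict(changes))

    def delete(self, name):
        """Delete by writing a tombstone instead of rewriting the base."""
        self.update({name: ZERO_OID})

    def resolve(self, name):
        """Return the newest value for name, or None if absent or deleted."""
        for layer in reversed(self.layers):
            if name in layer:
                oid = layer[name]
                return None if oid == ZERO_OID else oid
        return None

    def compact(self):
        """Merge all layers into one and drop tombstones (the O(N) step)."""
        merged = {}
        for layer in self.layers:
            merged.update(layer)
        self.layers = [{n: o for n, o in merged.items() if o != ZERO_OID}]


stack = RefStack({"refs/heads/main": "a" * 40, "refs/heads/old": "b" * 40})
stack.delete("refs/heads/old")
print(stack.resolve("refs/heads/old"))  # None
stack.compact()
print(len(stack.layers))                # 1
```

The O(N) rewrite still happens, but only during compaction, which can
be amortized over many transactions.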

You're concerned about the reftable file format (and maybe rightly
so), but if you're changing the file format anyway and you're not
picking reftable, why not create a block-based, indexed format that
can support storing reflog entries at some point in the future too,
rather than build on (the limitations of) packed-refs? Or is
packed-refs v2 backward compatible with v1? (Could an old git client
read v2 files? I think not, right?)

The reftable project has gotten into a slump because my work
responsibilities have increased over the last 1.5 years, squeezing down
how much time I have for 'fun' projects. I chatted with John Cai, who
was trying to staff this project out of GitLab resources. I don't know
where that stands, though.

> The one thing I can say is that if the reftable work had not already begun,
> then this RFC is how I would have approached a new ref format.
>
> I look forward to your feedback!

Hope this helps.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.