Re: [PATCH 00/10] reftable: fix reload with active iterators

Patrick Steinhardt <ps@xxxxxx> · Thu, 22 Aug 2024 14:53:48 +0200

On Thu, Aug 22, 2024 at 08:41:00AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote:
> 
> > This patch series fixes that issue by starting to refcount the readers.
> > Each iterator will bump the refcount, thus avoiding the problem. While
> > at it, I also found a second issue where we segfault when reloading a
> > table fails while reusing one of the table readers. In this scenario, we
> > would end up releasing the reader of the stack itself, even though it
> > would still be used by it.
> 
> I gave a fairly cursory look over this, as I'm not all that familiar
> with the reftable code. But everything looked pretty sensible to me.

Thanks for your review, appreciated!

> I wondered how we might test this. It looks like you checked the
> --stress output (which I confirmed seems fixed for me), along with a
> synthetic test directly calling various reftable functions. Both seem
> good to me.
> 
> I had hoped to be able to have a non-racy external test, where running
> actual Git commands showed the segfault in a deterministic way. But I
> couldn't come up with one. One trick we've used for "pausing" a reading
> process is to run "cat-file --batch" from a fifo. You can ask it to do
> some ref lookups, then while it waits for more input, update the
> reftable, and then ask it to do more.
> 
> But I don't think that's sufficient here, as the race happens while we
> are actually iterating. You'd really need some for_each_ref() callback
> that blocks in some externally controllable way. Possibly you could do
> something clever with partial-clone lazy fetches (where you stall the
> fetch and then do ref updates in the middle), but that is getting pretty
> convoluted.
> 
> So I think the tests you included seem like a good place to stop.

Yeah. I think demonstrating the issues on the API level is okayish. It
would of course have been nice to also do it via our shell scripts, but
as you say, it feels rather fragile.

> I did have a little trouble applying this for testing. I wanted to do it
> on 'next', which has the maintenance changes to cause the race. But
> merging it there ended up with a lot of conflicts with other reftable
> topics (especially the tests). I was able to resolve them all, but you
> might be able to make Junio's life easier by coordinating the topics a
> bit.

I can certainly do that. I know that there are conflicts both with the
patch series dropping the generic tables and with the patch series that
move the reftable unit tests into our own codebase. I did address the
former by basing it on top of that series, but didn't yet address the
latter.

I'm okay with waiting a bit until most of the conflicting topics land. I
guess most of them should be close to landing anyway. Alternatively, I
can also pull all of them in as dependencies.

Patrick