Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Fri, 24 Nov 2023 21:49:28 -0800

The code makes me feel we're fixing the wrong thing, but I have to
admit I don't fully understand it. Let me go step by step.

> While stress-testing online repair of btrees, I noticed periodic
> assertion failures from the buffer cache about buffer readers
> encountering buffers with DELWRI_Q set, even though the btree bulk load
> had already committed and the buffer itself wasn't on any delwri list.

What assert do these buffer readers hit?  The two I can spot that are in
the read path in the broader sense are:

  1) in xfs_buf_find_lock for stale buffers.
  2) in __xfs_buf_submit just before I/O submission

> I traced this to a misunderstanding of how the delwri lists work,
> particularly with regards to the AIL's buffer list.  If a buffer is
> logged and committed, the buffer can end up on that AIL buffer list.  If
> btree repairs are run twice in rapid succession, it's possible that the
> first repair will invalidate the buffer and free it before the next time
> the AIL wakes up.  This clears DELWRI_Q from the buffer state.

Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
via xfs_trans_binval?

> If the second repair allocates the same block, it will then recycle the
> buffer to start writing the new btree block.

If my above theory is correct: how do we end up reusing a stale buffer?
If not, what am I misunderstanding above?

> Meanwhile, if the AIL
> wakes up and walks the buffer list, it will ignore the buffer because it
> can't lock it, and go back to sleep.

And I think this is where the trouble starts - we have a buffer that
is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared;
it is stale and we then reuse it.  I think we need to kick it off the
delwri list not just for btree staging, but in general.

> 
> When the second repair calls delwri_queue to put the buffer on the
> list of buffers to write before committing the new btree, it will set
> DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> buffer list, it won't add it to the bulkload buffer's list.
>
> This is incorrect, because the bulkload caller relies on delwri_submit
> to ensure that all the buffers have been sent to disk /before/
> committing the new btree root pointer.  This ordering requirement is
> required for data consistency.
>
> Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> drop it, so the next thread to walk through the btree will trip over a
> debug assertion on that flag.

Where does it finally drop it?
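
For reference, the sequence described so far can be modeled in
userspace as below.  This is only a sketch, not the kernel code: all
toy_* names are hypothetical, and the queueing rule (set the flag, link
the buffer only if it is not already linked somewhere) is taken from the
description quoted above rather than from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DELWRI_Q 0x1u           /* stand-in for _XBF_DELWRI_Q */

struct toy_list {
    const char *name;
    int nr;                         /* buffers linked on this list */
};

struct toy_buf {
    unsigned int flags;
    struct toy_list *on_list;       /* NULL when not linked anywhere */
};

/* Queue a buffer: set the flag, but only link it if it is unlinked. */
bool toy_delwri_queue(struct toy_buf *bp, struct toy_list *list)
{
    if (bp->flags & TOY_DELWRI_Q)
        return false;               /* already queued */
    bp->flags |= TOY_DELWRI_Q;
    if (bp->on_list == NULL) {
        bp->on_list = list;
        list->nr++;
    }
    return true;
}

/* Stale a buffer: the flag is cleared, the list linkage is left alone. */
void toy_stale(struct toy_buf *bp)
{
    bp->flags &= ~TOY_DELWRI_Q;
}

/* Replay the scenario; returns how many buffers the bulk loader owns. */
int toy_replay(void)
{
    struct toy_list ail = { "AIL", 0 };
    struct toy_list bulkload = { "bulkload", 0 };
    struct toy_buf bp = { 0, NULL };

    toy_delwri_queue(&bp, &ail);        /* logged buffer lands on AIL list */
    toy_stale(&bp);                     /* first repair invalidates it */
    toy_delwri_queue(&bp, &bulkload);   /* second repair queues it again */

    assert(bp.flags & TOY_DELWRI_Q);    /* the flag is set again ... */
    assert(bp.on_list == &ail);         /* ... but it is still on the old list */
    return bulkload.nr;
}
```

In this model toy_replay() returns 0: the second queue call reports
success, yet the bulk loader's own list stays empty, so a submit of
that list would write nothing.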

> To fix this, create a new function that waits for the buffer to be
> removed from any other delwri lists before adding the buffer to the
> caller's delwri list.  By waiting for the buffer to clear both the
> delwri list and any potential delwri wait list, we can be sure that
> repair will initiate writes of all buffers and report all write errors
> back to userspace instead of committing the new structure.

If my understanding above is correct, this just papers over the bug
that a buffer that is marked stale and can be reused for something
else is left on a delwri list.  I haven't entirely thought about all
the consequences, but here is what I'd try:

 - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q
   call your new wait code instead of asserting (probably only
   for the !trylock case)
 - make sure we don't leak DELWRI_Q