The code makes me feel we're fixing the wrong thing, but I have to admit
I don't fully understand it. Let me go step by step.

> While stress-testing online repair of btrees, I noticed periodic
> assertion failures from the buffer cache about buffer readers
> encountering buffers with DELWRI_Q set, even though the btree bulk load
> had already committed and the buffer itself wasn't on any delwri list.

Which assert do these buffer readers hit? The two I can spot that are in
the read path in the broader sense are:

 1) in xfs_buf_find_lock for stale buffers.
 2) in __xfs_buf_submit just before I/O submission

> I traced this to a misunderstanding of how the delwri lists work,
> particularly with regards to the AIL's buffer list. If a buffer is
> logged and committed, the buffer can end up on that AIL buffer list. If
> btree repairs are run twice in rapid succession, it's possible that the
> first repair will invalidate the buffer and free it before the next time
> the AIL wakes up. This clears DELWRI_Q from the buffer state.

Is the "this clears" here xfs_buf_stale, called from xfs_btree_free_block
via xfs_trans_binval?

> If the second repair allocates the same block, it will then recycle the
> buffer to start writing the new btree block.

If my above theory is correct: how do we end up reusing a stale buffer?
If not, what am I misunderstanding above?

> Meanwhile, if the AIL
> wakes up and walks the buffer list, it will ignore the buffer because it
> can't lock it, and go back to sleep.

And I think this is where the trouble starts - we have a buffer that is
left on some delwri list, but with the _XBF_DELWRI_Q flag cleared; it is
stale and we then reuse it. I don't think we need to kick it off the
delwri list just for btree staging, but in general.
>
> When the second repair calls delwri_queue to put the buffer on the
> list of buffers to write before committing the new btree, it will set
> DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> buffer list, it won't add it to the bulkload buffer's list.
>
> This is incorrect, because the bulkload caller relies on delwri_submit
> to ensure that all the buffers have been sent to disk /before/
> committing the new btree root pointer. This ordering requirement is
> required for data consistency.
>
> Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> drop it, so the next thread to walk through the btree will trip over a
> debug assertion on that flag.

Where does it finally drop it?

> To fix this, create a new function that waits for the buffer to be
> removed from any other delwri lists before adding the buffer to the
> caller's delwri list. By waiting for the buffer to clear both the
> delwri list and any potential delwri wait list, we can be sure that
> repair will initiate writes of all buffers and report all write errors
> back to userspace instead of committing the new structure.

If my understanding above is correct, this just papers over the bug that
a buffer that is marked stale, and can be reused for something else, is
left on a delwri list. I haven't entirely thought through all the
consequences, but here is what I'd try:

 - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q set, call
   your new wait code instead of asserting (probably only for the
   !trylock case)
 - make sure we don't leak DELWRI_Q
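
To make the sequence discussed above concrete, here is a toy userspace
model of it. This is only a sketch of the behaviour as described in the
quoted commit message, not kernel code: every name in it (toy_buf,
toy_delwri_queue, toy_stale, toy_run_scenario) is hypothetical, and the
flag and list handling are simplified assumptions rather than the real
XFS implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the real XFS definitions. */
#define TOY_DELWRI_Q	0x1u
#define TOY_STALE	0x2u

struct toy_list {
	struct toy_list *prev, *next;
};

struct toy_buf {
	unsigned int	flags;
	struct toy_list	link;	/* membership in at most one delwri list */
};

struct toy_result {
	bool	still_on_ail_list;
	bool	on_bulkload_list;
	bool	delwri_q_set;
};

static void toy_list_init(struct toy_list *l)
{
	l->prev = l->next = l;
}

static bool toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

static void toy_list_add_tail(struct toy_list *n, struct toy_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Assumed behaviour: set the flag, but only link the buffer if it is unlinked. */
static bool toy_delwri_queue(struct toy_buf *bp, struct toy_list *list)
{
	if (bp->flags & TOY_DELWRI_Q)
		return false;		/* already queued somewhere */
	bp->flags |= TOY_DELWRI_Q;
	if (!toy_list_empty(&bp->link))
		return false;		/* still linked on another list */
	toy_list_add_tail(&bp->link, list);
	return true;
}

/* Assumed behaviour: staling clears the flag but leaves the list linkage. */
static void toy_stale(struct toy_buf *bp)
{
	bp->flags &= ~TOY_DELWRI_Q;
	bp->flags |= TOY_STALE;
}

static struct toy_result toy_run_scenario(void)
{
	struct toy_list ail_list, bulkload_list;
	struct toy_buf bp = { 0 };
	struct toy_result res;
	bool queued;

	toy_list_init(&ail_list);
	toy_list_init(&bulkload_list);
	toy_list_init(&bp.link);

	/* The AIL queues the logged and committed buffer. */
	queued = toy_delwri_queue(&bp, &ail_list);
	assert(queued);
	(void)queued;

	/* The first repair invalidates and frees the block. */
	toy_stale(&bp);

	/* The second repair reuses the buffer and queues it for bulk load. */
	bp.flags &= ~TOY_STALE;
	res.on_bulkload_list = toy_delwri_queue(&bp, &bulkload_list);
	res.still_on_ail_list = !toy_list_empty(&ail_list);
	res.delwri_q_set = (bp.flags & TOY_DELWRI_Q) != 0;
	return res;
}
```

In this model the second queue call sets the flag again but never links
the buffer onto the bulkload list, because the stale step left it linked
on the first list. That is the state I'd like us to prevent in general,
rather than wait out in the btree staging code only.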