Calling folio_end_writeback() inside a spinlock under task context?

Hi,

[SPINLOCK AND END WRITEBACK]

Although folio_end_writeback() can be called in interrupt context (it
checks in_task()), surprisingly it may not be safe to call inside a
spinlock in task context:

For example, the following call chain can sleep:

spin_lock()
folio_end_writeback()
|- folio_end_dropbehind_write()
   |- folio_launder()
      Which can sleep.
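
That means a pattern like the following (hypothetical lock, just to
illustrate the shape of the problem) can end up sleeping in atomic
context whenever the folio is a dropbehind (DONTCACHE) one:

    spin_lock(&subpage->lock);
    /* ... clear our block's writeback bit, see we're the last holder ... */
    folio_end_writeback(folio);    /* may reach folio_launder(), which can sleep */
    spin_unlock(&subpage->lock);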

My question is: can and should we allow calling folio_end_writeback()
inside a spinlock?

[BTRFS' ASYNC EXTENT PROBLEM]

This is again a btrfs-specific behavior: we have to call
folio_end_writeback() inside a spinlock to make sure that only one
thread can clear the writeback flag of a folio.

I know iomap does a pretty good job of preventing an early-finished
writeback from clearing the folio writeback flag while we are still
submitting other blocks inside the folio.

iomap holds one extra writeback count before it starts marking blocks
writeback and submitting them. After all blocks are submitted, it drops
that extra count (and calls folio_end_writeback() if it was the last
holder of the writeback flag).
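
Roughly, that pattern looks like this (simplified pseudo-C with made-up
names, not the actual iomap code, which tracks pending bytes in its
per-folio state):

    /* Bias the pending count so no bio completion can drop it to zero
     * while blocks of this folio are still being submitted. */
    folio_start_writeback(folio);
    atomic_set(&state->pending, 1);

    for_each_dirty_block(folio, block) {
            atomic_inc(&state->pending);
            submit_block_bio(block);    /* end_io also does atomic_dec_and_test() */
    }

    /* Drop the submitter's bias; whoever brings the count to zero
     * (here or in a bio end_io) ends the folio writeback. */
    if (atomic_dec_and_test(&state->pending))
            folio_end_writeback(folio);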

But the iomap solution requires that all blocks inside a folio be
submitted at the same time.

This is not true in btrfs, due to the design of btrfs' async extents,
which can mark blocks for writeback at pretty much any time: we keep the
folios locked and queue them into a workqueue for compression, and only
after the compression is done do we submit them, mark them writeback and
unlock them.

If we do not call folio_end_writeback() while holding the spinlock, but
only check inside the lock whether we are the last holder and then call
folio_end_writeback() outside of it, we can hit the following race where
folio_end_writeback() is called twice on the same folio (a rough code
sketch of the pattern follows the diagram):

     0          32K         64K
     |<- AE 1 ->|<- AE 2 -->|

            Thread A (AE 1)           |            Thread B (AE 2)
--------------------------------------+------------------------------
submit_one_async_extent()             |
|- process_one_folio()                |
     |- subpage_set_writeback()       |
                                      |
/* IO finished */                     |
end_compressed_writeback()            |
|- btrfs_folio_clear_writeback()      |
     |- spin_lock()                   |
     |  /* this is the last writeback |
     |     holder, should end the     |
     |     folio writeback flag */    |
     |- last = true                   |
     |- spin_unlock()                 |
     |                                | submit_one_async_extent()
     |                                | |- process_one_folio()
     |                                |    |- subpage_set_writeback()
     |                                |
     |                                | /* IO finished */
     |                                | end_compressed_writeback()
     |                                | |- btrfs_folio_clear_writeback()
     |                                |    | /* Again the last holder */
     |                                |    |- spin_lock()
     |                                |    |- last = true
     |                                |    |- spin_unlock()
     |- folio_end_writeback()         |    |- folio_end_writeback()
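
In pseudo-C, that check-then-end-outside pattern looks roughly like this
(field and helper names are made up):

    spin_lock_irqsave(&subpage->lock, flags);
    clear_bit(bit, subpage->writeback_bitmap);
    /* Are we the last block holding the folio writeback flag? */
    last = bitmap_empty(subpage->writeback_bitmap, blocks_per_folio);
    spin_unlock_irqrestore(&subpage->lock, flags);

    /*
     * Window: another async extent can set and then clear its own
     * writeback bits right here, also see last == true, and race with
     * us on the call below.
     */
    if (last)
            folio_end_writeback(folio);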

I know the most proper solution would be to get rid of the delayed
unlock and submission, mark blocks for writeback without going through
the async extent mechanism at all, and then follow iomap's solution.

But that will be a huge change (although we will go down that path
eventually); meanwhile the folio_end_writeback()-inside-a-spinlock
problem needs to be fixed ASAP.

So my question is: can we allow folio_end_writeback() to be called
inside a spinlock, at the cost of screwing up the folio reclaim for
DONTCACHE?

Thanks,
Qu




