Re: Calling folio_end_writeback() inside a spinlock under task context?

On 2025/3/3 15:22, Matthew Wilcox wrote:
Adding Jens to the cc.  As you well know, he added this code, so I'm
mystified why you didn't cc him.  Also adding linux-fsdevel (I presume
this was a mistake and you inadvertently cc'd f2fs-devel instead.)

On Mon, Mar 03, 2025 at 10:34:26AM +1030, Qu Wenruo wrote:
[SPINLOCK AND END WRITEBACK]

Although folio_end_writeback() can be called from interrupt context
(judging by its in_task() check), surprisingly it may not be safe to
call inside a spinlock (in task context):

It's poor practice to do that; you're introducing a dependency between
your fs lock and the i_mapping lock, which is already pretty intertwined
with ... every other lock in the system.

For example, the following call chain can sleep:

spin_lock()
folio_end_writeback()
|- folio_end_dropbehind_write()
    |- folio_launder()
       Which can sleep.

My question is, can and should we allow folio_end_writeback() inside a
spinlock?

[BTRFS' ASYNC EXTENT PROBLEM]

This is again a btrfs specific behavior, that we have to call
folio_end_writeback() inside a spinlock to make sure really only one
thread can clear the writeback flag of a folio.

I know iomap does a pretty good job of preventing an early-finished
writeback from clearing the folio writeback flag while we're still
submitting other blocks inside the folio.

iomap holds one extra writeback count before it starts marking blocks
writeback and submitting them.
After all blocks are submitted, it drops that extra count (and calls
folio_end_writeback() if it was the last holder of the writeback flag).

But the iomap solution requires that all blocks inside a folio be
submitted at the same time.

I actually don't like the iomap solution as it's currently written; it
just hasn't risen high enough up my todo list to fix it.

What we should do is initialise folio->ifs->write_bytes_pending to
bitmap_weight(ifs->state, blocks_per_folio) * block_size in
iomap_writepage_map() [this is oversimplified; we'd need to handle eof
and so on too]

So you mean setting every block writeback at the beginning (I know iomap
doesn't track per-block writeback), clearing the per-block writeback
flags for holes/eof etc., and letting the to-be-submitted blocks
continue their endio?

That indeed solves the problem without the extra count.

I can definitely go with that solution in btrfs first.


That would address your problem as well as do fewer atomic operations.

This is not true in btrfs, due to the design of btrfs' async extents,
which can mark blocks for writeback at almost any time: we keep the
folios locked and queue them into a workqueue for compression, and only
after the compression is done do we submit them, mark them writeback
and unlock them.

If we do not call folio_end_writeback() under the spinlock, but only
check whether we're the last holder inside the spinlock and call
folio_end_writeback() outside of it, we can hit the following race where
folio_end_writeback() is called twice on the same folio:

      0          32K         64K
      |<- AE 1 ->|<- AE 2 ->|

             Thread A (AE 1)           |            Thread B (AE 2)
--------------------------------------+------------------------------
submit_one_async_extent()             |
|- process_one_folio()                |
      |- subpage_set_writeback()       |
                                       |
/* IO finished */                     |
end_compressed_writeback()            |
|- btrfs_folio_clear_writeback()      |
      |- spin_lock()                   |
      |  /* this is the last writeback |
      |     holder, should end the     |
      |     folio writeback flag */    |
      |- last = true                   |
      |- spin_unlock()                 |
      |                                | submit_one_async_extent()
      |                                | |- process_one_folio()
      |                                |    |- subpage_set_writeback()

This seems weird.  Why are you setting the "subpage" writeback bit
while the folio writeback bit is still set?  That smells like a
bug-in-waiting if not an actual bug to me.  Surely it should wait on
the folio writeback bit to clear?

I considered waiting for the writeback to finish before setting it again.
But it can still race, between the folio_wait_writeback() and the next
folio_start_writeback() call.

Where another async extent can start setting the flag, and we hit the
same problem again.

The key problem is still the async extent behavior, which I have to say
is way too problematic, and doesn't take the fs block size < page size
case into consideration at all.


      |                                | /* IO finished */
      |                                | end_compressed_writeback()
      |                                | |- btrfs_folio_clear_writeback()
      |                                |    | /* Again the last holder */
      |                                |    |- spin_lock()
      |                                |    |- last = true
      |                                |    |- spin_unlock()
      |- folio_end_writeback()         |    |- folio_end_writeback()

I know the most proper solution would be to get rid of the delayed
unlock and submission completely, mark blocks for writeback without the
async extent mechanism, and then follow iomap's solution.

But that will be a huge change (although we will go that way
eventually); meanwhile, the folio_end_writeback() inside a spinlock
needs to be fixed ASAP.

I'd suggest the asap solution is for btrfs to mark itself as not
supporting dropbehind.

Thanks a lot for this solution; it indeed solves the problem by
completely avoiding folio release inside folio_end_writeback().

That will be the hotfix, and it will be pushed for backports.

Just wondering, what else will be affected other than DONTCACHE writes?



I also have a middle-ground solution: disable async extents for ranges
which are not page aligned (so there is no more than one async extent
per folio, thus no race), then use your improved writeback flag tracking
and move the folio_end_writeback() out of the spinlock. But that's not a
good candidate for backport at all...

Thanks,
Qu


So my question is: can we allow folio_end_writeback() to be called
inside a spinlock, at the cost of screwing up folio reclaim for
DONTCACHE?

Thanks,
Qu






