Adding Jens to the cc.  As you well know, he added this code, so I'm
mystified why you didn't cc him.  Also adding linux-fsdevel (I presume
this was a mistake and you inadvertently cc'd f2fs-devel instead.)

On Mon, Mar 03, 2025 at 10:34:26AM +1030, Qu Wenruo wrote:
> [SPINLOCK AND END WRITEBACK]
>
> Although folio_end_writeback() can be called in interrupt context
> (per the in_task() check), surprisingly it may not be suitable to be
> called inside a spinlock (in task context):

It's poor practice to do that; you're introducing a dependency between
your fs lock and the i_mapping lock, which is already pretty intertwined
with ... every other lock in the system.

> For example the following call chain can lead to sleep:
>
> spin_lock()
> folio_end_writeback()
> |- folio_end_dropbehind_write()
>    |- folio_launder()
>       Which can sleep.
>
> My question is, can and should we allow folio_end_writeback() inside a
> spinlock?
>
> [BTRFS' ASYNC EXTENT PROBLEM]
>
> This is again a btrfs-specific behavior: we have to call
> folio_end_writeback() inside a spinlock to make sure that really only
> one thread can clear the writeback flag of a folio.
>
> I know iomap does a pretty good job of preventing an early-finished
> writeback from clearing the folio writeback flag while we're still
> submitting other blocks inside the folio.
>
> iomap holds one extra writeback count before it starts marking blocks
> writeback and submitting them.
> After all blocks are submitted, it drops that count (and calls
> folio_end_writeback() if it's the last one holding the writeback flag).
>
> But the iomap solution requires that all blocks inside a folio be
> submitted at the same time.

I actually don't like the iomap solution as it's currently written; it
just hasn't risen high enough up my todo list to fix it.  What we should
do is initialise folio->ifs->write_bytes_pending to
bitmap_weight(ifs->state, blocks_per_folio) * block_size in
iomap_writepage_map() [this is oversimplified; we'd need to handle eof
and so on too].  That would address your problem as well as do fewer
atomic operations.
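
As a rough, untested sketch of that idea (this would sit next to the ifs
helpers in fs/iomap/buffered-io.c where struct iomap_folio_state is
visible; the helper name is made up, and it hand-waves the eof and
dirty-block accounting exactly as noted above):

	/*
	 * Seed write_bytes_pending with the bytes about to be submitted
	 * for this folio, instead of taking a +1 bias at submission time
	 * and doing an atomic add per block.  Completion can then only
	 * drop the count to zero (and call folio_end_writeback()) once
	 * every block has actually finished, no matter how early the
	 * individual bios complete.
	 */
	static void ifs_seed_write_bytes_pending(struct inode *inode,
						 struct folio *folio)
	{
		struct iomap_folio_state *ifs = folio->private;
		unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);

		/*
		 * Oversimplified on purpose: no eof handling, and the
		 * real thing would count only the blocks that are going
		 * to be written rather than the whole state bitmap.
		 */
		atomic_set(&ifs->write_bytes_pending,
			   bitmap_weight(ifs->state, blks_per_folio) <<
				inode->i_blkbits);
	}

The per-block completion path could then stay a plain sub-and-test, with
no submission-time bias to pair it against.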

> This is not true in btrfs, due to the design of btrfs' async extents,
> which can mark blocks for writeback at really any time: we keep the
> folios locked and queue them into a workqueue to do compression, and
> only after the compression is done do we submit them, mark them
> writeback and unlock.
>
> If we do not call folio_end_writeback() while holding the spinlock,
> but only check whether we're the last holder and call
> folio_end_writeback() outside of the spinlock, we can hit the
> following race where folio_end_writeback() is called twice on the
> same folio:
>
>      0                     32K                    64K
>      |<-       AE 1       ->|<-       AE 2       ->|
>
>  Thread A (AE 1)                      | Thread B (AE 2)
> --------------------------------------+------------------------------
>  submit_one_async_extent()            |
>  |- process_one_folio()               |
>     |- subpage_set_writeback()        |
>                                       |
>  /* IO finished */                    |
>  end_compressed_writeback()           |
>  |- btrfs_folio_clear_writeback()     |
>     |- spin_lock()                    |
>     |  /* this is the last writeback  |
>     |     holder, should end the      |
>     |     folio writeback flag */     |
>     |- last = true                    |
>     |- spin_unlock()                  |
>     |                                 | submit_one_async_extent()
>     |                                 | |- process_one_folio()
>     |                                 |    |- subpage_set_writeback()

This seems weird.  Why are you setting the "subpage" writeback bit while
the folio writeback bit is still set?  That smells like a bug-in-waiting
if not an actual bug to me.  Surely it should wait on the folio
writeback bit to clear?

>     |                                 | /* IO finished */
>     |                                 | end_compressed_writeback()
>     |                                 | |- btrfs_folio_clear_writeback()
>     |                                 |    /* Again the last holder */
>     |                                 |    |- spin_lock()
>     |                                 |    |- last = true
>     |                                 |    |- spin_unlock()
>  |- folio_end_writeback()             | |- folio_end_writeback()
>
> I know the most proper solution would be to get rid of the delayed
> unlock and submission, mark blocks for writeback without the async
> extent mechanism entirely, and then follow iomap's solution.
>
> But that will be a huge change (although we will go that path
> eventually); meanwhile the folio_end_writeback() inside a spinlock
> needs to be fixed asap.

I'd suggest the asap solution is for btrfs to mark itself as not
supporting dropbehind.

> So my question is, can we allow folio_end_writeback() to be called
> inside a spinlock, at the cost of screwing up the folio reclaim for
> DONTCACHE?
>
> Thanks,
> Qu
>
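
To be explicit about the dropbehind opt-out: dropbehind folios only come
from RWF_DONTCACHE I/O, and the VFS already gates that per filesystem on
FOP_DONTCACHE in ->fop_flags.  Roughly (this is from memory of how
kiocb_set_rw_flags() is wired up, not a verbatim quote of it):

	/*
	 * Paraphrased gate, not the exact kernel code: unless the
	 * filesystem advertises FOP_DONTCACHE, RWF_DONTCACHE is
	 * rejected up front, no folio of that filesystem is ever
	 * marked dropbehind, and folio_end_writeback() cannot wander
	 * into folio_end_dropbehind_write()/folio_launder().
	 */
	if (flags & RWF_DONTCACHE) {
		if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
			return -EOPNOTSUPP;
		kiocb_flags |= IOCB_DONTCACHE;
	}

So as long as btrfs doesn't set (or drops) FOP_DONTCACHE, the DONTCACHE
reclaim you're worried about screwing up can't happen for btrfs folios
in the first place.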