Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2023/3/1 13:51, Gao Xiang wrote:


On 2023/3/1 13:42, Matthew Wilcox wrote:
On Wed, Mar 01, 2023 at 01:09:34PM +0800, Gao Xiang wrote:
On 2023/3/1 13:01, Matthew Wilcox wrote:
On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
The only problem is that the readahead code doesn't tell the filesystem
whether the request is sync or async.  This should be a simple matter
of adding a new 'bool async' to the readahead_control and then setting
REQ_RAHEAD based on that, rather than on whether the request came in
through readahead() or read_folio() (eg see mpage_readahead()).

Great!  In addition to that, just (somewhat) off topic, if we have a
"bool async" now, I think it will immediately have some users (such as
EROFS), since we'd like to do post-processing (such as decompression)
immediately in the same context with sync readahead (due to missing
pages) and leave it to another kworker for async readahead (I think
it's almost same for decryption and verification).

So "bool async" is quite useful on my side if it could be possible
passed to fs side.  I'd like to raise my hands to have it.

That's a really interesting use-case; thanks for bringing it up.

Ideally, we'd have the waiting task do the
decompression/decryption/verification for proper accounting of CPU.
Unfortunately, if the folio isn't uptodate, the task doesn't even hold
a reference to the folio while it waits, so there's no way to wake the
task and let it know that it has work to do.  At least not at the moment
... let me think about that a bit (and if you see a way to do it, feel
free to propose it)

Honestly, I'd like to take the folio lock until all post-processing is
done and make it uptodate and unlock so that only we need is to pass
locked-folios requests to kworkers for async way or sync handling in
the original context.

If we unlocked these folios in advance without uptodate, which means
we have to lock it again (which could have more lock contention) and
need to have a way to trace I/Oed but not post-processed stuff in
addition to no I/Oed stuff.

Right, look at how it's handled right now ...

sys_read() ends up in filemap_get_pages() which (assuming no folio in
cache) calls page_cache_sync_readahead().  That creates locked, !uptodate
folios and asks the filesystem to fill them.  Unless that completes
incredibly quickly, filemap_get_pages() ends up in filemap_update_page()
which calls folio_put_wait_locked().

If the filesystem BIO completion routine could identify if there was
a task waiting and select one of them, it could wake up the waiter and
pass it a description of what work it needed to do (with the folio still
locked), rather than do the postprocessing itself and unlock the folio

Currently EROFS sync decompression is waiting in .readahead() with locked
page cache folios, one "completion" together than BIO descriptor
(bi_private) in the original context, so that the filesystem BIO completion
just needs to complete the completion and wakeup the original context
(due to missing pages, so the original context will need the page data
immediately as well) to go on .readhead() and unlock folios.

Does this way have some flew? Or I'm missing something?

In this way, EROFS sync decompression is just all handled with a completion
in .readahead() and mark uptodate & unlock folios before out of .readahead(),
so filemap_update_page() will (almost) always succeed in folio_test_uptodate()
before filemap_update_page():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/zdata.c?h=v6.2#n1167
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/zdata.c?h=v6.2#n1167

So I think core-MM just needs to export "bool async" for fses...

Thanks,
Gao Xiang


Thanks,
Gao Xiang


But that all seems _very_ hard to do with 100% reliability.  Note the
comment in folio_wait_bit_common() which points out that the waiters
bit may be set even when there are no waiters.  The wake_up code
doesn't seem to support this kind of thing (all waiters are
non-exclusive, but only wake up one of them).



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux