Re: [LSF/MM/BPF TOPIC] Memory folios

On 5/26/21 11:07 PM, Keith Busch wrote:
> On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
>> On Mon, May 10, 2021 at 06:56:17PM +0100, Matthew Wilcox wrote:
>>> I don't know exactly how much will be left to discuss about supporting
>>> larger memory allocation units in the page cache by December.  In my
>>> ideal world, all the patches I've submitted so far are accepted, I
>>> persuade every filesystem maintainer to convert their own filesystem
>>> and struct page is nothing but a bad memory by December.  In reality,
>>> I'm just not that persuasive.
>>>
>>> So, probably some kind of discussion will be worthwhile about
>>> converting the remaining filesystems to use folios, when it's worth
>>> having filesystems opt-in to multi-page folios, what we can do about
>>> buffer-head based filesystems, and so on.
>>>
>>> Hopefully we aren't still discussing whether folios are a good idea
>>> or not by then.
>>
>> I got an email from Hannes today asking about memory folios as they
>> pertain to the block layer, and I thought this would be a good chance
>> to talk about them.  If you're not familiar with the term "folio",
>> https://lore.kernel.org/lkml/20210505150628.111735-10-willy@xxxxxxxxxxxxx/
>> is not a bad introduction.
>>
>> Thanks to the work done by Ming Lei in 2017, the block layer already
>> supports multipage bvecs, so to a first order of approximation, I don't
>> need anything from the block layer on down through the various storage
>> layers.  Which is why I haven't been talking to anyone in storage!
>>
>> It might change (slightly) the contents of bios.  For example,
>> bvec[n]->bv_offset might now be larger than PAGE_SIZE.  Drivers should
>> handle this OK, but probably haven't been audited to make sure they do.
>> Mostly, it's simply that drivers will now see fewer, larger segments
>> in their bios.  Once a filesystem supports multipage folios, we will
>> allocate order-N pages as part of readahead (and sufficiently large
>> writes).  Dirtiness is tracked on a per-folio basis (not per page),
>> so folios take trips around the LRU as a single unit and finally make
>> it to being written back as a single unit.
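
For illustration, a minimal sketch of what that looks like from a driver's
side, using the existing bio_for_each_segment() and bio_for_each_bvec()
iterators; the function itself is hypothetical, not code from the series:

#include <linux/bio.h>
#include <linux/printk.h>

/* Hypothetical example: walk the same bio with both iterators. */
static void example_count_segments(struct bio *bio)
{
	struct bvec_iter iter;
	struct bio_vec bv;
	unsigned int pages = 0, bvecs = 0;

	/* Single-page iterator: each segment is clipped to one page,
	 * so bv.bv_offset + bv.bv_len never exceeds PAGE_SIZE. */
	bio_for_each_segment(bv, bio, iter)
		pages++;

	/* Multipage iterator: one bvec can cover a whole order-N folio,
	 * so bv.bv_len (and even bv.bv_offset) may exceed PAGE_SIZE. */
	bio_for_each_bvec(bv, bio, iter)
		bvecs++;

	pr_info("%u single-page segments packed into %u bvecs\n", pages, bvecs);
}
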
>>
>> Drivers still need to cope with sub-folio-sized reads and writes.
>> O_DIRECT still exists and (eg) doing a sub-page, block-aligned write
>> will not necessarily cause readaround to happen.  Filesystems may read
>> and write their own metadata at whatever granularity and alignment they
>> see fit.  But the vast majority of pagecache I/O will be folio-sized
>> and folio-aligned.
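
To make the sub-folio case concrete: a 512-byte, block-aligned O_DIRECT
write like the sketch below still reaches the driver at that size, however
large the folios backing the file's page cache are.  The path and the
512-byte block size are assumptions; use the device's actual logical block
size.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	ssize_t ret;
	int fd;

	/* O_DIRECT buffers must be aligned to the logical block size;
	 * 512 bytes is assumed here. */
	if (posix_memalign(&buf, 512, 512))
		return 1;
	memset(buf, 0xab, 512);

	/* Hypothetical test file on an O_DIRECT-capable filesystem. */
	fd = open("/mnt/test/file", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* A single sub-page, block-aligned write: no folio-sized I/O here. */
	ret = pwrite(fd, buf, 512, 0);

	close(fd);
	free(buf);
	return ret == 512 ? 0 : 1;
}
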
>>
>> I do have two small patches which make it easier for the one
>> filesystem that I've converted so far (iomap/xfs) to add folios to bios
>> and get folios back out of bios:
>>
>> https://lore.kernel.org/lkml/20210505150628.111735-72-willy@xxxxxxxxxxxxx/
>> https://lore.kernel.org/lkml/20210505150628.111735-73-willy@xxxxxxxxxxxxx/
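
Assuming those two patches are along the lines of the bio_add_folio() and
bio_for_each_folio_all() helpers, usage from a filesystem might look roughly
like the sketch below (not code from the patches; bio_alloc() is the
two-argument form current at the time, and the function names are made up):

#include <linux/bio.h>

/* Sketch: submit one folio as a write, assuming a bio_add_folio()
 * helper with (bio, folio, len, off) semantics. */
static void example_submit_folio(struct folio *folio,
				 struct block_device *bdev, sector_t sector)
{
	struct bio *bio = bio_alloc(GFP_NOFS, 1);

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_WRITE;

	/* One call covers the whole folio, whatever its order. */
	if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
		bio_put(bio);	/* would need a second bio; elided */
		return;
	}
	submit_bio(bio);
}

/* Sketch: completion walks the bio folio-by-folio instead of
 * page-by-page, assuming a bio_for_each_folio_all() iterator. */
static void example_write_end_io(struct bio *bio)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio)
		folio_end_writeback(fi.folio);
	bio_put(bio);
}
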
>>
>> as well as a third patch that estimates how large a bio to allocate,
>> given the current folio that it's working on:
>> https://git.infradead.org/users/willy/pagecache.git/commitdiff/89541b126a59dc7319ad618767e2d880fcadd6c2
>>
>> It would be possible to make other changes in future.  For example, if
>> we decide it'd be better, we could change bvecs from being (page, offset,
>> length) to (folio, offset, length).  I don't know that it's worth doing;
>> it would need to be evaluated on its merits.  Personally, I'd rather
>> see us move to a (phys_addr, length) pair, but I'm a little busy at the
>> moment.
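
Purely as hypothetical sketches (neither type exists), the two alternatives
next to today's struct bio_vec would be something like:

#include <linux/types.h>

/* Today's bio_vec, for reference:
 * { struct page *bv_page; unsigned int bv_len; unsigned int bv_offset; } */

/* Hypothetical folio-based variant: (folio, offset, length). */
struct folio_vec {
	struct folio	*fv_folio;
	unsigned int	fv_len;
	unsigned int	fv_offset;
};

/* Hypothetical physical-address variant: (phys_addr, length);
 * drops the struct page/folio backpointer entirely. */
struct phys_vec {
	phys_addr_t	pv_addr;
	unsigned int	pv_len;
};
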
>>
>> Hannes has some fun ideas about using the folio work to support larger
>> sector sizes, and I think they're doable.
> 
> I'm also interested in this, and was looking into the exact same thing
> recently. Some of the very high capacity SSDs could really benefit
> from better large sector support. If this is a topic for the conference,
> I would like to attend this session.
> 
And, of course, so would I :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@xxxxxxx			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)


