Re: folio_map

On Wed, Aug 17, 2022 at 01:52:33PM -0700, Ira Weiny wrote:
> On Wed, Aug 17, 2022 at 01:23:41PM -0700, Ira wrote:
> > On Wed, Aug 17, 2022 at 08:38:52PM +0100, Matthew Wilcox wrote:
> > > On Wed, Aug 17, 2022 at 01:29:35PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > > > > Some of you will already know all this, but I'll go into a certain amount
> > > > > of detail for the peanut gallery.
> > > > > 
> > > > > One of the problems that people want to solve with multi-page folios
> > > > > is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> > > > > already exist; you can happily create a 64kB block size filesystem on
> > > > > a PPC/ARM/... today, then fail to mount it on an x86 machine.
> > > > > 
> > > > > kmap_local_folio() only lets you map a single page from a folio.
> > > > > This works for the majority of cases (eg ->write_begin() works on a
> > > > > per-page basis *anyway*, so we can just map a single page from the folio).
> > > > > But this is somewhat hampering for ext2_get_page(), used for directory
> > > > > handling.  A directory record may cross a page boundary (because it
> > > > > wasn't a page boundary on the machine which created the filesystem),
> > > > > and juggling two pages being mapped at once is tricky with the stack
> > > > > model for kmap_local.
> > > > > 
> > > > > I don't particularly want to invest heavily in optimising for HIGHMEM.
> > > > > The number of machines which will use multi-page folios and HIGHMEM is
> > > > > not going to be large, one hopes, as 64-bit kernels are far more common.
> > > > > I'm happy for 32-bit to be slow, as long as it works.
> > > > > 
> > > > > For these reasons, I'm proposing the logical equivalent to this:
> > > > > 
> > > > > +void *folio_map_local(struct folio *folio)
> > > > > +{
> > > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > > +               return folio_address(folio);
> > > > > +       if (!folio_test_large(folio))
> > > > > +               return kmap_local_page(&folio->page);
> > > > > +       return vmap_folio(folio);
> > > > > +}
> > > > > +
> > > > > +void folio_unmap_local(const void *addr)
> > > > > +{
> > > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > > +               return;
> > > > > +       if (is_vmalloc_addr(addr))
> > > > > +               vunmap(addr);
> > > > > +       else
> > > > > +               kunmap_local(addr);
> > > > > +}
> > > > > 
> > > > > (where vmap_folio() is a new function that works a lot like vmap(),
> > > > > chunks of this get moved out-of-line, etc., but that's the concept)
> > > > 
> > > > So it aims at replacing kmap_local_page(), but for folios, right?
> > > > The kmap_local_page() interface can be used from any context, but the
> > > > vmap helpers might_sleep(). How do we reconcile this?
> > > 
> > > I'm not proposing getting rid of kmap_local_folio().  That should still
> > > exist and work for users who need to use it in atomic context.  Indeed,
> > > I'm intending to put a note in the doc for folio_map_local() suggesting
> > > that users may prefer to use kmap_local_folio().  Good idea to put a
> > > might_sleep() in folio_map_local() though.
> > 
> > There is also a semantic mismatch WRT the unmapping order.  But I think
> > Kirill brings up a bigger issue.

I don't see the semantic mismatch?

> > How many folios do you think will need to be mapped at a time?  And is there
> > any practical limit on their size?  Are 64kB blocks a reasonable upper bound
> > until highmem can be deprecated completely?
> > 
> > I say this because I'm not sure that mapping a 64kB block would always fail.
> > These mappings are transitory.  How often will a filesystem be mapping more
> > than two folios at once?
> 
> I did the math wrong but I think my idea can still work.

The thing is that kmap_local_page() can be called from interrupt context
(how often is it?  no idea).  So you map two 64kB folios (at 16 entries
each) and that consumes 32 entries for this CPU; now you take an interrupt
and that's 33.  I don't know how deep that nesting goes: can we have some
pages mapped from process context, some mapped in softirq, and then another
interrupt causing more to be mapped in hardirq?  I don't really want to
find out, so I'd rather always punt to vmap() for multi-page folios.
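
To make that concrete, a first cut of vmap_folio() could be as simple as
the sketch below (untested, illustrative only): build a pages[] array
covering the folio and hand it to vmap().

static void *vmap_folio(struct folio *folio)
{
        struct page **pages;
        long i, nr = folio_nr_pages(folio);
        void *addr;

        /* Collect every page of the folio into an array for vmap(). */
        pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;
        for (i = 0; i < nr; i++)
                pages[i] = folio_page(folio, i);

        /* One contiguous kernel virtual mapping covering the whole folio. */
        addr = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
        kfree(pages);
        return addr;
}

Note that vunmap() can't be used from atomic context either, which is one
more reason this whole interface has to be sleepable.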

Is there a reason you want to make folio_map_local() more efficient
on HIGHMEM systems?
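
Either way, the caller side I have in mind looks roughly like this (purely
illustrative; the function name and context are made up):

/* Illustrative only; the function name and context are made up. */
static int ext2_check_dir_folio(struct folio *folio)
{
        void *kaddr;

        /* Sleepable context only: a multi-page folio may go through vmap(). */
        kaddr = folio_map_local(folio);
        if (!kaddr)
                return -ENOMEM;

        /*
         * folio_size(folio) bytes are now contiguous at kaddr, so a
         * directory record straddling a page boundary can be read
         * directly instead of juggling two kmap_local slots.
         */

        folio_unmap_local(kaddr);
        return 0;
}

One behavioural difference from kmap_local_folio(): the vmap() path can
fail, so callers have to cope with a NULL return.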

> > 
> > In our conversions, most of the time two pages are mapped at once: a
> > source and a destination.
> > 
> > That said, to help ensure that mapping a full folio never fails, we could
> > increase the number of pages supported by kmap_local_page().  At first I was
> > not a fan, but that would only be a penalty for HIGHMEM systems.  And as we
> > are not optimizing for such systems, I'm not sure I see a downside to
> > increasing the limit to 32 or even 64.  I'm also inclined to believe that
> > HIGHMEM systems have smaller core counts, so I don't think this is likely to
> > multiply the space wasted by much.
> > 
> > Would doubling the support within kmap_local_page() be enough?
> > 
> > A final idea would be to hide the increase behind a 'support large block
> > size filesystems' config option for HIGHMEM systems.  But I'm really not
> > sure that is even needed.
> > 
> > Ira
> > 


