Re: [PATCH v11 09/12] mm: implement LUF (Lazy Unmap Flush) deferring tlb flush when folios get unmapped

On Tue, Jun 11, 2024 at 01:55:05PM +0200, Michal Hocko wrote:
> On Tue 11-06-24 09:55:23, Byungchul Park wrote:
> > On Mon, Jun 10, 2024 at 03:23:49PM +0200, Michal Hocko wrote:
> > > On Tue 04-06-24 09:34:48, Byungchul Park wrote:
> > > > On Mon, Jun 03, 2024 at 06:01:05PM +0100, Matthew Wilcox wrote:
> > > > > On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> > > > > > Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> > > > > >  Presumably some xa_value() that means a reader has to go do a
> > > > > > luf_flush() before going any farther.
> > > > > 
> > > > > I can allocate one for that.  We've got something like 1000 currently
> > > > > unused values which can't be mistaken for anything else.
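
Just to confirm I understand the reader side of that idea, it would
conceptually look like the sketch below.  LUF_XA_MARKER and the wrapper
function are made-up names for illustration only; xa_is_value(),
xa_to_value() and luf_flush() are the pieces already mentioned above.

   /*
    * Illustration only -- not the actual implementation.  A page cache
    * reader that finds the marker entry must flush before acting on
    * the (seemingly) empty slot.
    */
   static void *luf_aware_filemap_get(struct address_space *mapping,
                                      pgoff_t index)
   {
           void *entry = filemap_get_entry(mapping, index);

           if (xa_is_value(entry) &&
               xa_to_value(entry) == LUF_XA_MARKER) {
                   /*
                    * A mapping was torn down here with its tlb flush
                    * deferred.  Flush first, then treat the slot as
                    * empty.
                    */
                   luf_flush();
                   entry = NULL;
           }

           return entry;   /* folio, shadow entry or NULL, as usual */
   }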
> > > > > 
> > > > > > That would actually have a chance at fixing two issues:  One where a new
> > > > > > page cache insertion is attempted.  The other where someone goes to look
> > > > > > in the page cache and takes some action _because_ it is empty (I think
> > > > > > NFS is doing some of this for file locks).
> > > > > > 
> > > > > > LUF is also pretty fundamentally built on the idea that files can't
> > > > > > change without LUF being aware.  That model seems to work decently for
> > > > > > normal old filesystems on normal old local block devices.  I'm worried
> > > > > > about NFS, and I don't know how seriously folks take FUSE, but it
> > > > > > obviously can't work well for FUSE.
> > > > > 
> > > > > I'm more concerned with:
> > > > > 
> > > > >  - page goes back to buddy
> > > > >  - page is allocated to slab
> > > > 
> > > > At this point, the needed tlb flush will be performed in prep_new_page().
> > > 
> > > But that does mean that an unaware caller would incur the additional
> > > overhead of the flushing, right? I think it would be just a matter of
> > 
> > pcp, which exists for locality, is already a better source of side
> > channel attacks.  FYI, the tlb flush is rarely performed, and only when
> > a pending tlb flush exists.
> 
> Right, but rare and hard-to-predict latencies are much worse than
> consistent ones.

No doubt it'd be best if we kept things as consistent as possible.  What
matters is how consistent *we require* it to be.  Let me know the
criteria for that, if any, and I will check against them.

> > > time before somebody can turn that into a side channel attack, not to
> > > mention the unexpected latencies introduced.
> > 
> > Nope.  The pending tlb flush performed in prep_new_page() is one that
> > would've been done already with the vanilla kernel.  It's not an
> > additional tlb flush; it's a subset of all the skipped ones.
> 
> But those skipped ones could have happened in a completely different
> context (e.g. a different process or even a different security domain),
> right?

Right.

> > It's worth noting that all the existing mm reclaim mechanisms have
> > already introduced worse unexpected latencies.
> 
> Right, but reclaim, especially direct reclaim, is expected to be slow.
> It is much different to see latency spikes on a system with a lot of
> memory.

Are you talking about an RT system?  An RT system should prevent its
memory from being reclaimed in the first place, IMHO, since reclaim adds
unexpected latencies.

Reclaim and migration already introduce unexpected latencies themselves.
Why do only the latencies added by luf matter?  I'm asking to understand
what you mean, so that I can fix luf if needed.

   vanilla
   -------
   alloc_page() {
      ...
      preempted by kswapd or direct reclaim {
         ...
         reclaim
            unmap file pages
            tlb shootdown
         ...
         migration
            unmap pages
            tlb shootdown
         ...
      }
      ...
      interrupted by tlb shootdown from other CPUs {
         ...
      }
      ...
      prep_new_page() {
         ...
      }
   }

   with luf
   --------
   alloc_page() {
      ...
      preempted by kswapd or direct reclaim {
         ...
         reclaim
            unmap file pages
            (skip tlb shootdown)
         ...
         migration
            unmap pages
            (skip tlb shootdown)
         ...
      }
      ...
      interrupted by tlb shootdown from other CPUs {
         ...
      }
      ...
      prep_new_page() {
         ...
         /*
          * This can be the tlb shootdown skipped in this context or in
          * other contexts.
          */
         tlb shootdown with a much smaller cpumask
         ...
      }
   }
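
To be concrete, the check added in prep_new_page() is conceptually no
more than the sketch below.  The helper names are made up for
illustration and differ from the actual ones in the series:

   /*
    * Conceptual sketch only -- helper names are made up; the real code
    * in the series differs in detail.
    */
   static inline void luf_flush_if_pending(struct page *page)
   {
           /* Common case: no tlb flush was deferred for this page. */
           if (!luf_flush_pending(page))
                   return;

           /*
            * Only the CPUs that might still cache a stale translation
            * of the old mapping need the shootdown, so this cpumask is
            * usually much smaller than the one the vanilla kernel
            * would have used at unmap time.
            */
           luf_shootdown(luf_pending_cpumask(page));
   }

prep_new_page() would call it right before handing the page out, so the
cost is paid only by an allocation that actually receives a page whose
flush was skipped.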

I really want to understand why only the latencies introduced by luf
matter.  Why don't the latencies already present in the vanilla kernel
matter?

	Byungchul

> -- 
> Michal Hocko
> SUSE Labs



