Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

Jason Gunthorpe <jgg@xxxxxxxx> · Fri, 16 Jun 2023 09:35:41 -0300

On Thu, Jun 15, 2023 at 02:09:30PM -0700, Hugh Dickins wrote:
> On Thu, 15 Jun 2023, Jason Gunthorpe wrote:
> > On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:
> > 
> > > I guess the best thing would be to modify kernel/fork.c to allow the
> > > architecture to override free_mm(), and arch/s390 call_rcu to free mm.
> > > But as a quick and dirty s390-end workaround, how about:
> > 
> > RCU callbacks are not ordered so that doesn't seem like it helps..
> 
> Thanks, that's an interesting and important point, which I need to knock
> into my head better.
> 
> But can you show me where that's handled in the existing mm/mmu_gather.c
> include/asm-generic/tlb.h framework?  I don't see any rcu_barrier()s
> there, yet don't the pmd_huge_pte pointers point into pud page tables
> freed shortly afterwards also by RCU?

I don't know anything about the pmd_huge_pte stuff.. I was expecting
it got cleaned up explicitly before things reached the call_rcu? Where is it
touched from a call_rcu callback?

> > Making the page frag pool global (per-cpu global I guess) would also
> > remove the need to reach back to the freeable mm_struct and reduce the
> > need for struct page memory. This views it as a special kind of
> > kmemcache.
> 
> I haven't thought in that direction at all.  Hmm.  Or did I think of
> it once, but discarded for accounting reasons - IIRC (haven't rechecked)
> page table pages are charged to memcg, and counted for meminfo and other(?)
> purposes: if the fragments are all lumped into a global pool, we
> lose that.

You'd have to search the free list for fragments that match the
current memcg to avoid creating mismatches :\, or rework how memcg
accounting works for page tables - e.g. move the memcg from the struct
page to the mm_struct so that each frag can be accounted differently.

> > Can arches opt in to RCU freeing page table support and still keep
> > your series sane?
> 
> Yes, or perhaps we mean different things: I thought most architectures
> are already freeing page tables by RCU.  s390 included.
> "git grep MMU_GATHER_RCU_TABLE_FREE" shows plenty of selects.

MMU_GATHER_RCU_TABLE_FREE is a very confusing option. What it really
says is that the architecture doesn't do an IPI so we sometimes use
RCU as a replacement for the IPI, but not always.

Specifically, this means it doesn't allow RCU reading of the page
tables. You still have to take the IPI-blocking interrupt-disable "lock"
to read page tables, even if MMU_GATHER_RCU_TABLE_FREE is set.
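
To make that distinction concrete, here is a toy userspace model, not
kernel code. Every name in it (model_cpu, model_set_reader(),
model_ipi_free_may_proceed(), model_rcu_free_may_proceed()) is
hypothetical, and it assumes that an interrupts-off region also holds
off an RCU grace period. It only shows which kind of reader each
freeing scheme actually waits for:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_MODEL_CPUS 2

/* Hypothetical per-CPU reader state; none of these names are kernel APIs. */
struct model_cpu {
	bool irqs_disabled;	/* models a local_irq_save() section */
	bool in_rcu_reader;	/* models an rcu_read_lock() section */
};

static struct model_cpu cpus[NR_MODEL_CPUS];

/* Record what kind of read-side section a modelled CPU is currently in. */
static void model_set_reader(size_t cpu, bool irqs_off, bool rcu)
{
	assert(cpu < NR_MODEL_CPUS);
	cpus[cpu].irqs_disabled = irqs_off;
	cpus[cpu].in_rcu_reader = rcu;
}

/*
 * IPI-style free: the IPI is only held off by CPUs that have interrupts
 * disabled, so a reader that merely holds rcu_read_lock() is not waited for.
 */
static bool model_ipi_free_may_proceed(void)
{
	for (size_t i = 0; i < NR_MODEL_CPUS; i++)
		if (cpus[i].irqs_disabled)
			return false;
	return true;
}

/*
 * RCU-style free: the grace period waits for RCU readers, and (by the
 * assumption stated above) for interrupts-off regions as well.
 */
static bool model_rcu_free_may_proceed(void)
{
	for (size_t i = 0; i < NR_MODEL_CPUS; i++)
		if (cpus[i].in_rcu_reader || cpus[i].irqs_disabled)
			return false;
	return true;
}
```

In this model a reader that only holds rcu_read_lock() does not stop
the IPI-style free from going ahead, which is why the interrupt-disable
is still required as long as the IPI path can be taken at all.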

IMHO I would be a lot happier with what you were trying to do here if
it came along with full RCU enabling of page tables so that we could
say that the rcu_read_lock() is sufficient locking to read page tables
*always*.

I haven't really worked out how this series is structured such that we
could introduce rcu_read_lock() in only one specific place..

My query was simpler - if we could find enough space to put a rcu_head
in the ptdesc for many architectures, and thus *always* RCU free on
many architectures, could you do what you want but disable it on s390
and POWER which would still have to rely on an RCU head allocation and
a backup IPI?
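
As a rough illustration of why the embedded rcu_head matters, here is a
userspace sketch. All the model_* names are made up for the example,
and model_call_rcu() / model_grace_period() are trivial stand-ins for
the real call_rcu() machinery, not its implementation. The point is
only that with the head living inside the descriptor, the free path
allocates nothing and so has no failure case that needs a fallback:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head. */
struct model_rcu_head {
	struct model_rcu_head *next;
	void (*func)(struct model_rcu_head *head);
};

static struct model_rcu_head *model_pending;
static int model_tables_freed;

/* Queue a callback; needs no memory beyond the caller-provided head. */
static void model_call_rcu(struct model_rcu_head *head,
			   void (*func)(struct model_rcu_head *head))
{
	head->func = func;
	head->next = model_pending;
	model_pending = head;
}

/* Pretend a grace period has elapsed: run every queued callback. */
static int model_grace_period(void)
{
	int ran = 0;

	while (model_pending) {
		struct model_rcu_head *head = model_pending;

		model_pending = head->next;
		head->func(head);
		ran++;
	}
	return ran;
}

/* Hypothetical page-table descriptor with room for an embedded head. */
struct model_ptdesc {
	void *table;
	struct model_rcu_head rcu;
};

static struct model_ptdesc *model_ptdesc_alloc(void)
{
	struct model_ptdesc *pt = malloc(sizeof(*pt));

	if (!pt)
		return NULL;
	pt->table = calloc(1, 4096);
	if (!pt->table) {
		free(pt);
		return NULL;
	}
	return pt;
}

/* Callback: recover the descriptor from its embedded head and free it. */
static void model_ptdesc_free_cb(struct model_rcu_head *head)
{
	struct model_ptdesc *pt = (struct model_ptdesc *)
		((char *)head - offsetof(struct model_ptdesc, rcu));

	free(pt->table);
	free(pt);
	model_tables_freed++;
}

/* Deferred free: cannot fail, because nothing is allocated here. */
static void model_ptdesc_free_deferred(struct model_ptdesc *pt)
{
	assert(pt != NULL);
	model_call_rcu(&pt->rcu, model_ptdesc_free_cb);
}
```

An architecture without room for the head would have to allocate one at
free time instead, and that allocation failing is what forces the
backup IPI.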

Jason