On Thu, Jun 15, 2023 at 02:09:30PM -0700, Hugh Dickins wrote:
> On Thu, 15 Jun 2023, Jason Gunthorpe wrote:
> > On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:
> >
> > > I guess the best thing would be to modify kernel/fork.c to allow the
> > > architecture to override free_mm(), and arch/s390 call_rcu to free mm.
> > > But as a quick and dirty s390-end workaround, how about:
> >
> > RCU callbacks are not ordered so that doesn't seem like it helps..
>
> Thanks, that's an interesting and important point, which I need to knock
> into my head better.
>
> But can you show me where that's handled in the existing mm/mmu_gather.c
> include/asm-generic/tlb.h framework?  I don't see any rcu_barrier()s
> there, yet don't the pmd_huge_pte pointers point into pud page tables
> freed shortly afterwards also by RCU?

I don't know anything about the pmd_huge_pte stuff.. I was expecting
it got cleaned up explicitly before things reached the call_rcu?

Where is it touched from a call_rcu callback?

> > Making the page frag pool global (per-cpu global I guess) would also
> > remove the need to reach back to the freeable mm_struct and reduce the
> > need for struct page memory. This views it as a special kind of
> > kmemcache.
>
> I haven't thought in that direction at all.  Hmm.  Or did I think of
> it once, but discarded for accounting reasons - IIRC (haven't rechecked)
> page table pages are charged to memcg, and counted for meminfo and other(?)
> purposes: if the fragments are all lumped into a global pool, we
> lose that.

You'd have to search the free list for fragments that match the
current memcg to avoid creating mismatches :\, or rework how memcg
accounting works for page tables - e.g. move the memcg from the struct
page to the mm_struct so that each frag can be accounted differently.

> > Can arches opt in to RCU freeing page table support and still keep
> > your series sane?
>
> Yes, or perhaps we mean different things: I thought most architectures
> are already freeing page tables by RCU.  s390 included.
> "git grep MMU_GATHER_RCU_TABLE_FREE" shows plenty of selects.

MMU_GATHER_RCU_TABLE_FREE is a very confusing option. What it really
says is that the architecture doesn't do an IPI so we sometimes use
RCU as a replacement for the IPI, but not always.

Specifically this means it doesn't allow rcu reading of the page
tables. You still have to take the IPI blocking interrupt-disable lock
to read page tables, even if MMU_GATHER_RCU_TABLE_FREE is set.

IMHO I would be a lot happier with what you were trying to do here if
it came along with full RCU enabling of page tables so that we could
say that the rcu_read_lock() is sufficient locking to read page tables
*always*.

I didn't really put together how this series works that we could
introduce rcu_read_lock() in only one specific place..

My query was simpler - if we could find enough space to put a rcu_head
in the ptdesc for many architectures, and thus *always* RCU free on
many architectures, could you do what you want but disable it on S390
and POWER which would still have to rely on an RCU head allocation and
a backup IPI?

Jason