On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:

> I guess the best thing would be to modify kernel/fork.c to allow the
> architecture to override free_mm(), and arch/s390 call_rcu to free mm.
> But as a quick and dirty s390-end workaround, how about:

RCU callbacks are not ordered so that doesn't seem like it helps..

synchronize_rcu would do the job since it is ordered, but I think the
performance cost is too great to just call it from mmdrop

rcu_barrier() followed by call_rcu on the mm struct might work, but I
don't know the cost

A per-cpu refcount scheme might also do the job reasonably

Making the page frag pool global (per-cpu global I guess) would also
remove the need to reach back to the freeable mm_struct and reduce the
need for struct page memory. This views it as a special kind of
kmemcache.

Another approach is to not use a rcu_head in the ptdesc at all.

With a global kmemcache-like-thing we could probably also organize
something where you don't use a rcu_head in the ptdesc, but instead
just a naked 'next' pointer. This would give enough space to have two
next pointers and the next pointers can be re-used for the normal free
list as well.

In this flow you'd thread the free'd frags onto a waterfall of global
per-cpu lists:
 - RCU free the next cycle
 - RCU free this cycle
 - Actually free

Where a single rcu_head and single call_rcu frees the entire 2nd list
to the 3rd list and then schedules the 1st list to be RCU'd next. This
eliminates the need to store a function pointer in the ptdesc at all.

It requires some global per-cpu lock on the free/alloc paths however,
but this is basically what every other arch does as it frees the page
back to the page allocator.

I suspect that two next pointers would also eliminate pt_frag_refcount
entirely as we can encode that information in the low bits of the next
pointers.
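
To make the three-list waterfall a bit more concrete, here is a rough
userspace toy model of what I mean. It is only a sketch: struct frag,
struct frag_pool and the function names are made up for illustration
and are not kernel APIs, the grace period is simulated by calling
frag_grace_period_cb() by hand instead of going through call_rcu(), and
the per-cpu lock is left out because the model is single threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table fragment; not the kernel's ptdesc. */
struct frag {
	struct frag *next;	/* naked next pointer: no rcu_head, no callback */
	int id;
};

/* One pool per CPU in the real thing; a single instance in this model. */
struct frag_pool {
	struct frag *rcu_next;	/* 1st list: RCU free the next cycle */
	struct frag *rcu_this;	/* 2nd list: RCU free this cycle */
	struct frag *free;	/* 3rd list: actually free, may be reused */
	int cb_pending;		/* models the single rcu_head being queued */
};

static void frag_push(struct frag **list, struct frag *f)
{
	f->next = *list;
	*list = f;
}

/* Free path: a frag always enters the waterfall at the 1st list. */
void frag_free(struct frag_pool *p, struct frag *f)
{
	assert(f != NULL);
	frag_push(&p->rcu_next, f);
	if (!p->cb_pending) {
		/*
		 * Nothing in flight, so start a cycle: the 1st list becomes
		 * the 2nd list and the one callback is queued (a real
		 * version would call call_rcu() here).
		 */
		p->rcu_this = p->rcu_next;
		p->rcu_next = NULL;
		p->cb_pending = 1;
	}
}

/* Stands in for the single RCU callback, run once a grace period ends. */
void frag_grace_period_cb(struct frag_pool *p)
{
	/* The whole 2nd list is now safe: move it to the 3rd list. */
	while (p->rcu_this) {
		struct frag *f = p->rcu_this;

		p->rcu_this = f->next;
		frag_push(&p->free, f);
	}
	/* Schedule the 1st list to be RCU'd next, if it holds anything. */
	p->rcu_this = p->rcu_next;
	p->rcu_next = NULL;
	p->cb_pending = (p->rcu_this != NULL);
}

/* Alloc path: only frags that reached the 3rd list can be handed out. */
struct frag *frag_alloc(struct frag_pool *p)
{
	struct frag *f = p->free;

	if (f)
		p->free = f->next;
	return f;
}
```

A frag freed while the callback is already pending sits on the 1st
list and waits for the following cycle, since the grace period in
flight may have started before it was freed. The low-bit encoding of
the refcount is not modeled here.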

> (Funnily enough, there's no problem when the stored mm gets re-used for
> a different mm, once past its spin_lock_init(&mm->context.lock);
> because

We do have that really weird "type safe by rcu" thing in the
allocators, but I don't quite know how it works.

> Powerpc is like that. I have no idea how much gets wasted that way.
> I was keen not to degrade what s390 does: which is definitely superior,
> but possibly not worth the effort.

Yeah, it would be good to understand if this is really sufficiently
beneficial..

> I'll look into it, once I understand c2c224932fd0. But may have to write
> to Vishal first, or get the v2 of my series out: if only I could work out
> a safe and easy way of unbreaking s390...

Can arches opt in to RCU freeing page table support and still keep
your series sane?

Honestly, I feel like trying to RCU enable page tables should be its
own series. It is a sufficiently tricky subject in its own right.

Jason