Re: Taking page faults in RCU critical sections

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Thu, 2 Jan 2025 19:58:09 -0500

Hi Karim and Paul,

On Thu, Jan 2, 2025 at 2:16 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> On Thu, Jan 02, 2025 at 06:23:43PM +0000, Karim Manaouil wrote:
> > First, I wish you a happy new year!
> Hello, Karim, and a very happy square new year to you and yours as well!
> I have added the rcu email list in case someone else has ideas.

Happy new year wishes to you all as well!

> > I am working on implementing page migration for some types of kernel
> > memory. My technique is to remap those kernel pages in vmap/vmalloc
> > area and allow the kernel to take page faults during page migration.
> >
> > However, I have the problem of spinlocks and RCU critical sections.
> > A page fault can occur while the kernel is inside an RCU read critical
> > section. For example, in fs/dcache.c:dget_parent():
> >
> > rcu_read_lock();
> > seq = raw_seqcount_begin(&dentry->d_seq);
> > rcu_read_unlock();
> >
> > If the kernel page to which "dentry" belongs is undergoing migration,
> > a page fault could occur on the CPU executing the code above, when the
> > migration thread (running on another CPU) clears the corresponding
> > PTE entry in vmap and flushes the TLB (but the new page is not mapped
> > yet).
> >
> > The page table entries are replaced by migration entries, and the CPU,
> > on which the page fault happened, will have to wait or spin in the page
> > fault handler until the migration is complete (success or failure).
> >
> > With classical RCU, I cannot wait in the page fault handler (like it's
> > done in migration_entry_wait()) because that's explicit blocking and
> > that's prohibited.
>
> Indeed it is, and by design.

True. Interesting problem.

> > Do you have any ideas for how to properly approach this problem?
>
> Here are a few to start with:
>
> 0.      Look at the existing code that migrates processes and/or kernels
>         from one system to another, and then do whatever they do.
>
> 1.      Allocate the needed memory up front, before acquiring the
>         spinlocks and before entering the RCU readers.
>
> 2.      Move the existing spinlocks to mutexes and the existing uses
>         of RCU to SRCU, perhaps using srcu_read_lock_lite().  But note
>         that a great deal of review and benchmarking will be necessary
>         to prove that there are no regressions.  And that changes of
>         this sort in mm almost always result in regressions.
>
>         So I strongly advise you not to take this approach lightly.
>
> 3.      Your ideas here!

I am a bit nervous about whether it is entirely possible to eliminate
page-fault rabbit holes. What if the page fault handler itself causes
a fault because it accesses some memory that is now backed by an
invalid PTE?

To that end, I was wondering if any of the following approaches are possible:

1. When the memory is being migrated, allow the old memory to still be
accessible until the migration completes. Then "atomically" modify the
PTE to point to the new memory. Orchestrate this in a way that no
fault should occur.

2. Try to see if page faults can be avoided entirely by not executing
the offending code. In Android GC, a page movement algorithm is being
explored AFAIK where there is a "stop the world" pause preventing code
from executing until the memory is moved to the new location. mremap()
is used in userspace to do this move so I admit this is a bit
tangential, but conceptually the ideas are similar.
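
To make approach 1 a bit more concrete, here is a minimal userspace
sketch of the copy-then-switch idea, using a C11 atomic pointer as a
stand-in for the PTE. The names mapping, read_byte(), migrate() and
PAGE_SZ are made up for illustration; this is an analogy under those
assumptions, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096	/* hypothetical page size for this sketch */

/* Stand-in for a PTE: one word that every reader dereferences. */
static _Atomic(unsigned char *) mapping;

/* Reader: never blocks, and sees either the old or the new copy. */
static unsigned char read_byte(size_t off)
{
	unsigned char *page =
		atomic_load_explicit(&mapping, memory_order_acquire);

	return page[off];
}

/*
 * Migrate: copy while the old page stays reachable, then publish the
 * new location with a single atomic exchange. Returns the old page,
 * which the caller must keep alive until no reader can still hold it.
 */
static unsigned char *migrate(unsigned char *new_page)
{
	unsigned char *old =
		atomic_load_explicit(&mapping, memory_order_relaxed);

	assert(old != NULL);
	memcpy(new_page, old, PAGE_SZ);
	return atomic_exchange_explicit(&mapping, new_page,
					memory_order_acq_rel);
}
```

The sketch sidesteps the hard parts: stores to the old page that race
with the memcpy() are lost, and the old page can only be reused once no
reader can still be holding a pointer to it (in the kernel, stale TLB
entries would play that role).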

> > Last question, do I need the -rt kernel for preempt RCU?
>
> No, CONFIG_PREEMPT=y suffices.
>
> Note that CONFIG_PREEMPT_RT=y, AKA -rt, also makes spinlocks (but not
> raw spinlocks) be limited sleeplocks, and thus allows RCU read-side
> critical sections to block when acquiring these sleeping "spinlocks".
> But this is OK, because all of this is still subject to priority boosting.

Shouldn't PREEMPT_RT kernels still throw warnings, though, when
rcu_note_context_switch() is called inside an RCU read-side critical
section?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/rcu/tree_plugin.h#n331

I don't run a PREEMPT_RT kernel myself so I can't confirm if these
warnings somehow don't appear, but I figured it would be good to
double check in this discussion.

thanks,

- Joel