Re: Taking page faults in RCU critical sections

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 2 Jan 2025 11:16:11 -0800

On Thu, Jan 02, 2025 at 06:23:43PM +0000, Karim Manaouil wrote:
> Hi Paul,
> 
> First, I wish you a happy new year!

Hello, Karim, and a very happy square new year to you and yours as well!
I have added the rcu email list in case someone else has ideas.

> I am working on implementing page migration for some types of kernel
> memory. My technique is to remap those kernel pages in vmap/vmalloc 
> area and allow the kernel to take page faults during page migration.
> 
> However, I have a problem with spinlocks and RCU critical sections.
> A page fault can occur while the kernel is inside an RCU read-side
> critical section. For example, in fs/dcache.c:dget_parent():
> 
> rcu_read_lock();
> seq = raw_seqcount_begin(&dentry->d_seq);
> rcu_read_unlock();
> 
> If the kernel page to which "dentry" belongs is undergoing migration,
> a page fault could occur on the CPU executing the code above, when the
> migration thread (running on another CPU) clears the corresponding 
> PTE entry in vmap and flushes the TLB (but the new page is not mapped
> yet).
> 
> The page table entries are replaced by migration entries, and the CPU,
> on which the page fault happened, will have to wait or spin in the page
> fault handler until the migration is complete (success or failure).
> 
> With classical RCU, I cannot wait in the page fault handler (like it's
> done in migration_entry_wait()) because that's explicit blocking and
> that's prohibited.

Indeed it is, and by design.

> I tried to spin in the fault handler with something like
> 
> for (;;) {
> 	pte = ptep_get_lockless(ptep);
> 	if (pte_none(pte) || pte_present(pte))
> 		break;
> 	cpu_relax();
> }
> 
> But the entire system stopped working (I assume because synchronize_rcu()
> on other CPUs is waiting for us and we are waiting for other CPUs, so a
> deadlock situation).
> 
> I realised that I need something like preempt RCU. Would the cpu_relax()
> above work with preempt RCU?

You would need something like cond_resched(), but you cannot use this
within an RCU read-side critical section.  And spinning in this manner
within a fault handler is not a good idea.  You will likely get lockups
and stalls of various sorts.

Preemptible RCU permits preemption, but not unconditional blocking.
The reason for this is that a preempted reader can be subjected to RCU
priority boosting, but if a reader were to block, priority boosting
would not help.

The reason that we need priority boosting to help is that blocked RCU
readers stall the current RCU grace period, which means that any memory
waiting to be freed continues waiting, eventually resulting in OOM.
Of course, OOMs are not good for your kernel's uptime, hence the
restriction against general blocking in RCU readers.
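
To make the stall concrete, here is a deliberately simplified,
single-threaded userspace model, not kernel code.  All of the toy_*
names are hypothetical and exist only for this illustration.  Real RCU
waits only for readers that started before the grace period began,
whereas this toy waits for every active reader, but the effect of a
stuck reader is the same: deferred frees pile up until it leaves.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of RCU state (hypothetical, for illustration only). */
struct toy_rcu {
	int active_readers;	/* readers currently inside a critical section */
	size_t pending_frees;	/* objects waiting for a grace period to end */
};

/* Enter a read-side critical section. */
void toy_read_lock(struct toy_rcu *r)
{
	r->active_readers++;
}

/* Leave a read-side critical section. */
void toy_read_unlock(struct toy_rcu *r)
{
	assert(r->active_readers > 0);
	r->active_readers--;
}

/* Queue one object to be freed after the next grace period. */
void toy_call_rcu(struct toy_rcu *r)
{
	r->pending_frees++;
}

/*
 * Try to end the grace period.  Returns the number of objects freed,
 * which is zero for as long as any reader is still inside its
 * critical section.
 */
size_t toy_try_end_gp(struct toy_rcu *r)
{
	size_t freed;

	if (r->active_readers > 0)
		return 0;	/* a reader is stuck: nothing can be freed */
	freed = r->pending_frees;
	r->pending_frees = 0;
	return freed;
}
```

In this model, a reader that spins or blocks between toy_read_lock()
and toy_read_unlock() keeps toy_try_end_gp() returning zero no matter
how many objects have been queued, which is the memory growth that
eventually leads to OOM in the real thing.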

Please note that spinlocks have this same restriction.  Sleeping while
holding a spinlock can result in deadlock, which is even worse for your
kernel's uptime.

> Do you have any ideas for how to properly approach this problem?

Here are a few to start with:

0.	Look at the existing code that migrates processes and/or kernels
	from one system to another, and then do whatever they do.

1.	Allocate the needed memory up front, before acquiring the
	spinlocks and before entering the RCU readers.

2.	Move the existing spinlocks to mutexes and the existing uses
	of RCU to SRCU, perhaps using srcu_read_lock_lite().  But note
	that a great deal of review and benchmarking will be necessary
	to prove that there are no regressions, and that changes of
	this sort in mm almost always result in regressions.

	So I strongly advise you not to take this approach lightly.

3.	Your ideas here!
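
As a rough illustration of item 1, here is a minimal userspace sketch
of the allocate-up-front pattern.  It is an assumption-laden stand-in
rather than kernel code: a C11 atomic_flag plays the role of the
spinlock, malloc() plays the role of a blocking allocation, and the
names node_prealloc(), list_insert(), and list_sum() are hypothetical.
The point is only the ordering: everything that might block happens
before the lock is taken, so the critical section just links pointers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical list node, for illustration only. */
struct node {
	int value;
	struct node *next;
};

/* Userspace stand-in for a kernel spinlock. */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct node *list_head;

static void list_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&list_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void list_lock_release(void)
{
	atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/*
 * Step 1: do everything that might block before taking the lock.
 * Returns NULL if the allocation fails.
 */
struct node *node_prealloc(int value)
{
	struct node *n = malloc(sizeof(*n));

	if (n) {
		n->value = value;
		n->next = NULL;
	}
	return n;
}

/*
 * Step 2: the critical section only links the preallocated node,
 * so nothing inside it can block.
 */
void list_insert(struct node *n)
{
	assert(n != NULL);	/* caller must have preallocated */
	list_lock_acquire();
	n->next = list_head;
	list_head = n;
	list_lock_release();
}

/* Sum the list under the lock, to check the result. */
int list_sum(void)
{
	int sum = 0;

	list_lock_acquire();
	for (struct node *n = list_head; n; n = n->next)
		sum += n->value;
	list_lock_release();
	return sum;
}
```

A caller would invoke node_prealloc() first, handle a NULL return
outside of any lock, and only then call list_insert().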

> Last question, do I need the -rt kernel for preempt RCU?

No, CONFIG_PREEMPT=y suffices.

Note that CONFIG_PREEMPT_RT=y, AKA -rt, also turns spinlocks (but not
raw spinlocks) into limited sleeplocks, and thus allows RCU read-side
critical sections to block when acquiring these sleeping "spinlocks".
But this is OK, because all of this is still subject to priority boosting.

							Thanx, Paul