On Sat, Jan 04, 2025 at 08:56:19PM +0000, Karim Manaouil wrote:
> On Thu, Jan 02, 2025 at 11:16:11AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 02, 2025 at 06:23:43PM +0000, Karim Manaouil wrote:
> > > Hi Paul,
> > >
> > > First, I wish you a happy new year!
> >
> > Hello, Karim, and a very happy square new year to you and yours as
> > well!  I have added the rcu email list in case someone else has ideas.
> >
> > > I am working on implementing page migration for some types of kernel
> > > memory.  My technique is to remap those kernel pages in the
> > > vmap/vmalloc area and allow the kernel to take page faults during
> > > page migration.
> > >
> > > However, I have the problem of spinlocks and RCU critical sections.
> > > A page fault can occur while the kernel is inside an RCU read-side
> > > critical section.  For example, in fs/dcache.c:dget_parent():
> > >
> > >	rcu_read_lock();
> > >	seq = raw_seqcount_begin(&dentry->d_seq);
> > >	/* ... dereference dentry->d_parent and friends ... */
> > >	rcu_read_unlock();
> > >
> > > If the kernel page that "dentry" belongs to is undergoing migration,
> > > a page fault could occur on the CPU executing the code above, when
> > > the migration thread (running on another CPU) clears the
> > > corresponding PTE entry in vmap and flushes the TLB (but the new
> > > page is not mapped yet).
> > >
> > > The page table entries are replaced by migration entries, and the
> > > CPU on which the page fault happened will have to wait or spin in
> > > the page fault handler until the migration is complete (success or
> > > failure).
> > >
> > > With classical RCU, I cannot wait in the page fault handler (like
> > > it is done in migration_entry_wait()) because that is explicit
> > > blocking, and that is prohibited.
> >
> > Indeed it is, and by design.
> >
> > > I tried to spin in the fault handler with something like:
> > >
> > >	for (;;) {
> > >		/* Spin only while a migration entry is installed. */
> > >		pte = ptep_get_lockless(ptep);
> > >		if (pte_none(pte) || pte_present(pte))
> > >			break;
> > >		cpu_relax();
> > >	}
> > >
> > > But the entire system stopped working (I assume because
> > > synchronize_rcu() on other CPUs is waiting for us and we are waiting
> > > for other CPUs, so a deadlock situation).
> > >
> > > I realised that I need something like preemptible RCU.  Would the
> > > cpu_relax() above work with preemptible RCU?
> >
> > You would need something like cond_resched(), but you cannot use this
> > within an RCU read-side critical section.  And spinning in this manner
> > within a fault handler is not a good idea.  You will likely get
> > lockups and stalls of various sorts.
> >
> > Preemptible RCU permits preemption, but not unconditional blocking.
> > The reason for this is that a preempted reader can be subjected to RCU
> > priority boosting, but if a reader were to block, priority boosting
> > would not help.
> >
> > The reason that we need priority boosting to help is that blocked RCU
> > readers stall the current RCU grace period, which means that any
> > memory waiting to be freed continues waiting, eventually resulting in
> > OOM.  Of course, OOMs are not good for your kernel's uptime, hence the
> > restriction against general blocking in RCU readers.
>
> I believe not only OOM, but it could also lead to a deadlock, as I
> observed in my small experiments.  Basically, one CPU (0) was blocked
> inside an RCU region, waiting for another CPU (1), running the page
> migration/compaction thread, but the migration thread itself (on CPU 1)
> was trying to free some memory and it had to first wait for the
> existing RCU readers, amongst them CPU 0, and that led to circular
> waiting (CPU 0 waiting for CPU 1, but CPU 1 ends up waiting for CPU 0).

Yes, making an RCU read-side critical section wait, whether directly or
indirectly, on an RCU grace period is a good way to achieve deadlock.

> > Please note that spinlocks have this same restriction.  Sleeping
> > while holding a spinlock can result in deadlock, which is even worse
> > for your kernel's uptime.
> >
> > > Do you have any ideas for how to properly approach this problem?
> >
> > Here are a few to start with:
> >
> > 0.	Look at the existing code that migrates processes and/or
> >	kernels from one system to another, and then do whatever they do.
> >
> > 1.	Allocate the needed memory up front, before acquiring the
> >	spinlocks and before entering the RCU readers.
> >
> > 2.	Move the existing spinlocks to mutexes and the existing uses
> >	of RCU to SRCU, perhaps using srcu_read_lock_lite().  But note
> >	that a great deal of review and benchmarking will be necessary
> >	to prove that there are no regressions.  And that changes of
> >	this sort in mm almost always result in regressions.
> >
> >	So I strongly advise you not to take this approach lightly.
> >
> > 3.	Your ideas here!
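
To put a bit of flesh on (2) above, here is a minimal (untested!)
sketch of the SRCU reader/updater pattern, in which kobj_srcu,
global_kobj, and struct kobj are all made up for illustration, and
keeping in mind that a given srcu_struct must stick with either the
_lite flavor or the normal flavor for all of its readers:

	#include <linux/srcu.h>
	#include <linux/slab.h>

	DEFINE_STATIC_SRCU(kobj_srcu);		/* hypothetical */

	struct kobj {				/* hypothetical */
		int val;
	};

	static struct kobj __rcu *global_kobj;	/* hypothetical */

	/* Reader: unlike RCU, may block, say, in a fault handler. */
	static int kobj_read_val(void)
	{
		struct kobj *p;
		int idx, val = -1;

		idx = srcu_read_lock_lite(&kobj_srcu);
		p = srcu_dereference(global_kobj, &kobj_srcu);
		if (p)
			val = p->val;
		srcu_read_unlock_lite(&kobj_srcu, idx);
		return val;
	}

	/* Updater: waits only for kobj_srcu's readers. */
	static void kobj_replace(struct kobj *newp)
	{
		struct kobj *oldp;

		oldp = rcu_replace_pointer(global_kobj, newp, true);
		synchronize_srcu(&kobj_srcu);	/* wait for pre-existing readers */
		kfree(oldp);
	}

A reader blocked in kobj_read_val() stalls only kobj_srcu's grace
periods rather than the global ones, which is what makes waiting in
the reader tolerable.  The review-and-benchmarking caveat above
applies in full.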
> For (0), it seems that most of the solutions along those lines are
> "stop the world" kinds of solutions, which is not ideal.

How about the solutions that are not "stop the world"?

> I thought about (2) before I sent you the email, but then I was
> skeptical for the same reasons you listed.

Fair enough!

> I believe that another variation of (2) is the solution to this
> problem.
>
> In fact, there is a very small window in which an RCU reader can
> trigger a page fault, which is the window between flushing the TLB
> and updating the page table entry.
>
> This makes me think that, to prevent the deadlock situation above, I
> need to make sure that the page migration/compaction path never waits
> for RCU readers.  In this case, the RCU reader will wait (spinning)
> for a bounded amount of time, namely the time needed to close the
> window described above: copy the contents of the old page to the new
> page, update the page table entry, and make the writes visible to the
> spinning RCU reader; no blocking, no scheduling, and no grace periods
> to wait for.
>
> Do you think this is a sane approach?  Obviously, one downside is
> burning CPU cycles while spinning, but it should be a small enough
> amount of time.

Maybe?  If you are running in a guest OS, can vCPU preemption cause
trouble?  There are lots of moving parts in TLB flushing, so have you
checked all the ones that you need to in this case?

One (rough!) rule of thumb is that if you can use a spinlock to protect
the race window you are concerned about, then it is OK to spin waiting
for that race window from within an RCU read-side critical section.
But as always, this rule is no substitute for understanding the
interactions.
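
For illustration, the migration side of that window might look
something like the following, with your for (;;) loop above on the
reader side.  This is untested, and both migrate_kernel_page() and
make_kernel_migration_pte() are made-up placeholders for however you
unmap the page and encode your migration entries:

	static void migrate_kernel_page(unsigned long addr, pte_t *ptep,
					struct page *oldpage,
					struct page *newpage)
	{
		preempt_disable();	/* help keep the window bounded */

		/*
		 * Open the window: unmap the old page so that readers
		 * touching it fault and spin in the fault handler.
		 */
		ptep_get_and_clear(&init_mm, addr, ptep);
		set_pte_at(&init_mm, addr, ptep,
			   make_kernel_migration_pte(oldpage)); /* hypothetical */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/*
		 * Nothing in the window may block, allocate memory,
		 * acquire sleeping locks, or wait on a grace period.
		 */
		copy_page(page_address(newpage), page_address(oldpage));

		/* Make the copy visible before the new PTE is. */
		smp_wmb();

		/* Close the window: spinning readers see a present PTE. */
		set_pte_at(&init_mm, addr, ptep,
			   mk_pte(newpage, PAGE_KERNEL));

		preempt_enable();
	}

And note that a vCPU preempted anywhere inside that window stretches
everyone else's "bounded" spin accordingly, hence my question above.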
							Thanx, Paul

> > > Last question, do I need the -rt kernel for preemptible RCU?
> >
> > No, CONFIG_PREEMPT=y suffices.
> >
> > Note that CONFIG_PREEMPT_RT=y, AKA -rt, also makes spinlocks (but
> > not raw spinlocks) be limited sleeplocks, and thus allows RCU
> > read-side critical sections to block when acquiring these sleeping
> > "spinlocks".  But this is OK, because all of this is still subject
> > to priority boosting.
> >
> > 						Thanx, Paul
>
> Thank you!
>
> --
> Best,
> Karim
> Edinburgh University