On Sun, Jan 05, 2025 at 11:21:29AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 04, 2025 at 08:56:19PM +0000, Karim Manaouil wrote:
> > On Thu, Jan 02, 2025 at 11:16:11AM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 02, 2025 at 06:23:43PM +0000, Karim Manaouil wrote:
> > > > Hi Paul,
> > > >
> > > > First, I wish you a happy new year!
> > >
> > > Hello, Karim, and a very happy square new year to you and yours as
> > > well!  I have added the rcu email list in case someone else has ideas.
> > >
> > > > I am working on implementing page migration for some types of kernel
> > > > memory.  My technique is to remap those kernel pages in the
> > > > vmap/vmalloc area and allow the kernel to take page faults during
> > > > page migration.
> > > >
> > > > However, I have the problem of spinlocks and RCU critical sections.
> > > > A page fault can occur while the kernel is inside an RCU read-side
> > > > critical section.  For example, in fs/dcache.c:dget_parent():
> > > >
> > > > 	rcu_read_lock();
> > > > 	seq = raw_seqcount_begin(&dentry->d_seq);
> > > > 	rcu_read_unlock();
> > > >
> > > > If the kernel page that "dentry" belongs to is undergoing migration,
> > > > a page fault could occur on the CPU executing the code above, when
> > > > the migration thread (running on another CPU) clears the
> > > > corresponding PTE entry in vmap and flushes the TLB (but the new
> > > > page is not mapped yet).
> > > >
> > > > The page table entries are replaced by migration entries, and the
> > > > CPU on which the page fault happened will have to wait or spin in
> > > > the page fault handler until the migration is complete (success or
> > > > failure).
> > > >
> > > > With classical RCU, I cannot wait in the page fault handler (like
> > > > it's done in migration_entry_wait()) because that's explicit
> > > > blocking and that's prohibited.
> > >
> > > Indeed it is, and by design.
> > >
> > > > I tried to spin in the fault handler with something like
> > > >
> > > > 	for (;;) {
> > > > 		pte = ptep_get_lockless(ptep);
> > > > 		if (pte_none(pte) || pte_present(pte))
> > > > 			break;
> > > > 		cpu_relax();
> > > > 	}
> > > >
> > > > But the entire system stopped working (I assume because
> > > > synchronize_rcu() on other CPUs is waiting for us and we are
> > > > waiting for other CPUs, so a deadlock situation).
> > > >
> > > > I realised that I need something like preempt RCU.  Would the
> > > > cpu_relax() above work with preempt RCU?
> > >
> > > You would need something like cond_resched(), but you cannot use this
> > > within an RCU read-side critical section.  And spinning in this manner
> > > within a fault handler is not a good idea.  You will likely get
> > > lockups and stalls of various sorts.
> > >
> > > Preemptible RCU permits preemption, but not unconditional blocking.
> > > The reason for this is that a preempted reader can be subjected to RCU
> > > priority boosting, but if a reader were to block, priority boosting
> > > would not help.
> > >
> > > The reason that we need priority boosting to help is that blocked RCU
> > > readers stall the current RCU grace period, which means that any
> > > memory waiting to be freed continues waiting, eventually resulting in
> > > OOM.  Of course, OOMs are not good for your kernel's uptime, hence the
> > > restriction against general blocking in RCU readers.
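To make the deadlock I describe just below concrete, the two sides boil
down to roughly the following illustrative fragments (this is not code
from my series; "ptep" here stands for the vmap PTE of the page being
migrated):

	/*
	 * CPU0: an RCU reader touches a vmalloc-mapped object whose PTE
	 * has already been replaced by a migration entry, and the fault
	 * handler spins waiting for the mapping to come back.
	 */
	rcu_read_lock();
	for (;;) {
		pte_t pte = ptep_get_lockless(ptep);

		if (pte_none(pte) || pte_present(pte))
			break;			/* never happens, see CPU1 */
		cpu_relax();
	}
	rcu_read_unlock();

	/*
	 * CPU1: the migration/compaction path needs memory before it can
	 * map the new page, and reclaim ends up waiting for a grace
	 * period, i.e. for CPU0's reader.
	 */
	synchronize_rcu();			/* never returns */
	/* ...copy the page and restore the PTE (never reached)... */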
> > I believe not only OOM, but it could also lead to a deadlock, as I
> > observed in my small experiments.  Basically, one CPU (0) was blocked
> > inside an RCU region, waiting for another CPU (1) running the page
> > migration/compaction thread, but the migration thread itself (on CPU1)
> > was trying to free some memory and had to first wait for the existing
> > RCU readers, amongst them CPU0, and that led to circular waiting (CPU0
> > waiting for CPU1, but CPU1 ends up waiting for CPU0).
>
> Yes, making an RCU read-side critical section wait, whether directly or
> indirectly, on an RCU grace period is a good way to achieve deadlock.
>
> > > Please note that spinlocks have this same restriction.  Sleeping
> > > while holding a spinlock can result in deadlock, which is even worse
> > > for your kernel's uptime.
> > >
> > > > Do you have any ideas for how to properly approach this problem?
> > >
> > > Here are a few to start with:
> > >
> > > 0.	Look at the existing code that migrates processes and/or
> > > 	kernels from one system to another, and then do whatever they do.
> > >
> > > 1.	Allocate the needed memory up front, before acquiring the
> > > 	spinlocks and before entering the RCU readers.
> > >
> > > 2.	Move the existing spinlocks to mutexes and the existing uses
> > > 	of RCU to SRCU, perhaps using srcu_read_lock_lite().  But note
> > > 	that a great deal of review and benchmarking will be necessary
> > > 	to prove that there are no regressions, and that changes of this
> > > 	sort in mm almost always result in regressions.
> > >
> > > 	So I strongly advise you not to take this approach lightly.
> > >
> > > 3.	Your ideas here!
> >
> > For (0), it seems that most of the solutions along those lines are
> > "stop the world" kind of solutions, which is not ideal.
>
> How about the solutions that are not "stop the world"?
>
> > I thought about (2) before I sent you the email, but then I was
> > skeptical for the same reasons you listed.
>
> Fair enough!
>
> > I believe that another variation of (2) is the solution to this
> > problem.
> >
> > In fact, there is a very small window in which an RCU reader can
> > trigger a page fault, which is the window between flushing the TLB
> > and updating the page table entry.
> >
> > This makes me think that, to prevent the deadlock situation above, I
> > need to make sure that the page migration/compaction path never waits
> > for RCU readers.  In this case, the RCU reader will wait (spinning)
> > for a bounded amount of time, which is the time needed to close the
> > window described above: copy the contents of the old page to the new
> > page, update the page table entry and make the writes visible to the
> > spinning RCU reader; no blocking, no scheduling and no grace periods
> > to wait for.
> >
> > Do you think this is a sane approach?  Obviously one downside is
> > burning CPU cycles while spinning, but it should be a small enough
> > amount of time.
>
> Maybe?
>
> If you are running in a guest OS, can vCPU preemption cause trouble?
> There are lots of moving parts in TLB flushing, so have you checked all
> the ones that you need to in this case?

Great points!  I'll investigate the vCPU preemption case.  Thanks, Paul!

> One (rough!) rule of thumb is that if you can use a spinlock to protect
> the race window you are concerned about, then it is OK to spin waiting
> for that race window from within an RCU read-side critical section.
>
> But as always, this rule is no substitute for understanding the
> interactions.

I think that should be the case.  I am trying to run real-world tests and
see how it goes.
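Concretely, the protocol I have in mind looks roughly like the sketch
below.  It is only a sketch: locking, refcounting, error handling and the
actual migration-entry encoding are omitted (I simply clear the PTE here),
and the two helper names are placeholders, not existing functions.

	/*
	 * Migration side (sketch).  Between the TLB flush and the final
	 * set_pte_at(), nothing may block or wait, directly or
	 * indirectly, for an RCU grace period: faulting readers may be
	 * spinning inside rcu_read_lock() until the window is closed.
	 */
	static void vmap_migrate_one_page(unsigned long addr, pte_t *ptep,
					  struct page *oldpage,
					  struct page *newpage)
	{
		/* Open the window: readers now fault on this address. */
		ptep_get_and_clear(&init_mm, addr, ptep);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/* Bounded work only: copy the contents across. */
		copy_highpage(newpage, oldpage);

		/* Close the window: spinning readers see a present PTE. */
		set_pte_at(&init_mm, addr, ptep, mk_pte(newpage, PAGE_KERNEL));
	}

	/*
	 * Fault side (sketch).  Spinning here inside an RCU read-side
	 * critical section is tolerable only because the window above is
	 * bounded and the migration side never waits for readers while
	 * the window is open.
	 */
	static void vmap_fault_wait_for_migration(pte_t *ptep)
	{
		pte_t pte;

		for (;;) {
			pte = ptep_get_lockless(ptep);
			if (pte_none(pte) || pte_present(pte))
				break;
			cpu_relax();
		}
	}

The key invariant is that nothing inside the open window can end up back
in reclaim, in synchronize_rcu(), or in any other grace-period wait.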
I am cleaning SLUB to make it easier to isolate slab folios, and then I'll
have the chance to get some early results/observations.

Thanks for the feedback, Paul!

> 							Thanx, Paul
>
> > > > Last question, do I need the -rt kernel for preempt RCU?
> > >
> > > No, CONFIG_PREEMPT=y suffices.
> > >
> > > Note that CONFIG_PREEMPT_RT=y, AKA -rt, also makes spinlocks (but not
> > > raw spinlocks) be limited sleeplocks, and thus allows RCU read-side
> > > critical sections to block when acquiring these sleeping "spinlocks".
> > > But this is OK, because all of this is still subject to priority
> > > boosting.
> > >
> > > 							Thanx, Paul
> >
> > Thank you!
> >
> > --
> > Best,
> > Karim
> > Edinburgh University

--
Best,
Karim
Edinburgh University