On 2/2/21 9:46 AM, Yang Shi wrote: > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote: >> On 1/29/21 12:46 PM, Yang Shi wrote: >> ... >>>> int next_demotion_node(int node) >>>> { >>>> - return node_demotion[node]; >>>> + /* >>>> + * node_demotion[] is updated without excluding >>>> + * this function from running. READ_ONCE() avoids >>>> + * reading multiple, inconsistent 'node' values >>>> + * during an update. >>>> + */ >>> Don't we need a smp_rmb() here? The single write barrier might be not >>> enough in migration target set. Typically a write barrier should be >>> used in pairs with a read barrier. >> I don't think we need one, practically. >> >> Since there is no locking against node_demotion[] updates, although a >> smp_rmb() would ensure that this read is up-to-date, it could change >> freely after the smp_rmb(). > Yes, but this should be able to guarantee we see "disable + after" > state. Isn't it more preferred? I'm debating how much of this is theoretical versus actually applicable to what we have in the kernel. But, I'm generally worried about code like this that *looks* innocuous: int terminal_node = start_node; int next_node = next_demotion_node(start_node); while (next_node != NUMA_NO_NODE) { next_node = terminal_node; terminal_node = next_demotion_node(terminal_node); } That could loop forever if it doesn't go out to memory during each loop. However, if node_demotion[] *is* read on every trip through the loop, it will eventually terminate. READ_ONCE() can guarantee that, as could compiler barriers like smp_rmb(). But, after staring at it for a while, I think RCU may be the most clearly correct way to solve the problem. Or, maybe just throw in the towel and do a spinlock like a normal human being. :) Anyway, here's what I was thinking I'd do with RCU: 1. node_demotion[] starts off in a "before" state 2. Writers to node_demotion[] first set the whole array such that it will not induce cycles, like setting every member to NUMA_NO_NODE. (the "disable" state) 3. Writer calls synchronize_rcu(). After it returns, no readers can observe the "before" values. 4. Writer sets the actual values it wants. (the "after" state) 5. Readers use rcu_read_lock() over any critical section where they read the array. They are guaranteed to only see one of the two adjacent states (before+disabled, or disabled+after), but never before+after within one RCU read-side critical section. 6. Readers use READ_ONCE() or some other compiler directive to ensure the compiler does not reorder or combine reads from multiple, adjacent RCU read-side critical sections. Although, after writing this, plain old locks are sounding awfully tempting.