On Tue, Feb 2, 2021 at 4:43 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 2/2/21 9:46 AM, Yang Shi wrote: > > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > >> On 1/29/21 12:46 PM, Yang Shi wrote: > >> ... > >>>> int next_demotion_node(int node) > >>>> { > >>>> - return node_demotion[node]; > >>>> + /* > >>>> + * node_demotion[] is updated without excluding > >>>> + * this function from running. READ_ONCE() avoids > >>>> + * reading multiple, inconsistent 'node' values > >>>> + * during an update. > >>>> + */ > >>> Don't we need a smp_rmb() here? The single write barrier might be not > >>> enough in migration target set. Typically a write barrier should be > >>> used in pairs with a read barrier. > >> I don't think we need one, practically. > >> > >> Since there is no locking against node_demotion[] updates, although a > >> smp_rmb() would ensure that this read is up-to-date, it could change > >> freely after the smp_rmb(). > > Yes, but this should be able to guarantee we see "disable + after" > > state. Isn't it more preferred? > > I'm debating how much of this is theoretical versus actually applicable > to what we have in the kernel. But, I'm generally worried about code > like this that *looks* innocuous: > > int terminal_node = start_node; > int next_node = next_demotion_node(start_node); > while (next_node != NUMA_NO_NODE) { > next_node = terminal_node; > terminal_node = next_demotion_node(terminal_node); > } > > That could loop forever if it doesn't go out to memory during each loop. > > However, if node_demotion[] *is* read on every trip through the loop, it > will eventually terminate. READ_ONCE() can guarantee that, as could > compiler barriers like smp_rmb(). > > But, after staring at it for a while, I think RCU may be the most > clearly correct way to solve the problem. Or, maybe just throw in the > towel and do a spinlock like a normal human being. :) > > Anyway, here's what I was thinking I'd do with RCU: > > 1. node_demotion[] starts off in a "before" state > 2. Writers to node_demotion[] first set the whole array such that > it will not induce cycles, like setting every member to > NUMA_NO_NODE. (the "disable" state) > 3. Writer calls synchronize_rcu(). After it returns, no readers can > observe the "before" values. > 4. Writer sets the actual values it wants. (the "after" state) > 5. Readers use rcu_read_lock() over any critical section where they > read the array. They are guaranteed to only see one of the two > adjacent states (before+disabled, or disabled+after), but never > before+after within one RCU read-side critical section. > 6. Readers use READ_ONCE() or some other compiler directive to ensure > the compiler does not reorder or combine reads from multiple, > adjacent RCU read-side critical sections. Makes sense to me. > > Although, after writing this, plain old locks are sounding awfully tempting.