On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> There is a race condition between inodegc queue and inodegc worker where
> the cpumask bit may not be set when concurrent operations occur.

What problems does this cause? i.e. how do we identify systems
hitting this issue?

> Current problematic sequence:
>
>   CPU0                                  CPU1
>   --------------------                  ---------------------
>   xfs_inodegc_queue()                   xfs_inodegc_worker()
>                                           llist_del_all(&gc->list)
>     llist_add(&ip->i_gclist, &gc->list)
>     cpumask_test_and_set_cpu()
>                                           cpumask_clear_cpu()
>                  < cpumask not set >
>
> Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> proper ordering. This change ensures that when the worker thread clears
> the cpumask, any concurrent queue operations will either properly set
> the cpumask bit or have already emptied the list.
>
> Also remove the unnecessary smp_mb__{before/after}_atomic() barriers,
> since the llist_* operations already provide the required ordering
> semantics. This makes the code cleaner.

IIRC, the barriers were for ordering the cpumask bitmap ops against
llist operations. There are calls elsewhere to for_each_cpu() that
then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
so on relaxed architectures (like alpha) I think we have to ensure
the bitmask ops carried full ordering against the independent llist
ops themselves.

i.e. llist_empty() just uses READ_ONCE, so it only orders against
other llist ops and won't guarantee any specific ordering against
cpumask modifications.

I could be remembering incorrectly, but I think that was the
original reason for the barriers. Can you please confirm that the
cpumask iteration/llist_empty checks do not need these bitmask
barriers anymore? If that's ok, then the change looks fine.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx