Hi Matthew,
I don't believe unmerge_and_remove_all_rmap_items() is guaranteed to
run after an mm has been misplaced.
Consider the following interleaving:
1. Thread A executes __ksm_enter() while KSM_RUN_MERGE is set and passes the check at https://elixir.bootlin.com/linux/v5.18-rc5/source/mm/ksm.c#L2501
2. Thread B executes run_store(), sets KSM_RUN_UNMERGE, and then also runs
unmerge_and_remove_all_rmap_items() at https://elixir.bootlin.com/linux/v5.18-rc5/source/mm/ksm.c#L2900
3. Thread A completes __ksm_enter() and, because it is still on the KSM_RUN_MERGE path, misplaces the mm behind the scanning cursor at https://elixir.bootlin.com/linux/v5.18-rc5/source/mm/ksm.c#L2504
Through manual inspection I also noticed another check of the KSM_RUN_UNMERGE flag that appears racy, at https://elixir.bootlin.com/linux/v5.18-rc5/source/mm/ksm.c#L2563
Best,
Gabe
On Tue, Aug 2, 2022 at 11:45 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
On Tue, Aug 02, 2022 at 11:15:50PM +0800, Kefeng Wang wrote:
> The ksm_run is already protected by ksm_thread_mutex in run_store, we
> could add this lock in __ksm_enter() to avoid the above issue.
I don't think this is a great fix. Why not protect the store with
ksm_mmlist_lock? ie:
        mutex_lock(&ksm_thread_mutex);
        wait_while_offlining();
        if (ksm_run != flags) {
+               spin_lock(&ksm_mmlist_lock);
                ksm_run = flags;
+               spin_unlock(&ksm_mmlist_lock);
                if (flags & KSM_RUN_UNMERGE) {
                        set_current_oom_origin();
                        err = unmerge_and_remove_all_rmap_items();
                        clear_current_oom_origin();
                        if (err) {
+                               spin_lock(&ksm_mmlist_lock);
                                ksm_run = KSM_RUN_STOP;
+                               spin_unlock(&ksm_mmlist_lock);
...
(I also don't think this is a real bug, because the call to
unmerge_and_remove_all_rmap_items() will "cure" the misplacement of
items in the list, but there's value in shutting up the tools, I suppose)