On Tue, Feb 20, 2024 at 4:42 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > >
> > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > > swapin the same entry at the same time, they get different pages (A, B).
> > > Before one thread (T0) finishes the swapin and installs page (A)
> > > to the PTE, another thread (T1) could finish swapin of page (B),
> > > swap_free the entry, then swap out the possibly modified page
> > > reusing the same entry. It breaks the pte_same check in (T0) because
> > > PTE value is unchanged, causing ABA problem. Thread (T0) will
> > > install a stalled page (A) into the PTE and cause data corruption.
> > >
> > > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >          if (!folio) {
> > >                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >                      __swap_count(entry) == 1) {
> > > +                        /*
> > > +                         * Prevent parallel swapin from proceeding with
> > > +                         * the cache flag. Otherwise, another thread may
> > > +                         * finish swapin first, free the entry, and swapout
> > > +                         * reusing the same entry. It's undetectable as
> > > +                         * pte_same() returns true due to entry reuse.
> > > +                         */
> > > +                        if (swapcache_prepare(entry)) {
> > > +                                /* Relax a bit to prevent rapid repeated page faults */
> > > +                                schedule_timeout_uninterruptible(1);
> >
> > Well this is unpleasant. How often can we expect this to occur?
> >
>
> The chance is very low, using the current mainline kernel and ZRAM,
> even with threads set to race on purpose using the reproducer I
> provides, for 647132 page faults it occured 1528 times (~0.2%).
>
> If I run MySQL and sysbench with 128 threads and 16G buffer pool, with
> 6G cgroup limit and 32G ZRAM, it occured 1372 times for 40 min,
> 109930201 page faults in total (~0.001%).

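To put the quoted counts in perspective, here is a rough userspace sketch
(not kernel code). The helper names are made up for illustration, and the
HZ and display refresh values passed in are assumptions, since neither is
stated in this thread. It only recomputes the retry rates and estimates how
large a one-jiffy sleep is relative to a frame budget.

```c
#include <assert.h>

/* Percentage of page faults that took the retry path (hypothetical helper). */
static inline double retry_rate_percent(unsigned long retries,
					unsigned long faults)
{
	assert(faults > 0);
	return 100.0 * (double)retries / (double)faults;
}

/*
 * Length of one jiffy in milliseconds for a given CONFIG_HZ.
 * The HZ value is an assumption supplied by the caller.
 */
static inline double jiffy_ms(unsigned int hz)
{
	assert(hz > 0);
	return 1000.0 / (double)hz;
}

/*
 * Share of one display frame taken by a sleep of about one jiffy,
 * for an assumed display refresh rate in Hz.
 */
static inline double frame_share(unsigned int hz, unsigned int refresh_hz)
{
	assert(refresh_hz > 0);
	return jiffy_ms(hz) / (1000.0 / (double)refresh_hz);
}
```

With the quoted counts, retry_rate_percent(1528, 647132) comes out near 0.24
and retry_rate_percent(1372, 109930201) near 0.0012, in line with the rounded
figures above. Under an assumed HZ=250 and a 60 Hz display, a sleep of about
one jiffy is 4 ms, roughly a quarter of a 16.7 ms frame.
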
It might not be a problem for throughput, but for real-time and tail
latency, this hurts. For example, it might increase the number of dropped
UI frames, which is an important metric for evaluating performance :-)

BTW, I wonder whether Ying's previous proposal, moving swapcache_prepare()
after swap_read_folio(), would help reduce that number further?

Thanks
Barry