On 2024/2/20 13:32, Kairui Song wrote:
> On Tue, Feb 20, 2024 at 12:49 PM Chengming Zhou <zhouchengming@xxxxxxxxxxxxx> wrote:
>>
>> On 2024/2/20 06:10, Barry Song wrote:
>>> On Mon, Feb 19, 2024 at 9:21 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>>>>
>>>> From: Kairui Song <kasong@xxxxxxxxxxx>
>>>>
>>>> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
>>>> swapin the same entry at the same time, they get different pages (A, B).
>>>> Before one thread (T0) finishes the swapin and installs page (A)
>>>> to the PTE, another thread (T1) could finish swapin of page (B),
>>>> swap_free the entry, then swap out the possibly modified page
>>>> reusing the same entry. It breaks the pte_same check in (T0) because
>>>> the PTE value is unchanged, causing an ABA problem. Thread (T0) will
>>>> install a stale page (A) into the PTE and cause data corruption.
>>>>
>>>> One possible callstack is like this:
>>>>
>>>> CPU0                                 CPU1
>>>> ----                                 ----
>>>> do_swap_page()                       do_swap_page() with same entry
>>>> <direct swapin path>                 <direct swapin path>
>>>> <alloc page A>                       <alloc page B>
>>>> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
>>>> <slow on later locks or interrupt>   <finished swapin first>
>>>> ..                                   set_pte_at()
>>>>                                      swap_free() <- entry is free
>>>>                                      <write to page B, now page A stale>
>>>>                                      <swap out page B to same swap entry>
>>>> pte_same() <- Check pass, PTE seems
>>>>               unchanged, but page A
>>>>               is stale!
>>>> swap_free() <- page B content lost!
>>>> set_pte_at() <- stale page A installed!
>>>>
>>>> And besides, for ZRAM, swap_free() allows the swap device to discard
>>>> the entry content, so even if page (B) is not modified, if
>>>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
>>>> it may also cause data loss.
>>>>
>>>> To fix this, reuse swapcache_prepare(), which pins the swap entry with
>>>> the cache flag, allowing only one thread to swap it in and preventing
>>>> any parallel code from putting the entry in the cache. Release the pin
>>>> after the PT is unlocked.
>>>>
>>>> Racers just loop and wait since it's a rare and very short event.
>>>> A schedule_timeout_uninterruptible(1) call is added to avoid repeated
>>>> page faults wasting too much CPU, causing livelock or adding too much
>>>> noise to perf statistics. A similar livelock issue was described in
>>>> commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead").
>>>>
>>>> Reproducer:
>>>>
>>>> This race issue can be triggered easily using a well constructed
>>>> reproducer and patched brd (with a delay in the read path) [1]:
>>>>
>>>> With the latest 6.8 mainline, race-caused data loss can be observed easily:
>>>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>>>> Polulating 32MB of memory region...
>>>> Keep swapping out...
>>>> Starting round 0...
>>>> Spawning 65536 workers...
>>>> 32746 workers spawned, wait for done...
>>>> Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>>>> Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>>>> Round 0 Failed, 15 data loss!
>>>>
>>>> This reproducer spawns multiple threads sharing the same memory region
>>>> using a small swap device. Every two threads update mapped pages one by
>>>> one in opposite directions, trying to create a race, with one dedicated
>>>> thread that keeps swapping the data out using madvise.
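
A minimal sketch of that racing pattern, for readers who don't want to
open [1]. This is illustrative only, not the actual test-thread-swap-race.c:
region size, thread count, the per-round verification pass, and the
patched-brd read delay are all simplified away here.

#define _GNU_SOURCE     /* MADV_PAGEOUT, Linux 5.4+ */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (32UL << 20)        /* 32MB anonymous region */
#define PAGE_SZ     4096UL
#define NR_PAIRS    4

static char *region;

/* Dedicated thread: keep pushing the region out to the swap device. */
static void *pageout_worker(void *arg)
{
        (void)arg;
        for (;;) {
                madvise(region, REGION_SIZE, MADV_PAGEOUT);
                usleep(1000);
        }
        return NULL;
}

/* One of each pair walks low -> high, faulting pages back in. */
static void *update_forward(void *arg)
{
        (void)arg;
        for (size_t off = 0; off < REGION_SIZE; off += PAGE_SZ)
                region[off]++;
        return NULL;
}

/* The other walks high -> low, so the pair collides somewhere. */
static void *update_backward(void *arg)
{
        (void)arg;
        for (size_t off = REGION_SIZE; off > 0; off -= PAGE_SZ)
                region[off - PAGE_SZ]++;
        return NULL;
}

int main(void)
{
        pthread_t pageout, workers[NR_PAIRS * 2];

        region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(region, 0, REGION_SIZE);

        pthread_create(&pageout, NULL, pageout_worker, NULL);
        for (int i = 0; i < NR_PAIRS; i++) {
                pthread_create(&workers[i * 2], NULL, update_forward, NULL);
                pthread_create(&workers[i * 2 + 1], NULL, update_backward, NULL);
        }
        for (int i = 0; i < NR_PAIRS * 2; i++)
                pthread_join(workers[i], NULL);

        /* The real reproducer then verifies every page saw the expected
         * updates; missing increments indicate the stale-page race. */
        return 0;
}

With a small enough swap device, pairs of faulting threads keep hitting
the SWP_SYNCHRONOUS_IO swapin path on the same entries concurrently.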
>>>>
>>>> The reproducer hits the race about once every 5 minutes, so the race
>>>> should be entirely possible in production.
>>>>
>>>> After this patch, I ran the reproducer for a few hundred rounds and
>>>> no data loss was observed.
>>>>
>>>> Performance overhead is minimal; microbenchmark of swapping in 10G
>>>> from 32G zram:
>>>>
>>>> Before: 10934698 us
>>>> After:  11157121 us
>>>> Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
>>>>
>>>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
>>>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
>>>> Reported-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
>>>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>>>> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>
>>>> ---
>>>> V3: https://lore.kernel.org/all/20240216095105.14502-1-ryncsn@xxxxxxxxx/
>>>> Update from V3:
>>>> - Use schedule_timeout_uninterruptible(1) for now instead of schedule() to
>>>>   prevent the busy faulting task from holding the CPU and livelocking [Huang, Ying]
>>>>
>>>> V2: https://lore.kernel.org/all/20240206182559.32264-1-ryncsn@xxxxxxxxx/
>>>> Update from V2:
>>>> - Add a schedule() if raced to prevent repeated page faults wasting CPU
>>>>   and adding noise to perf statistics.
>>>> - Use a bool to state the special case instead of reusing existing
>>>>   variables, fixing error handling [Minchan Kim].
>>>>
>>>> V1: https://lore.kernel.org/all/20240205110959.4021-1-ryncsn@xxxxxxxxx/
>>>> Update from V1:
>>>> - Add some words on the ZRAM case: it will discard swap content on swap_free,
>>>>   so the race window is a bit different but the cure is the same. [Barry Song]
>>>> - Update comments to make them cleaner [Huang, Ying]
>>>> - Add a function placeholder to fix the CONFIG_SWAP=n build [SeongJae Park]
>>>> - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO
>>>>   instead of "direct swapin path" [Yu Zhao]
>>>> - Update commit message.
>>>> - Collect Review and Acks.
>>>>
>>>>  include/linux/swap.h |  5 +++++
>>>>  mm/memory.c          | 20 ++++++++++++++++++++
>>>>  mm/swap.h            |  5 +++++
>>>>  mm/swapfile.c        | 13 +++++++++++++
>>>>  4 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 4db00ddad261..8d28f6091a32 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
>>>>         return 0;
>>>>  }
>>>>
>>>> +static inline int swapcache_prepare(swp_entry_t swp)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static inline void swap_free(swp_entry_t swp)
>>>>  {
>>>>  }
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7e1f4849463a..a99f5e7be9a5 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3799,6 +3799,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         struct page *page;
>>>>         struct swap_info_struct *si = NULL;
>>>>         rmap_t rmap_flags = RMAP_NONE;
>>>> +       bool need_clear_cache = false;
>>>>         bool exclusive = false;
>>>>         swp_entry_t entry;
>>>>         pte_t pte;
>>>> @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>         if (!folio) {
>>>>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>>                     __swap_count(entry) == 1) {
>>>> +                       /*
>>>> +                        * Prevent parallel swapin from proceeding with
>>>> +                        * the cache flag. Otherwise, another thread may
>>>> +                        * finish swapin first, free the entry, and swapout
>>>> +                        * reusing the same entry. It's undetectable as
>>>> +                        * pte_same() returns true due to entry reuse.
>>>> +                        */
>>>> +                       if (swapcache_prepare(entry)) {
>>>> +                               /* Relax a bit to prevent rapid repeated page faults */
>>>> +                               schedule_timeout_uninterruptible(1);
>>>
>>> Not an ideal model. Imagine two tasks:
>>>
>>> task A - low priority, running on a LITTLE core
>>> task B - high priority with real-time requirements such as audio or
>>> graphics, running on a big core
>>>
>>> The original code will make B win even if it arrives a bit later than A,
>>> as its CPU is much faster at finishing swap_read_folio, for example from
>>> zRAM. Task B's swap-in can finish very soon.
>>>
>>> With the patch, B will wait a tick and its real-time performance will be
>>> negatively affected from time to time once low-priority and high-priority
>>> tasks fault in the same PTE and the high-priority task is a bit later than
>>> the low-priority one. This is a kind of priority inversion.
>>>
>>> When we support large folio swap-in, things can get worse. For example,
>>> to swap in 16 or even more pages in one do_swap_page, the chance of
>>> task A and task B landing in the same range of 16 PTEs will increase,
>>> even though they are not touching the same PTE.
>>>
>>> Please consider this not a blocker for this patch, but I will put the
>>> problem on my list and run some real tests on Android phones later.
>>
>> Good point. I'm late to the discussion, but I'm wondering: why not take an
>> extra reference on the swap entry instead of swapcache_prepare()? Then the
>> faster thread would succeed but couldn't free the swap entry. Later, the
>> slower thread would find the changed pte value and fail, and free the swap
>> entry. Maybe I missed something?
>
> Hi, Chengming
>
> That was my initial approach, but I found a lot of problems with it.
> Increasing the swap count here may race with another swap free and end up
> bumping the count of an already-freed entry.
>
> That can be fixed with audits and new helpers, but there are many other
> potential issues too. One major problem is that after the count bump, raced
> swapin threads fall back to cached swapin. Pages in the swapcache can be
> swapped out without allocating a new entry, making the problem we were
> trying to resolve even more serious.

Thanks for your clarification! Right, there are many issues I just ignored...
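
For readers skimming the thread, here is the scheme that won out, reduced
to a generic userspace analogy. Illustrative only: in the kernel the pin
is the swap map's cache flag taken via swapcache_prepare(), and a losing
racer returns and re-takes the page fault rather than looping in place.

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for the per-entry cache-flag pin. */
static atomic_bool entry_pinned;

/* The winner gets the exclusive right to perform the swapin. */
static bool pin_entry(void)
{
        bool expected = false;

        return atomic_compare_exchange_strong(&entry_pinned, &expected, true);
}

static void unpin_entry(void)
{
        atomic_store(&entry_pinned, false);
}

static void fault_path(void)
{
        while (!pin_entry()) {
                /* Raced with a parallel swapin: relax briefly and retry,
                 * mirroring schedule_timeout_uninterruptible(1). */
                usleep(1000);
        }

        /* ...allocate a page, read the data, pte_same() check, install... */

        unpin_entry();
}

int main(void)
{
        fault_path();
        return 0;
}

The wait in the retry loop is exactly where Barry's priority-inversion
concern bites: a high-priority loser has to sit out the tick that a
low-priority winner is holding.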