Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path

Vinayak Menon <vinmenon@xxxxxxxxxxxxxx> · Tue, 3 Sep 2019 11:43:16 +0530

Hi Michal,

Thanks for reviewing this.


On 9/2/2019 6:51 PM, Michal Hocko wrote:
> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
>> The following race is observed due to which a processes faulting
>> on a swap entry, finds the page neither in swapcache nor swap. This
>> causes zram to give a zero filled page that gets mapped to the
>> process, resulting in a user space crash later.
>>
>> Consider parent and child processes Pa and Pb sharing the same swap
>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>
>> Pa                                       Pb
>>
>> fault on VA                              fault on VA
>> do_swap_page                             do_swap_page
>> lookup_swap_cache fails                  lookup_swap_cache fails
>>                                          Pb scheduled out
>> swapin_readahead (deletes zram entry)
>> swap_free (makes swap_count 1)
>>                                          Pb scheduled in
>>                                          swap_readpage (swap_count == 1)
>>                                          Takes SWP_SYNCHRONOUS_IO path
>>                                          zram enrty absent
>>                                          zram gives a zero filled page
> This sounds like a zram issue, right? Why is a generic swap path changed
> then?


I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.

This is because zram avoids lazy swap slot freeing by implementing gendisk->fops->swap_slot_free_notify

and swap_slot_free_notify deletes the zram entry because the page is in swapcache.

The issue is that Pb attempted a swapcache lookup before Pa brought the page to swapcache, and failed. If

Pb had taken the swapin_readahead path, __read_swap_cache_async would have performed a second lookup

and found the page in swapcache. The issue here is that due to the lookup failure and swap_count being 1,

it takes the  SWP_SYNCHRONOUS_IO path and does a direct read which is bound to fail. So it seems to me as

a problem in the way SWP_SYNCHRONOUS_IO is handled in do_swap_page, and not a problem with zram.

Any swap device that sets SWP_SYNCHRONOUS_IO and implements swap_slot_free_notify can hit this bug.

do_swap_page first brings in the page to swapcache and then decrements the swap_count, and SWP_SYNCHRONOUS_IO

code in do_swap_page performs the swapcache and swap_count checks in the same order. Due to thread preemption

described in the sequence above, it can happen that the SWP_SYNCHRONOUS_IO path fails in swapcache check, but sees

the swap_count decremented later, thus missing a valid swapcache entry.

I have not tested, but the following patch may also fix the issue.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
        return 0;
 }

+static bool __swap_has_cache(swp_entry_t entry)
+{
+       return 0;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
        return 0;

diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                struct swap_info_struct *si = swp_swap_info(entry);

                if (si->flags & SWP_SYNCHRONOUS_IO &&
-                               __swap_count(entry) == 1) {
+                               __swap_count(entry) == 1 &&
+                               !__swap_has_cache(entry)) {
                        /* skip swapcache */
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
                                                        vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
        return count;
 }

+bool __swap_has_cache(swp_entry_t entry)
+{
+       struct swap_info_struct *si;
+       pgoff_t offset = swp_offset(entry);
+       bool has_cache  = false;
+
+       si = get_swap_device(entry);
+       if (si) {
+               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+               put_swap_device(si);
+       }
+       return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        int count = 0;


>
>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>> with the order in which page is added to swap cache and swap count is
>> decremented in do_swap_page. In the race case above, this will let Pb take
>> the readahead path and thus pick the proper page from swapcache.
>>
>> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>