Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path

Vinayak Menon <vinmenon@xxxxxxxxxxxxxx> · Tue, 10 Sep 2019 13:52:36 +0530

Hi Minchan,


On 9/10/2019 4:56 AM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>> The following race is observed due to which a processes faulting
>> on a swap entry, finds the page neither in swapcache nor swap. This
>> causes zram to give a zero filled page that gets mapped to the
>> process, resulting in a user space crash later.
>>
>> Consider parent and child processes Pa and Pb sharing the same swap
>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>
>> Pa                                       Pb
>>
>> fault on VA                              fault on VA
>> do_swap_page                             do_swap_page
>> lookup_swap_cache fails                  lookup_swap_cache fails
>>                                          Pb scheduled out
>> swapin_readahead (deletes zram entry)
>> swap_free (makes swap_count 1)
>>                                          Pb scheduled in
>>                                          swap_readpage (swap_count == 1)
>>                                          Takes SWP_SYNCHRONOUS_IO path
>>                                          zram enrty absent
>>                                          zram gives a zero filled page
>>
>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>> with the order in which page is added to swap cache and swap count is
>> decremented in do_swap_page. In the race case above, this will let Pb take
>> the readahead path and thus pick the proper page from swapcache.
> Thanks for the report, Vinayak.
>
> It's a zram specific issue because it deallocates zram block
> unconditionally once read IO is done. The expectation was that dirty
> page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
> any more so I want to resolve the issue in zram specific code, not
> general one.


Thanks for comments Minchan.

Trying to understand your comment better.  With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify will

make sure that it deletes the entry only if the page is in swapcache. Even in the current issue case, a valid

entry is present in the swapcache at the time of issue (brought in by Pa). Its just that Pb missed it due to the

race and tried to read again from zram. So thinking whether it is an issue with zram deleting the entry, or

SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen where zram

entry is deleted unconditionally, with some process yet to reference the slot and page is not in swapcache.


>
> A idea in my mind is swap_slot_free_notify should check the slot
> reference counter and if it's higher than 1, it shouldn't free the
> slot until. What do you think about?

It seems fine to me except for the fact that it will delay zram entry deletion for shared slots, which

can be significant sometimes. Also, should we fix this path as the issue is with SWP_SYNCHRONOUS_IO missing

a valid swapcache entry ?

Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
        return 0;
 }

+static bool __swap_has_cache(swp_entry_t entry)
+{
+       return 0;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
        return 0;

diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                struct swap_info_struct *si = swp_swap_info(entry);

                if (si->flags & SWP_SYNCHRONOUS_IO &&
-                               __swap_count(entry) == 1) {
+                               __swap_count(entry) == 1 &&
+                               !__swap_has_cache(entry)) {
                        /* skip swapcache */
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
                                                        vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
        return count;
 }

+bool __swap_has_cache(swp_entry_t entry)
+{
+       struct swap_info_struct *si;
+       pgoff_t offset = swp_offset(entry);
+       bool has_cache  = false;
+
+       si = get_swap_device(entry);
+       if (si) {
+               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+               put_swap_device(si);
+       }
+       return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        int count = 0;


>
>> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
>> ---
>>  mm/memory.c | 21 ++++++++++++++++-----
>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..22643aa 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	struct page *page = NULL, *swapcache;
>>  	struct mem_cgroup *memcg;
>>  	swp_entry_t entry;
>> +	struct swap_info_struct *si;
>> +	bool skip_swapcache = false;
>>  	pte_t pte;
>>  	int locked;
>>  	int exclusive = 0;
>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  
>>  
>>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>> +
>> +	/*
>> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
>> +	 * check is made, another process can populate the swapcache, delete
>> +	 * the swap entry and decrement the swap count. So decide on taking
>> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>> +	 * race described, the victim process will find a swap_count > 1
>> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>> +	 */
>> +	si = swp_swap_info(entry);
>> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>> +		skip_swapcache = true;
>> +
>>  	page = lookup_swap_cache(entry, vma, vmf->address);
>>  	swapcache = page;
>>  
>>  	if (!page) {
>> -		struct swap_info_struct *si = swp_swap_info(entry);
>> -
>> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
>> -				__swap_count(entry) == 1) {
>> -			/* skip swapcache */
>> +		if (skip_swapcache) {
>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>  							vmf->address);
>>  			if (page) {
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>