On 9/10/2019 11:21 PM, Minchan Kim wrote:
> On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
>> Hi Minchan,
>>
>> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>>>> The following race is observed, due to which a process faulting on a
>>>> swap entry finds the page neither in the swapcache nor in swap. This
>>>> causes zram to give a zero filled page that gets mapped to the
>>>> process, resulting in a user space crash later.
>>>>
>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>
>>>> Pa                                      Pb
>>>>
>>>> fault on VA                             fault on VA
>>>> do_swap_page                            do_swap_page
>>>> lookup_swap_cache fails                 lookup_swap_cache fails
>>>>                                         Pb scheduled out
>>>> swapin_readahead (deletes zram entry)
>>>> swap_free (makes swap_count 1)
>>>>                                         Pb scheduled in
>>>>                                         swap_readpage (swap_count == 1)
>>>>                                         Takes SWP_SYNCHRONOUS_IO path
>>>>                                         zram entry absent
>>>>                                         zram gives a zero filled page
>>>>
>>>> Fix this by reading the swap_count before lookup_swap_cache, which
>>>> conforms with the order in which the page is added to the swap cache
>>>> and the swap count is decremented in do_swap_page. In the race case
>>>> above, this will let Pb take the readahead path and thus pick the
>>>> proper page from the swapcache.
>>>
>>> Thanks for the report, Vinayak.
>>>
>>> It's a zram specific issue because zram deallocates the block
>>> unconditionally once the read IO is done. The expectation was that the
>>> dirty page stays in the swap cache, but with SWP_SYNCHRONOUS_IO that's
>>> no longer true, so I want to resolve the issue in zram specific code,
>>> not the general one.
>>
>> Thanks for the comments, Minchan.
>>
>> Trying to understand your comment better. Even with SWP_SYNCHRONOUS_IO,
>> swap_slot_free_notify makes sure that it deletes the entry only if the
>> page is in the swapcache. Even in the case at hand, a valid entry is
>> present in the swapcache at the time of the issue (brought in by Pa).
>> It's just that Pb missed it due to the race and tried to read again
>> from zram. So I am wondering whether this is an issue with zram
>> deleting the entry, or with SWP_SYNCHRONOUS_IO failing to find the
>> valid swapcache entry. There isn't actually a case seen where the zram
>> entry is deleted unconditionally while some process is yet to reference
>> the slot and the page is not in the swapcache.
>>
>>> An idea in my mind is that swap_slot_free_notify should check the slot
>>> reference counter, and if it's higher than 1, it shouldn't free the
>>> slot until the last reference is dropped. What do you think?
>>
>> It seems fine to me, except for the fact that it will delay zram entry
>> deletion for shared slots, which can sometimes be significant. Also,
>> should we fix this path anyway, as the issue is with SWP_SYNCHRONOUS_IO
>> missing a valid swapcache entry?
>
> It's always a trade-off between memory and performance, since it could
> hit in the swap cache. If it's a shared page, it's likely to hit the
> cache next time, so we could get a performance benefit.
>
> Actually, swap_slot_free_notify is a layering violation, so I wanted to
> replace it with a discard hint in the long run, and hence want to go in
> that direction.

Okay, got it.
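Just to confirm I follow the suggestion: the check would go into
swap_slot_free_notify() in mm/page_io.c, roughly like the sketch below
(untested, and the exact placement is my assumption):

static void swap_slot_free_notify(struct page *page)
{
        struct swap_info_struct *sis;
        struct gendisk *disk;
        swp_entry_t entry;

        if (unlikely(!PageSwapCache(page)))
                return;

        sis = page_swap_info(page);
        if (!(sis->flags & SWP_BLKDEV))
                return;

        disk = sis->bdev->bd_disk;
        entry.val = page_private(page);

        /*
         * Don't free the backing slot while the entry is referenced by
         * more than one user; a later fault may still need to read the
         * data back from the device.
         */
        if (disk->fops->swap_slot_free_notify &&
                        __swap_count(entry) == 1) {
                SetPageDirty(page);
                disk->fops->swap_slot_free_notify(sis->bdev,
                                                swp_offset(entry));
        }
}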
>> Can the swapcache check be done like below, before taking the
>> SWP_SYNCHRONOUS_IO path, as an alternative?
>
> With your approach, what prevents the scenario below?
>
> A                                       B
>
>                                         do_swap_page
>                                         SWP_SYNCHRONOUS_IO && __swap_count == 1

As shrink_page_list is picking the page from the LRU while B is trying to
read it from swap, I assume someone had read the page from swap prior to
B, when its swap_count was, say, 2 (for it to be reclaimable by
shrink_page_list now). If so, wouldn't that read itself have deleted the
zram entry? And the read page would then be in the swapcache and dirty?
In that case, with the SWAP_HAS_CACHE check in the patch, B would take
the readahead path, and shrink_page_list would attempt a pageout to zram
for the dirty page?

> shrink_page_list
> add_to_swap
> swap_count = 2
>
> ..
> ..
>                                         do_swap_page
>                                         swap_read
>                                         swap_slot_free_notify
>                                           zram's slot will be removed
>                                         page = alloc_page_vma
>                                         swap_readpage <-- read zero
>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 063c0c1..a5ca05f 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>>  extern sector_t swapdev_block(int, pgoff_t);
>>  extern int page_swapcount(struct page *);
>>  extern int __swap_count(swp_entry_t entry);
>> +extern bool __swap_has_cache(swp_entry_t entry);
>>  extern int __swp_swapcount(swp_entry_t entry);
>>  extern int swp_swapcount(swp_entry_t entry);
>>  extern struct swap_info_struct *page_swap_info(struct page *);
>> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>>  	return 0;
>>  }
>>  
>> +static inline bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>>  static inline int __swp_swapcount(swp_entry_t entry)
>>  {
>>  	return 0;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..a13511f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		struct swap_info_struct *si = swp_swap_info(entry);
>>  
>>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
>> -				__swap_count(entry) == 1) {
>> +				__swap_count(entry) == 1 &&
>> +				!__swap_has_cache(entry)) {
>>  			/* skip swapcache */
>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>  					vmf->address);
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 80445f4..2a1554a8 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>>  	return count;
>>  }
>>  
>> +bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	pgoff_t offset = swp_offset(entry);
>> +	bool has_cache = false;
>> +
>> +	si = get_swap_device(entry);
>> +	if (si) {
>> +		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
>> +		put_swap_device(si);
>> +	}
>> +	return has_cache;
>> +}
>> +
>>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>>  {
>>  	int count = 0;
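For reference, the reason __swap_has_cache() above can test the bit
directly is that the swap_map byte encodes the cache flag alongside the
usage count (from include/linux/swap.h):

#define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
#define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
#define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */

So the check is just a flag test on the slot's map byte, with no need to
look up the swap cache radix tree.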
>>>> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
>>>> ---
>>>>  mm/memory.c | 21 ++++++++++++++++-----
>>>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index e0c232f..22643aa 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	struct page *page = NULL, *swapcache;
>>>>  	struct mem_cgroup *memcg;
>>>>  	swp_entry_t entry;
>>>> +	struct swap_info_struct *si;
>>>> +	bool skip_swapcache = false;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>>  	int exclusive = 0;
>>>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  
>>>>  
>>>>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>>> +
>>>> +	/*
>>>> +	 * lookup_swap_cache below can fail, and before the SWP_SYNCHRONOUS_IO
>>>> +	 * check is made, another process can populate the swapcache, delete
>>>> +	 * the swap entry and decrement the swap count. So decide on taking
>>>> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>>>> +	 * race described, the victim process will find a swap_count > 1
>>>> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>>>> +	 */
>>>> +	si = swp_swap_info(entry);
>>>> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>>>> +		skip_swapcache = true;
>>>> +
>>>>  	page = lookup_swap_cache(entry, vma, vmf->address);
>>>>  	swapcache = page;
>>>>  
>>>>  	if (!page) {
>>>> -		struct swap_info_struct *si = swp_swap_info(entry);
>>>> -
>>>> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
>>>> -				__swap_count(entry) == 1) {
>>>> -			/* skip swapcache */
>>>> +		if (skip_swapcache) {
>>>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>>>  					vmf->address);
>>>>  			if (page) {
>>>> --
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
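To spell out the ordering argument in the commit message, a condensed
sketch of the patched flow (not the complete do_swap_page(), and the
ordering reasoning is my reading of it):

/* Pa, the process doing the real swapin: */
page = swapin_readahead(...);   /* adds the page to the swap cache */
...
swap_free(entry);               /* swap_count drops only afterwards */

/* Pb, with this patch, samples the count before its lookup: */
si = swp_swap_info(entry);
if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
        skip_swapcache = true;
page = lookup_swap_cache(entry, vma, vmf->address);

If Pb sees swap_count > 1, it takes the readahead path and finds Pa's
page in the swap cache. If Pb sees swap_count == 1, Pa's swap_free() has
already run, which means Pa's swap cache insertion happened before it,
so the lookup finds the page and the SWP_SYNCHRONOUS_IO shortcut is not
taken against an already freed zram slot.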