On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
> Hi Minchan,
>
> On 9/10/2019 4:56 AM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> >> The following race is observed, due to which a process faulting
> >> on a swap entry finds the page neither in the swapcache nor in
> >> swap. This causes zram to hand out a zero-filled page that gets
> >> mapped to the process, resulting in a user space crash later.
> >>
> >> Consider parent and child processes Pa and Pb sharing the same
> >> swap slot with swap_count 2. Swap is on zram with
> >> SWP_SYNCHRONOUS_IO set. Virtual address 'VA' of Pa and Pb points
> >> to the shared swap entry.
> >>
> >> Pa                                     Pb
> >>
> >> fault on VA                            fault on VA
> >> do_swap_page                           do_swap_page
> >> lookup_swap_cache fails                lookup_swap_cache fails
> >>                                        Pb scheduled out
> >> swapin_readahead (deletes zram entry)
> >> swap_free (makes swap_count 1)
> >>                                        Pb scheduled in
> >>                                        swap_readpage (swap_count == 1)
> >>                                        Takes SWP_SYNCHRONOUS_IO path
> >>                                        zram entry absent
> >>                                        zram gives a zero-filled page
> >>
> >> Fix this by reading the swap_count before lookup_swap_cache, which
> >> conforms with the order in which the page is added to the swap
> >> cache and the swap count is decremented in do_swap_page. In the
> >> race case above, this will let Pb take the readahead path and thus
> >> pick the proper page from the swapcache.
> > Thanks for the report, Vinayak.
> >
> > It's a zram-specific issue, because zram deallocates the block
> > unconditionally once read IO is done. The expectation was that the
> > dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's
> > no longer true, so I want to resolve the issue in zram-specific
> > code, not in the general one.
>
> Thanks for the comments, Minchan.
>
> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO
> also, swap_slot_free_notify will make sure that it deletes the entry
> only if the page is in the swapcache. Even in the current issue case,
> a valid entry is present in the swapcache at the time of the issue
> (brought in by Pa). It's just that Pb missed it due to the race and
> tried to read again from zram. So I am wondering whether it is an
> issue of zram deleting the entry, or of SWP_SYNCHRONOUS_IO failing
> to find the valid swapcache entry. We haven't actually seen a case
> where a zram entry is deleted unconditionally while some process has
> yet to reference the slot and the page is not in the swapcache.
>
> > An idea in my mind is that swap_slot_free_notify should check the
> > slot reference counter and, if it's higher than 1, it shouldn't
> > free the slot until the count drops. What do you think about it?
>
> It seems fine to me, except for the fact that it will delay zram
> entry deletion for shared slots, which can be significant sometimes.

It's always a trade-off between memory and performance, since the page
could hit in the swap cache. If it's a shared page, it's likely to hit
the cache next time, so we could get a performance benefit.

Actually, swap_slot_free_notify is a layering violation, so in the
long run I want to replace it with a discard hint and would rather go
in that direction.

> Also, should we fix this path, as the issue is with
> SWP_SYNCHRONOUS_IO missing a valid swapcache entry?
>
> Can the swapcache check be done like below, before taking the
> SWP_SYNCHRONOUS_IO path, as an alternative?

With your approach, what prevents the scenario below?

 A                                        B

 do_swap_page
 SWP_SYNCHRONOUS_IO && __swap_count == 1
                                          shrink_page_list
                                          add_to_swap
                                            swap_count = 2
 ..
 ..
                                          do_swap_page
                                          swap_read
                                          swap_slot_free_notify
                                            zram's slot will be removed
 page = alloc_page_vma
 swap_readpage <-- read zero
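For concreteness, the slot-refcount idea floated earlier in the thread
would arguably close both this window and the original Pa/Pb race, by
keeping the zram copy alive while other references to the slot remain.
A minimal sketch against the v5.3-era helper in mm/page_io.c might
look like the following; the bail-out placement and the use of
__swap_count() here are assumptions, not a tested patch:

/* mm/page_io.c -- sketch only, based on the v5.3-era helper */
static void swap_slot_free_notify(struct page *page)
{
	struct swap_info_struct *sis;
	struct gendisk *disk;
	swp_entry_t entry;

	if (unlikely(!PageSwapCache(page)))
		return;

	sis = page_swap_info(page);
	if (!(sis->flags & SWP_BLKDEV))
		return;

	entry.val = page_private(page);

	/*
	 * Assumed change: while other references to the slot remain,
	 * keep the backing (zram) copy so a concurrent faulter that
	 * missed the swapcache can still read valid data.
	 */
	if (__swap_count(entry) > 1)
		return;

	disk = sis->bdev->bd_disk;
	if (disk->fops->swap_slot_free_notify) {
		SetPageDirty(page);
		disk->fops->swap_slot_free_notify(sis->bdev,
						  swp_offset(entry));
	}
}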
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1..a5ca05f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
>  extern int __swap_count(swp_entry_t entry);
> +extern bool __swap_has_cache(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>  	return 0;
>  }
>
> +static inline bool __swap_has_cache(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  static inline int __swp_swapcount(swp_entry_t entry)
>  {
>  	return 0;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..a13511f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		struct swap_info_struct *si = swp_swap_info(entry);
>
>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -				__swap_count(entry) == 1) {
> +				__swap_count(entry) == 1 &&
> +				!__swap_has_cache(entry)) {
>  			/* skip swapcache */
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 80445f4..2a1554a8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>  	return count;
>  }
>
> +bool __swap_has_cache(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	pgoff_t offset = swp_offset(entry);
> +	bool has_cache = false;
> +
> +	si = get_swap_device(entry);
> +	if (si) {
> +		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
> +		put_swap_device(si);
> +	}
> +	return has_cache;
> +}
> +
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>  {
>  	int count = 0;
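To make the window in Minchan's scenario concrete, the skip-swapcache
branch with the proposed check folded in would look roughly like the
sketch below. The helper name swapin_skip_swapcache is hypothetical
and locking/error handling are omitted; the unlocked snapshots marked
in the comment are where the scenario above slips in:

/* Condensed from the proposed do_swap_page() change; illustrative only. */
static struct page *swapin_skip_swapcache(struct vm_fault *vmf,
					  swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);
	struct page *page = NULL;

	if (si->flags & SWP_SYNCHRONOUS_IO &&
	    __swap_count(entry) == 1 &&
	    !__swap_has_cache(entry)) {
		/*
		 * Both tests above are unlocked snapshots: nothing stops
		 * another task from adding the page to the swapcache
		 * (setting SWAP_HAS_CACHE), faulting it in, and letting
		 * swap_slot_free_notify() delete the zram slot before
		 * the read below runs, which then returns zeroes.
		 */
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vmf->vma,
				      vmf->address);
		if (page) {
			__SetPageLocked(page);
			__SetPageSwapBacked(page);
			set_page_private(page, entry.val);
			lru_cache_add_anon(page);
			swap_readpage(page, true);
		}
	}
	return page;
}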
>
>
> >> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
> >> ---
> >>  mm/memory.c | 21 ++++++++++++++++-----
> >>  1 file changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0c232f..22643aa 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  	struct page *page = NULL, *swapcache;
> >>  	struct mem_cgroup *memcg;
> >>  	swp_entry_t entry;
> >> +	struct swap_info_struct *si;
> >> +	bool skip_swapcache = false;
> >>  	pte_t pte;
> >>  	int locked;
> >>  	int exclusive = 0;
> >> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>
> >>
> >>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> >> +
> >> +	/*
> >> +	 * lookup_swap_cache below can fail, and before the SWP_SYNCHRONOUS_IO
> >> +	 * check is made, another process can populate the swapcache, delete
> >> +	 * the swap entry and decrement the swap count. So decide on taking
> >> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> >> +	 * race described, the victim process will find a swap_count > 1
> >> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> >> +	 */
> >> +	si = swp_swap_info(entry);
> >> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> >> +		skip_swapcache = true;
> >> +
> >>  	page = lookup_swap_cache(entry, vma, vmf->address);
> >>  	swapcache = page;
> >>
> >>  	if (!page) {
> >> -		struct swap_info_struct *si = swp_swap_info(entry);
> >> -
> >> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
> >> -				__swap_count(entry) == 1) {
> >> -			/* skip swapcache */
> >> +		if (skip_swapcache) {
> >>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> >>  							vmf->address);
> >>  			if (page) {
> >> --
> >> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >> member of the Code Aurora Forum, hosted by The Linux Foundation
> >>