On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
> Hi Minchan,
>
> On 9/10/2019 4:56 AM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> >> The following race is observed, due to which a process faulting
> >> on a swap entry finds the page neither in the swapcache nor in
> >> swap. This causes zram to hand out a zero-filled page that gets
> >> mapped to the process, resulting in a user space crash later.
> >>
> >> Consider parent and child processes Pa and Pb sharing the same
> >> swap slot with swap_count 2. Swap is on zram with
> >> SWP_SYNCHRONOUS_IO set. Virtual address 'VA' of Pa and Pb points
> >> to the shared swap entry.
> >>
> >> Pa                                     Pb
> >>
> >> fault on VA                            fault on VA
> >> do_swap_page                           do_swap_page
> >> lookup_swap_cache fails                lookup_swap_cache fails
> >>                                        Pb scheduled out
> >> swapin_readahead (deletes zram entry)
> >> swap_free (makes swap_count 1)
> >>                                        Pb scheduled in
> >>                                        swap_readpage (swap_count == 1)
> >>                                        Takes SWP_SYNCHRONOUS_IO path
> >>                                        zram entry absent
> >>                                        zram gives a zero-filled page
> >>
> >> Fix this by reading the swap_count before lookup_swap_cache, which
> >> conforms with the order in which the page is added to the swap
> >> cache and the swap count is decremented in do_swap_page. In the
> >> race case above, this will let Pb take the readahead path and thus
> >> pick the proper page from the swapcache.
> > Thanks for the report, Vinayak.
> >
> > It's a zram-specific issue, because zram deallocates the block
> > unconditionally once read IO is done. The expectation was that the
> > dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that's
> > no longer true, so I want to resolve the issue in zram-specific
> > code, not in the general one.
>
> Thanks for the comments, Minchan.
>
> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO
> also, swap_slot_free_notify will make sure that it deletes the entry
> only if the page is in the swapcache. Even in the current issue case,
> a valid entry is present in the swapcache at the time of the issue
> (brought in by Pa). It's just that Pb missed it due to the race and
> tried to read again from zram. So I am wondering whether it is an
> issue of zram deleting the entry, or of SWP_SYNCHRONOUS_IO failing
> to find the valid swapcache entry. We haven't actually seen a case
> where a zram entry is deleted unconditionally while some process has
> yet to reference the slot and the page is not in the swapcache.
>
> > An idea in my mind is that swap_slot_free_notify should check the
> > slot reference counter and, if it's higher than 1, it shouldn't
> > free the slot until the count drops. What do you think about it?
>
> It seems fine to me, except for the fact that it will delay zram
> entry deletion for shared slots, which can be significant sometimes.

It's always a trade-off between memory and performance, since the page
could hit in the swap cache. If it's a shared page, it's likely to hit
the cache next time, so we could get a performance benefit.

Actually, swap_slot_free_notify is a layering violation, so in the
long run I want to replace it with a discard hint and would rather go
in that direction.

> Also, should we fix this path, as the issue is with
> SWP_SYNCHRONOUS_IO missing a valid swapcache entry?
>
> Can the swapcache check be done like below, before taking the
> SWP_SYNCHRONOUS_IO path, as an alternative?

With your approach, what prevents the scenario below?

 A                                        B

 do_swap_page
 SWP_SYNCHRONOUS_IO && __swap_count == 1
                                          shrink_page_list
                                          add_to_swap
                                            swap_count = 2
 ..
 ..
                                          do_swap_page
                                          swap_read
                                          swap_slot_free_notify
                                            zram's slot will be removed
 page = alloc_page_vma
 swap_readpage <-- read zero
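For concreteness, the slot-refcount idea floated earlier in the thread
would arguably close both this window and the original Pa/Pb race, by
keeping the zram copy alive while other references to the slot remain.
A minimal sketch against the v5.3-era helper in mm/page_io.c might
look like the following; the bail-out placement and the use of
__swap_count() here are assumptions, not a tested patch:

/* mm/page_io.c -- sketch only, based on the v5.3-era helper */
static void swap_slot_free_notify(struct page *page)
{
	struct swap_info_struct *sis;
	struct gendisk *disk;
	swp_entry_t entry;

	if (unlikely(!PageSwapCache(page)))
		return;

	sis = page_swap_info(page);
	if (!(sis->flags & SWP_BLKDEV))
		return;

	entry.val = page_private(page);

	/*
	 * Assumed change: while other references to the slot remain,
	 * keep the backing (zram) copy so a concurrent faulter that
	 * missed the swapcache can still read valid data.
	 */
	if (__swap_count(entry) > 1)
		return;

	disk = sis->bdev->bd_disk;
	if (disk->fops->swap_slot_free_notify) {
		SetPageDirty(page);
		disk->fops->swap_slot_free_notify(sis->bdev,
						  swp_offset(entry));
	}
}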
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1..a5ca05f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
>  extern int __swap_count(swp_entry_t entry);
> +extern bool __swap_has_cache(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>  	return 0;
>  }
>
> +static inline bool __swap_has_cache(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  static inline int __swp_swapcount(swp_entry_t entry)
>  {
>  	return 0;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..a13511f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		struct swap_info_struct *si = swp_swap_info(entry);
>
>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -				__swap_count(entry) == 1) {
> +				__swap_count(entry) == 1 &&
> +				!__swap_has_cache(entry)) {
>  			/* skip swapcache */
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 80445f4..2a1554a8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>  	return count;
>  }
>
> +bool __swap_has_cache(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	pgoff_t offset = swp_offset(entry);
> +	bool has_cache = false;
> +
> +	si = get_swap_device(entry);
> +	if (si) {
> +		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
> +		put_swap_device(si);
> +	}
> +	return has_cache;
> +}
> +
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>  {
>  	int count = 0;
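To make the window in Minchan's scenario concrete, the skip-swapcache
branch with the proposed check folded in would look roughly like the
sketch below. The helper name swapin_skip_swapcache is hypothetical
and locking/error handling are omitted; the unlocked snapshots marked
in the comment are where the scenario above slips in:

/* Condensed from the proposed do_swap_page() change; illustrative only. */
static struct page *swapin_skip_swapcache(struct vm_fault *vmf,
					  swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);
	struct page *page = NULL;

	if (si->flags & SWP_SYNCHRONOUS_IO &&
	    __swap_count(entry) == 1 &&
	    !__swap_has_cache(entry)) {
		/*
		 * Both tests above are unlocked snapshots: nothing stops
		 * another task from adding the page to the swapcache
		 * (setting SWAP_HAS_CACHE), faulting it in, and letting
		 * swap_slot_free_notify() delete the zram slot before
		 * the read below runs, which then returns zeroes.
		 */
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vmf->vma,
				      vmf->address);
		if (page) {
			__SetPageLocked(page);
			__SetPageSwapBacked(page);
			set_page_private(page, entry.val);
			lru_cache_add_anon(page);
			swap_readpage(page, true);
		}
	}
	return page;
}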
>
>
> >> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
> >> ---
> >>  mm/memory.c | 21 ++++++++++++++++-----
> >>  1 file changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0c232f..22643aa 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  	struct page *page = NULL, *swapcache;
> >>  	struct mem_cgroup *memcg;
> >>  	swp_entry_t entry;
> >> +	struct swap_info_struct *si;
> >> +	bool skip_swapcache = false;
> >>  	pte_t pte;
> >>  	int locked;
> >>  	int exclusive = 0;
> >> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>
> >>
> >>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> >> +
> >> +	/*
> >> +	 * lookup_swap_cache below can fail, and before the SWP_SYNCHRONOUS_IO
> >> +	 * check is made, another process can populate the swapcache, delete
> >> +	 * the swap entry and decrement the swap count. So decide on taking
> >> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> >> +	 * race described, the victim process will find a swap_count > 1
> >> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> >> +	 */
> >> +	si = swp_swap_info(entry);
> >> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> >> +		skip_swapcache = true;
> >> +
> >>  	page = lookup_swap_cache(entry, vma, vmf->address);
> >>  	swapcache = page;
> >>
> >>  	if (!page) {
> >> -		struct swap_info_struct *si = swp_swap_info(entry);
> >> -
> >> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
> >> -				__swap_count(entry) == 1) {
> >> -			/* skip swapcache */
> >> +		if (skip_swapcache) {
> >>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> >>  							vmf->address);
> >>  			if (page) {
> >> --
> >> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >> member of the Code Aurora Forum, hosted by The Linux Foundation
> >>