On 9/10/2019 11:21 PM, Minchan Kim wrote:
> On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
>> Hi Minchan,
>>
>> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>>>> The following race is observed, due to which a process faulting on a
>>>> swap entry finds the page neither in the swapcache nor in swap. This
>>>> causes zram to give a zero filled page that gets mapped to the
>>>> process, resulting in a user space crash later.
>>>>
>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>
>>>> Pa                                      Pb
>>>>
>>>> fault on VA                             fault on VA
>>>> do_swap_page                            do_swap_page
>>>> lookup_swap_cache fails                 lookup_swap_cache fails
>>>>                                         Pb scheduled out
>>>> swapin_readahead (deletes zram entry)
>>>> swap_free (makes swap_count 1)
>>>>                                         Pb scheduled in
>>>>                                         swap_readpage (swap_count == 1)
>>>>                                         Takes SWP_SYNCHRONOUS_IO path
>>>>                                         zram entry absent
>>>>                                         zram gives a zero filled page
>>>>
>>>> Fix this by reading the swap_count before lookup_swap_cache, which
>>>> conforms with the order in which the page is added to the swap cache
>>>> and the swap count is decremented in do_swap_page. In the race case
>>>> above, this will let Pb take the readahead path and thus pick the
>>>> proper page from the swapcache.
>>>
>>> Thanks for the report, Vinayak.
>>>
>>> It's a zram specific issue because zram deallocates the block
>>> unconditionally once the read IO is done. The expectation was that the
>>> dirty page stays in the swap cache, but with SWP_SYNCHRONOUS_IO that's
>>> no longer true, so I want to resolve the issue in zram specific code,
>>> not the general one.
>>
>> Thanks for the comments, Minchan.
>>
>> Trying to understand your comment better. Even with SWP_SYNCHRONOUS_IO,
>> swap_slot_free_notify makes sure that it deletes the entry only if the
>> page is in the swapcache. Even in the case at hand, a valid entry is
>> present in the swapcache at the time of the issue (brought in by Pa).
>> It's just that Pb missed it due to the race and tried to read again
>> from zram. So I am wondering whether this is an issue with zram
>> deleting the entry, or with SWP_SYNCHRONOUS_IO failing to find the
>> valid swapcache entry. There isn't actually a case seen where the zram
>> entry is deleted unconditionally while some process is yet to reference
>> the slot and the page is not in the swapcache.
>>
>>> An idea in my mind is that swap_slot_free_notify should check the slot
>>> reference counter, and if it's higher than 1, it shouldn't free the
>>> slot until the last reference is dropped. What do you think?
>>
>> It seems fine to me, except for the fact that it will delay zram entry
>> deletion for shared slots, which can sometimes be significant. Also,
>> should we fix this path anyway, as the issue is with SWP_SYNCHRONOUS_IO
>> missing a valid swapcache entry?
>
> It's always a trade-off between memory and performance, since it could
> hit in the swap cache. If it's a shared page, it's likely to hit the
> cache next time, so we could get a performance benefit.
>
> Actually, swap_slot_free_notify is a layering violation, so I wanted to
> replace it with a discard hint in the long run, and hence want to go in
> that direction.

Okay, got it.
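Just to confirm I follow the suggestion: the check would go into
swap_slot_free_notify() in mm/page_io.c, roughly like the sketch below
(untested, and the exact placement is my assumption):

static void swap_slot_free_notify(struct page *page)
{
        struct swap_info_struct *sis;
        struct gendisk *disk;
        swp_entry_t entry;

        if (unlikely(!PageSwapCache(page)))
                return;

        sis = page_swap_info(page);
        if (!(sis->flags & SWP_BLKDEV))
                return;

        disk = sis->bdev->bd_disk;
        entry.val = page_private(page);

        /*
         * Don't free the backing slot while the entry is referenced by
         * more than one user; a later fault may still need to read the
         * data back from the device.
         */
        if (disk->fops->swap_slot_free_notify &&
                        __swap_count(entry) == 1) {
                SetPageDirty(page);
                disk->fops->swap_slot_free_notify(sis->bdev,
                                                swp_offset(entry));
        }
}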
>> Can the swapcache check be done like below, before taking the
>> SWP_SYNCHRONOUS_IO path, as an alternative?
>
> With your approach, what prevents the scenario below?
>
> A                                       B
>
>                                         do_swap_page
>                                         SWP_SYNCHRONOUS_IO && __swap_count == 1

As shrink_page_list is picking the page from the LRU while B is trying to
read it from swap, I assume someone had read the page from swap prior to
B, when its swap_count was, say, 2 (for it to be reclaimable by
shrink_page_list now). If so, wouldn't that read itself have deleted the
zram entry? And the read page would then be in the swapcache and dirty?
In that case, with the SWAP_HAS_CACHE check in the patch, B would take
the readahead path, and shrink_page_list would attempt a pageout to zram
for the dirty page?

> shrink_page_list
> add_to_swap
> swap_count = 2
>
> ..
> ..
>                                         do_swap_page
>                                         swap_read
>                                         swap_slot_free_notify
>                                           zram's slot will be removed
>                                         page = alloc_page_vma
>                                         swap_readpage <-- read zero
>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 063c0c1..a5ca05f 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>>  extern sector_t swapdev_block(int, pgoff_t);
>>  extern int page_swapcount(struct page *);
>>  extern int __swap_count(swp_entry_t entry);
>> +extern bool __swap_has_cache(swp_entry_t entry);
>>  extern int __swp_swapcount(swp_entry_t entry);
>>  extern int swp_swapcount(swp_entry_t entry);
>>  extern struct swap_info_struct *page_swap_info(struct page *);
>> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>>  	return 0;
>>  }
>>  
>> +static inline bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>>  static inline int __swp_swapcount(swp_entry_t entry)
>>  {
>>  	return 0;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..a13511f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		struct swap_info_struct *si = swp_swap_info(entry);
>>  
>>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
>> -				__swap_count(entry) == 1) {
>> +				__swap_count(entry) == 1 &&
>> +				!__swap_has_cache(entry)) {
>>  			/* skip swapcache */
>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>  					vmf->address);
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 80445f4..2a1554a8 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>>  	return count;
>>  }
>>  
>> +bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	pgoff_t offset = swp_offset(entry);
>> +	bool has_cache = false;
>> +
>> +	si = get_swap_device(entry);
>> +	if (si) {
>> +		has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
>> +		put_swap_device(si);
>> +	}
>> +	return has_cache;
>> +}
>> +
>>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>>  {
>>  	int count = 0;
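For reference, the reason __swap_has_cache() above can test the bit
directly is that the swap_map byte encodes the cache flag alongside the
usage count (from include/linux/swap.h):

#define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
#define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
#define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED	0x80	/* See swap_map continuation for full count */

So the check is just a flag test on the slot's map byte, with no need to
look up the swap cache radix tree.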
>>>> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
>>>> ---
>>>>  mm/memory.c | 21 ++++++++++++++++-----
>>>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index e0c232f..22643aa 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	struct page *page = NULL, *swapcache;
>>>>  	struct mem_cgroup *memcg;
>>>>  	swp_entry_t entry;
>>>> +	struct swap_info_struct *si;
>>>> +	bool skip_swapcache = false;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>>  	int exclusive = 0;
>>>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  
>>>>  
>>>>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>>> +
>>>> +	/*
>>>> +	 * lookup_swap_cache below can fail, and before the SWP_SYNCHRONOUS_IO
>>>> +	 * check is made, another process can populate the swapcache, delete
>>>> +	 * the swap entry and decrement the swap count. So decide on taking
>>>> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>>>> +	 * race described, the victim process will find a swap_count > 1
>>>> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>>>> +	 */
>>>> +	si = swp_swap_info(entry);
>>>> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>>>> +		skip_swapcache = true;
>>>> +
>>>>  	page = lookup_swap_cache(entry, vma, vmf->address);
>>>>  	swapcache = page;
>>>>  
>>>>  	if (!page) {
>>>> -		struct swap_info_struct *si = swp_swap_info(entry);
>>>> -
>>>> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
>>>> -				__swap_count(entry) == 1) {
>>>> -			/* skip swapcache */
>>>> +		if (skip_swapcache) {
>>>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>>>  					vmf->address);
>>>>  			if (page) {
>>>> --
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
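To spell out the ordering argument in the commit message, a condensed
sketch of the patched flow (not the complete do_swap_page(), and the
ordering reasoning is my reading of it):

/* Pa, the process doing the real swapin: */
page = swapin_readahead(...);   /* adds the page to the swap cache */
...
swap_free(entry);               /* swap_count drops only afterwards */

/* Pb, with this patch, samples the count before its lookup: */
si = swp_swap_info(entry);
if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
        skip_swapcache = true;
page = lookup_swap_cache(entry, vma, vmf->address);

If Pb sees swap_count > 1, it takes the readahead path and finds Pa's
page in the swap cache. If Pb sees swap_count == 1, Pa's swap_free() has
already run, which means Pa's swap cache insertion happened before it,
so the lookup finds the page and the SWP_SYNCHRONOUS_IO shortcut is not
taken against an already freed zram slot.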