Hi Michal, Thanks for reviewing this. On 9/2/2019 6:51 PM, Michal Hocko wrote: > On Fri 30-08-19 18:13:31, Vinayak Menon wrote: >> The following race is observed due to which a processes faulting >> on a swap entry, finds the page neither in swapcache nor swap. This >> causes zram to give a zero filled page that gets mapped to the >> process, resulting in a user space crash later. >> >> Consider parent and child processes Pa and Pb sharing the same swap >> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set. >> Virtual address 'VA' of Pa and Pb points to the shared swap entry. >> >> Pa Pb >> >> fault on VA fault on VA >> do_swap_page do_swap_page >> lookup_swap_cache fails lookup_swap_cache fails >> Pb scheduled out >> swapin_readahead (deletes zram entry) >> swap_free (makes swap_count 1) >> Pb scheduled in >> swap_readpage (swap_count == 1) >> Takes SWP_SYNCHRONOUS_IO path >> zram enrty absent >> zram gives a zero filled page > This sounds like a zram issue, right? Why is a generic swap path changed > then? I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal. This is because zram avoids lazy swap slot freeing by implementing gendisk->fops->swap_slot_free_notify and swap_slot_free_notify deletes the zram entry because the page is in swapcache. The issue is that Pb attempted a swapcache lookup before Pa brought the page to swapcache, and failed. If Pb had taken the swapin_readahead path, __read_swap_cache_async would have performed a second lookup and found the page in swapcache. The issue here is that due to the lookup failure and swap_count being 1, it takes the SWP_SYNCHRONOUS_IO path and does a direct read which is bound to fail. So it seems to me as a problem in the way SWP_SYNCHRONOUS_IO is handled in do_swap_page, and not a problem with zram. Any swap device that sets SWP_SYNCHRONOUS_IO and implements swap_slot_free_notify can hit this bug. do_swap_page first brings in the page to swapcache and then decrements the swap_count, and SWP_SYNCHRONOUS_IO code in do_swap_page performs the swapcache and swap_count checks in the same order. Due to thread preemption described in the sequence above, it can happen that the SWP_SYNCHRONOUS_IO path fails in swapcache check, but sees the swap_count decremented later, thus missing a valid swapcache entry. I have not tested, but the following patch may also fix the issue. diff --git a/include/linux/swap.h b/include/linux/swap.h index 063c0c1..a5ca05f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); extern int page_swapcount(struct page *); extern int __swap_count(swp_entry_t entry); +extern bool __swap_has_cache(swp_entry_t entry); extern int __swp_swapcount(swp_entry_t entry); extern int swp_swapcount(swp_entry_t entry); extern struct swap_info_struct *page_swap_info(struct page *); @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry) return 0; } +static bool __swap_has_cache(swp_entry_t entry) +{ + return 0; +} + static inline int __swp_swapcount(swp_entry_t entry) { return 0; diff --git a/mm/memory.c b/mm/memory.c index e0c232f..a13511f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct swap_info_struct *si = swp_swap_info(entry); if (si->flags & SWP_SYNCHRONOUS_IO && - __swap_count(entry) == 1) { + __swap_count(entry) == 1 && + !__swap_has_cache(entry)) { /* skip swapcache */ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); diff --git a/mm/swapfile.c b/mm/swapfile.c index 80445f4..2a1554a8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry) return count; } +bool __swap_has_cache(swp_entry_t entry) +{ + struct swap_info_struct *si; + pgoff_t offset = swp_offset(entry); + bool has_cache = false; + + si = get_swap_device(entry); + if (si) { + has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE); + put_swap_device(si); + } + return has_cache; +} + static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) { int count = 0; > >> Fix this by reading the swap_count before lookup_swap_cache, which conforms >> with the order in which page is added to swap cache and swap count is >> decremented in do_swap_page. In the race case above, this will let Pb take >> the readahead path and thus pick the proper page from swapcache. >> >> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>