On Thu, Sep 24, 2020 at 03:45:52PM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@xxxxxxxxxx> writes:
>
> > On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> >> Rafael Aquini <aquini@xxxxxxxxxx> writes:
> >> > The bug here is quite simple: split_swap_cluster() misses checking for
> >> > lock_cluster() returning NULL before committing to change cluster_info->flags.
> >>
> >> I don't think so.  We shouldn't run into this situation in the first
> >> place.  So the "fix" hides the real bug instead of fixing it.  Just like
> >> we call VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> >> instead of returning if !PageLocked(head) silently.
> >>
> >
> > Not the same thing, obviously, as you are going for an apples-to-carrots
> > comparison, but since you mentioned:
> >
> > split_huge_page_to_list() asserts (in debug builds) *page is locked,
> >   VM_BUG_ON_PAGE(!PageLocked(head), head);
>
> It asserts *head instead of *page.
>
> > and later checks if *head bears the SwapCache flag.
> > deferred_split_scan(), OTOH, doesn't hand down the compound head locked,
> > but the 2nd page in the group instead.
>
> No.  deferred_split_scan() will call trylock_page() on the 2nd page in
> the group, but
>
> static inline int trylock_page(struct page *page)
> {
> 	page = compound_head(page);
> 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
> }
>
> So the head page will be locked instead.
>

Yep, missed that. Thanks for straightening me out on this one.

> > This doesn't necessarily mean it's a problem, though, but might help
> > on hitting the issue.
> >
> >> > The fundamental problem has nothing to do with allocating, or not allocating
> >> > a swap cluster, but it has to do with the fact that the THP deferred split scan
> >> > can transiently race with swapcache insertion, and the fact that when you run
> >> > your swap area on rotational storage cluster_info is _always_ NULL.
> >> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
> >> > that's one possible case, and it clearly fails to do so.
> >>
> >> If there's a race, we should fix the race.  But the code path for
> >> swapcache insertion is,
> >>
> >> add_to_swap()
> >>   get_swap_page() /* Return if fails to allocate */
> >>   add_to_swap_cache()
> >>     SetPageSwapCache()
> >>
> >> While the code path to split THP is,
> >>
> >> split_huge_page_to_list()
> >>   if PageSwapCache()
> >>     split_swap_cluster()
> >>
> >> Both code paths are protected by the page lock.  So there should be some
> >> other reasons to trigger the bug.
> >
> > As mentioned above, no, they seem not to be protected (at least, not the
> > same page, depending on the case). While add_to_swap() will assure a
> > page_lock on the compound head, split_huge_page_to_list() does not.
> >
> >
> >> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
> >> first place.  If so, the bug is that the flag is set and we should fix
> >> the setting.
> >>
> >
> > I fail to follow your claim here. Where is the guarantee, in the code, that
> > you'll never have a compound head in the swapcache?
>
> We may have a THP in the swap cache only if a non-rotational disk is used
> as the swap device.  This is the design assumption of the THP swap support.
> And this is guaranteed because swap space allocation for THP will fail for
> HDD.  If the implementation doesn't guarantee this, we will fix the
> implementation to guarantee this.
>
> >> > Run a workload that causes multiple THP COW, and add a memory hogger to create
> >> > memory pressure so you'll force the reclaimers to kick the registered
> >> > shrinkers. The trigger is not heavy swapping, and that's probably why
> >> > most swap test cases don't hit it. The window is tight, but you will get the
> >> > NULL pointer dereference.
> >>
> >> Do you have a script to reproduce the bug?
> >>
> >
> > Nope, a convoluted set of internal regression tests we have usually
> > triggers it. In the wild, customers running HANNA are seeing it,
> > occasionally.
>
> So you haven't reproduced the bug on an upstream kernel?
>

Have you seen the stack dump in the patch? It still reproduces with v5.9,
even though the rate is a lot lower than with earlier kernels.

> Or, can you help to run the test with a debug kernel based on the upstream
> kernel?  I can provide some debug patch.
>

Sure, I can set your patches to run with the test cases we have that tend to
reproduce the issue with some degree of success.

> >> > Regardless of whether you find further bugs or not, this patch is needed
> >> > to correct a blunt coding mistake.
> >>
> >> As above.  I don't agree with that.
> >>
> >
> > It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> > though.
>
> In contrast, if the checking is necessary, we shouldn't ignore it, but
> use something like
>
>   ci = lock_cluster(si, offset);
> + VM_BUG_ON(!ci);

Wrong. This will still allow for a NULL ptr dereference on non-debug builds.
If ci can be NULL -- and it clearly can, we need to protect
cluster_clear_huge(ci) against that.
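
To make the point concrete, here is a minimal userspace sketch of the guard
being argued for. It is not the actual patch: the structures are cut-down
stand-ins, the flag value and cluster size are illustrative, the locking is
elided, and split_swap_cluster_sketch() is a hypothetical name with a
simplified signature. It only shows that when cluster_info is NULL (the
rotational storage case), lock_cluster() hands back NULL and the flag update
has to be skipped.

```c
#include <stddef.h>

/* Illustrative values only; assumptions, not copied from the kernel tree. */
#define CLUSTER_FLAG_HUGE 4
#define SWAPFILE_CLUSTER  256

/* Cut-down stand-ins for the kernel structures (no spinlock, no data field). */
struct swap_cluster_info {
	unsigned int flags;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info; /* NULL on rotational storage */
};

/* Returns the cluster covering offset, or NULL if there is no cluster_info. */
struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
				       unsigned long offset)
{
	struct swap_cluster_info *ci = si->cluster_info;

	if (ci)
		ci += offset / SWAPFILE_CLUSTER; /* the kernel would take ci->lock here */
	return ci;
}

/* Tolerates NULL, so callers may pass through whatever lock_cluster() returned. */
void unlock_cluster(struct swap_cluster_info *ci)
{
	(void)ci; /* the kernel would drop ci->lock here when ci is non-NULL */
}

/* Dereferences ci unconditionally, which is why the caller must guard it. */
void cluster_clear_huge(struct swap_cluster_info *ci)
{
	ci->flags &= ~CLUSTER_FLAG_HUGE;
}

/* Hypothetical, simplified variant of split_swap_cluster() with the NULL guard. */
int split_swap_cluster_sketch(struct swap_info_struct *si, unsigned long offset)
{
	struct swap_cluster_info *ci = lock_cluster(si, offset);

	if (ci) /* skip the flag update when there is no cluster to update */
		cluster_clear_huge(ci);
	unlock_cluster(ci);
	return 0;
}
```

Unlike VM_BUG_ON(!ci), which compiles away on non-debug builds, the explicit
check stays in place on every build.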