On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@xxxxxxxxxx> writes:
> > The bug here is quite simple: split_swap_cluster() misses checking for
> > lock_cluster() returning NULL before committing to change
> > cluster_info->flags.
>
> I don't think so. We shouldn't run into this situation firstly. So the
> "fix" hides the real bug instead of fixing it. Just like we call
> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> instead of returning if !PageLocked(head) silently.
>

Not the same thing, obviously, as you are going for an apples-to-carrots
comparison, but since you mentioned it:

split_huge_page_to_list() asserts (in debug builds) that *page is locked,
and later checks whether *head bears the SwapCache flag.
deferred_split_scan(), OTOH, doesn't hand down the compound head locked,
but the 2nd page in the group instead.
This isn't necessarily a problem in itself, but it might help in hitting
the issue.

> > The fundamental problem has nothing to do with allocating, or not
> > allocating a swap cluster, but it has to do with the fact that the THP
> > deferred split scan can transiently race with swapcache insertion, and
> > the fact that when you run your swap area on rotational storage
> > cluster_info is _always_ NULL.
> > split_swap_cluster() needs to check for lock_cluster() returning NULL
> > because that's one possible case, and it clearly fails to do so.
>
> If there's a race, we should fix the race. But the code path for
> swapcache insertion is,
>
> add_to_swap()
>   get_swap_page() /* Return if fails to allocate */
>   add_to_swap_cache()
>     SetPageSwapCache()
>
> While the code path to split THP is,
>
> split_huge_page_to_list()
>   if PageSwapCache()
>     split_swap_cluster()
>
> Both code paths are protected by the page lock. So there should be some
> other reasons to trigger the bug.

As mentioned above, no, they don't seem to be protected (at least, not by
a lock on the same page, depending on the case). While add_to_swap() makes
sure the compound head is locked, split_huge_page_to_list() does not.

> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
> first place. If so, the bug is that the flag is set and we should fix
> the setting.
>

I fail to follow your claim here. Where is the guarantee, in the code,
that you'll never have a compound head in the swapcache?

> > Run a workload that cause multiple THP COW, and add a memory hogger to
> > create memory pressure so you'll force the reclaimers to kick the
> > registered shrinkers. The trigger is not heavy swapping, and that's
> > probably why most swap test cases don't hit it. The window is tight,
> > but you will get the NULL pointer dereference.
>
> Do you have a script to reproduce the bug?
>

Nope, a convoluted set of internal regression tests we have usually
triggers it. In the wild, customers running HANNA are seeing it,
occasionally.

> > Regardless you find further bugs, or not, this patch is needed to
> > correct a blunt coding mistake.
>
> As above. I don't agree with that.
>

It's OK to disagree; split_swap_cluster() still misses the cluster_info
NULL check, though.
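
To make the missing check concrete, here is a minimal userspace sketch of
the pattern being argued about. It is not the kernel code and not the
patch itself: every model_* name, the MODEL_* constants, and the struct
layouts are hypothetical stand-ins made up for illustration, and the
"lock" is just an int. The only things it mirrors from the discussion are
that cluster_info can be NULL (the rotational storage case), that the lock
helper then returns NULL, and that the caller has to check for that before
touching the flags.

```c
/* Userspace model only: simplified stand-ins, NOT the kernel definitions. */
#include <assert.h>
#include <stddef.h>

#define MODEL_CLUSTER_SIZE 256UL  /* assumed slots per cluster, model only */
#define MODEL_FLAG_HUGE    0x1u   /* assumed "huge cluster" flag bit */

struct model_cluster_info {
	unsigned int flags;
	int locked;               /* stands in for the per-cluster spinlock */
};

struct model_swap_info {
	/* NULL models a swap area that has no cluster array at all. */
	struct model_cluster_info *cluster_info;
};

/* Returns the locked cluster for offset, or NULL if there are no clusters. */
struct model_cluster_info *model_lock_cluster(struct model_swap_info *si,
					      unsigned long offset)
{
	struct model_cluster_info *ci = si->cluster_info;

	if (ci) {
		ci += offset / MODEL_CLUSTER_SIZE;
		ci->locked = 1;
	}
	return ci;
}

/* Tolerates NULL so callers can pair it with model_lock_cluster(). */
void model_unlock_cluster(struct model_cluster_info *ci)
{
	if (ci)
		ci->locked = 0;
}

/*
 * The point under discussion: only clear the flag when the lock helper
 * actually returned a cluster. Without the "if (ci)" test, the NULL case
 * would dereference a NULL pointer.
 */
int model_split_swap_cluster(struct model_swap_info *si, unsigned long offset)
{
	struct model_cluster_info *ci = model_lock_cluster(si, offset);

	if (ci)
		ci->flags &= ~MODEL_FLAG_HUGE;
	model_unlock_cluster(ci);
	return 0;
}
```

Whether silently tolerating the NULL case is the right behaviour, or
whether it should be flagged as a "can't happen" condition, is exactly the
disagreement above; the sketch only shows what the check looks like.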