Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

Rafael Aquini <aquini@xxxxxxxxxx> writes:

> On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
>> Rafael Aquini <aquini@xxxxxxxxxx> writes:
>> > The bug here is quite simple: split_swap_cluster() misses checking for
>> > lock_cluster() returning NULL before committing to change cluster_info->flags.
>> 
>> I don't think so.  We shouldn't run into this situation in the first
>> place.  So the "fix" hides the real bug instead of fixing it.  Just like
>> we call VM_BUG_ON_PAGE(!PageLocked(head), head) in
>> split_huge_page_to_list() instead of silently returning if
>> !PageLocked(head).
>>
>
> Not the same thing, obviously, as you are going for an apples-to-carrots
> comparison, but since you mentioned:
>
> split_huge_page_to_list() asserts (in debug builds) *page is locked,

	VM_BUG_ON_PAGE(!PageLocked(head), head);

It asserts *head instead of *page.

> and later checks if *head bears the SwapCache flag. 
> deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> but the 2nd page in the group instead.

No.  deferred_split_scan() will call trylock_page() on the 2nd page in
the group, but

static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

So the head page will be locked instead.
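
To make the point concrete, here is a small user-space model of that
behaviour.  It is only a sketch: struct model_page and the model_*()
helpers are hypothetical stand-ins, not the kernel's struct page or the
real compound_head()/trylock_page(), and the lock bit is a plain
non-atomic flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page (not the kernel layout). */
struct model_page {
	struct model_page *head;	/* NULL for a head page, else the head */
	bool locked;			/* stands in for the PG_locked bit */
};

/* Model of compound_head(): a tail page resolves to its head page. */
struct model_page *model_compound_head(struct model_page *page)
{
	return page->head ? page->head : page;
}

/*
 * Model of trylock_page(): the lock bit is always taken on the head,
 * whichever page of the compound the caller passes in.
 * Single-threaded model, so no atomics are used here.
 */
bool model_trylock_page(struct model_page *page)
{
	page = model_compound_head(page);
	if (page->locked)
		return false;
	page->locked = true;
	return true;
}

/*
 * Build a 4-page compound, trylock it through the 2nd page, and report
 * whether the lock landed on the head (and not on the 2nd page itself).
 */
bool model_tail_trylock_locks_head(void)
{
	struct model_page pages[4] = { 0 };
	int i;

	for (i = 1; i < 4; i++)
		pages[i].head = &pages[0];

	if (!model_trylock_page(&pages[1]))
		return false;

	return pages[0].locked && !pages[1].locked;
}
```

In this model, locking through any tail page and locking through the
head contend on the same bit, which is why the two code paths end up
serialized on the head page.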

> This doesn't necessarily mean it's a problem, though, but it might help
> in hitting the issue. 
>  
>> > The fundamental problem has nothing to do with allocating, or not allocating
>> > a swap cluster, but it has to do with the fact that the THP deferred split scan
>> > can transiently race with swapcache insertion, and the fact that when you run
>> > your swap area on rotational storage cluster_info is _always_ NULL.
>> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
>> > that's one possible case, and it clearly fails to do so.
>> 
>> If there's a race, we should fix the race.  But the code path for
>> swapcache insertion is,
>> 
>> add_to_swap()
>>   get_swap_page() /* Return if fails to allocate */
>>   add_to_swap_cache()
>>     SetPageSwapCache()
>> 
>> While the code path to split THP is,
>> 
>> split_huge_page_to_list()
>>   if PageSwapCache()
>>     split_swap_cluster()
>> 
>> Both code paths are protected by the page lock.  So there should be some
>> other reasons to trigger the bug.
>
> As mentioned above, no they seem to not be protected (at least, not the
> same page, depending on the case). While add_to_swap() will assure a 
> page_lock on the compound head, split_huge_page_to_list() does not.
>
>
>> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
>> first place.  If it does, the bug is that the flag is set and we should
>> fix the setting.
>> 
>
> I fail to follow your claim here. Where is the guarantee, in the code, that 
> you'll never have a compound head in the swapcache? 

A THP can be in the swap cache only if a non-rotational disk is used as
the swap device.  This is a design assumption of the THP swap support.
It is guaranteed because swap space allocation for a THP fails on HDD.
If the implementation doesn't guarantee this, we should fix the
implementation so that it does.
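
A rough user-space model of that assumption follows.  Everything in it
is hypothetical and simplified: the model_* names are made up for
illustration, the 512-page THP size is an assumption (2MB THP with 4KB
base pages), and the real allocator is more involved.  It only shows the
intended invariant: with no cluster_info (the HDD case), a THP-sized
allocation fails, so the swap cache flag is never set on the THP.

```c
#include <stdbool.h>

#define MODEL_THP_NR_PAGES 512	/* assumption: 2MB THP, 4KB base pages */

/* Hypothetical stand-in for the per-device swap state. */
struct model_swap_device {
	bool rotational;
	bool has_cluster_info;	/* models si->cluster_info != NULL */
};

/* On rotational storage no cluster_info is set up in this model. */
void model_swap_device_init(struct model_swap_device *si, bool rotational)
{
	si->rotational = rotational;
	si->has_cluster_info = !rotational;
}

/*
 * Model of swap slot allocation: returns the number of slots allocated,
 * or 0 on failure.  A THP needs a whole cluster, so it cannot be
 * satisfied without cluster_info.
 */
int model_get_swap_slots(const struct model_swap_device *si, int nr_pages)
{
	if (nr_pages == MODEL_THP_NR_PAGES && !si->has_cluster_info)
		return 0;
	return nr_pages;
}

/*
 * Model of add_to_swap(): returns true only if the page would end up
 * with the swap cache flag set, i.e. only if allocation succeeded.
 */
bool model_add_to_swap(const struct model_swap_device *si, int nr_pages)
{
	if (model_get_swap_slots(si, nr_pages) == 0)
		return false;	/* return early, flag never set */
	return true;
}
```

If a THP with the swap cache flag set is ever seen on an HDD-backed swap
device, this invariant has been broken somewhere, and that is the place
to look.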

>> > Run a workload that cause multiple THP COW, and add a memory hogger to create
>> > memory pressure so you'll force the reclaimers to kick the registered
>> > shrinkers. The trigger is not heavy swapping, and that's probably why
>> > most swap test cases don't hit it. The window is tight, but you will get the
>> > NULL pointer dereference.
>> 
>> Do you have a script to reproduce the bug?
>> 
>
> Nope, a convoluted set of internal regression tests we have usually
> triggers it. In the wild, customers running HANNA are seeing it,
> occasionally.

So you haven't reproduced the bug on an upstream kernel?

Or could you help run the test with a debug kernel based on the upstream
kernel?  I can provide a debug patch.

>> > Regardless of whether you find further bugs or not, this patch is needed
>> > to correct a blunt coding mistake.
>> 
>> As above.  I don't agree with that.
>> 
>
> It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> though.

On the contrary, if the check is necessary, we shouldn't silently ignore
the failure, but use something like

        ci = lock_cluster(si, offset);
+       VM_BUG_ON(!ci);
	cluster_clear_huge(ci);

in split_swap_cluster() to enforce the check and report the bug as early
as possible.  But this appears unnecessary now, because the NULL access in
cluster_clear_huge() already reports it.

Best Regards,
Huang, Ying



