Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()

David Hildenbrand <david@xxxxxxxxxx> · Sat, 28 Sep 2019 11:06:56 +0200

On 28.09.19 00:17, Alexander Duyck wrote:
> On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@xxxxxx> wrote:
>>
>>>>
>>>> So I think you've moved the arch_free_page() to be after the final
>>>> thing which can access page contents, yes?  If so, we should have a
>>>> comment in free_pages_prepare() to attmept to prevent this problem from
>>>> reoccurring as the code evolves?
>>>
>>> Right, something like this above arch_free_page() there?
>>>
>>> /*
>>>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>>>  * pages unused and then trigger a fault when accessing.
>>>  */
>>
>> I did this.
>>
>> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
>> +++ a/mm/page_alloc.c
>> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
>>                 kernel_init_free_pages(page, 1 << order);
>>
>>         kernel_poison_pages(page, 1 << order, 0);
>> +       /*
>> +        * arch_free_page() can make the page's contents inaccessible.  s390
>> +        * does this.  So nothing which can access the page's contents should
>> +        * happen after this.
>> +        */
>>         arch_free_page(page, order);
>> +
>>         if (debug_pagealloc_enabled())
>>                 kernel_map_pages(page, 1 << order, 0);
>>
> 
> So the question I would have is what is the state of the page after it
> has been marked unused and then pulled back in? I'm assuming it will
> be all 0s.

I think this comment relates to the s390x implementation, so I'll try to
explain that. After arch_free_page() the page might have been zapped in
the hypervisor, but that might happen deferred. The guest ends up
triggering the ESSA instruction in arch_free_page(). That instruction
sets some extended-page-table-related ("pgste") bits in the hypervisor
tables for the guest ("gmap") and fills a buffer with these entries. The
page is marked _PGSTE_GPS_USAGE_UNUSED.

Once the buffer is full, the ESSA instruction intercepts to the
hypervisor, where the hypervisor can go over all recorded entries and
zap them *in case* the extended-page-table-related bits still indicate
that the page is unused by the guest (_PGSTE_GPS_USAGE_UNUSED) or is
marked to be logically zero (_PGSTE_GPS_ZERO). Zapping a page only
happens if it's a pte_swap(pte) entry and effectively triggers a
ptep_zap_unused() -> ptep_zap_swap_entry() -> free_swap_and_cache(). So
I think it will be backed with the zero page when pulled back in.

arch_alloc_page() will similarly trigger the ESSA instruction but only
set the extended-page-table-related bits, so the entry is no longer
_PGSTE_GPS_USAGE_UNUSED. This is basically to make sure a page won't get
zapped in the hypervisor while it is already getting used by the guest
again.

The implementation on the KVM side resides in
arch/s390/kvm/priv.c:handle_essa() but more importantly in
arch/s390/kvm/priv.c:__do_essa() ("pure interpretation path skipping
hardware interpretation completely").

Now, one interesting thing resides in
arch/s390/kvm/priv.c:pgste_perform_essa():
	/* If we are discarding a page, set it to logical zero */
	if (res)
		pgstev |= _PGSTE_GPS_ZERO;

So whenever we do an arch_free_page() in the guest, the page will
immediately also be set in the hypervisor to _PGSTE_GPS_ZERO. However, I
think setting the page logically to zero is just an "extended HW state"
and will not actually result in the page reading zeroes before we
actually zap it. I might be wrong and I only see one place where
_PGSTE_GPS_ZERO actually gets cleared again, especially when setting a
page stable (which looks bogus but as the documentation is confidential
I have no idea what's happening there).

Long story short: I think there is *no guarantee* that
a) After arch_free_page(), the page is actually zeroed-out
b) After arch_free_page() the page has been zapped in the hypervisor or
   will get zapped.
c) After arch_alloc_page(), the page was actually zeroed-out.

I might be wrong, depending on how _PGSTE_GPS_ZERO is actually used.

However, *if* the page was zapped in the hypervisor
(free_swap_and_cache()), I think it will get populated using the zero-page.

Also, please not that s390x requires an arch_alloc_page() after an
arch_free_page(). You cannot simply go ahead and reuse the page after
arch_alloc_page().

> 
> I know with the work I am still doing on marking pages as unused this
> ends up being an additional item that we will need to pay attention
> to, however in our case we will just be initializing the page as zero
> if we end up evicting it from the guest.

Please note that if you are using MADV_FREE instead of MADV_DONTNEED in
the hypervisor, you might end up with the same guarantees that s390x'
implementation gives you. Could be, that the page was not zapped/zeroed
out on the next access. Depends on if the hypervisor was feeling like
zapping entries marked using MADV_FREE.

-- 

Thanks,

David / dhildenb