Re: [PATCH 0/6] mm: split underutilized THPs

David Hildenbrand <david@xxxxxxxxxx> · Tue, 6 Aug 2024 19:30:30 +0200

On 06.08.24 19:17, Usama Arif wrote:

On 05/08/2024 00:04, Usama Arif wrote:

On 01/08/2024 07:36, David Hildenbrand wrote:
I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:

On the destination:

Writing received pages during precopy # ram_load_precopy()
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
Discarding pages # loadvm_postcopy_ram_handle_discard()
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Registering UFFD # postcopy_ram_incoming_setup()

Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow in code in qemu the same sequence of events that the above prints show.

I just added another printf to postcopy_ram_supported_by_host(), where we temporarily do a UFFDIO_REGISTER on some test area.

Sensing UFFD support # postcopy_ram_supported_by_host()
Sensing UFFD support
Writing received pages during precopy # ram_load_precopy()
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
Discarding pages # loadvm_postcopy_ram_handle_discard()
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Registering UFFD # postcopy_ram_incoming_setup()

We could think about using this "ever user uffd" to avoid the shared zeropage in most processes.

Of course, there might be other applications where that wouldn't work, but I think this behavior (write to area before enabling uffd) might be fairly QEMU specific already.

Avoiding the shared zeropage has the benefit that a later write fault won't have to do a TLB flush and can simply install a fresh anon page.

I checked CRIU and that does a check at the start as well before attempting to use uffd: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L1349

If writing to an area before enabling uffd is likely to be QEMU specific, then you make a good point to clear pte instead of using shared zeropage to avoid the TLB flush if uffd is ever used.

I think "ever used uffd" would need to be tracked using mm_struct. This also won't cause an issue if the check is done in a parent process and the actual use is in a forked process, as copy_mm should take care of it.
The possibilities would then be:
1) Have a new bit in mm->flags, set it in new_userfaultfd and test it in try_to_unmap_unused, but unfortunately all the bits in mm->flags are taken.
2) We could use mm->def_flags as it looks like there is an unused bit (0x800) just before VM_UFFD_WP. But that makes the code confusing as its used to initialize the default flags for VMAs and is not supposed to be used as a "mm flag".
3) Introducing mm->flags2 and set/test as 1. This would introduce a 8 byte overhead for all mm_structs.

I am not sure either 2 or 3 are acceptable upstream, unless there is a need for more flags in the near future and the 8 byte overhead starts to make sense. Maybe we go with shared zeropage?
 >
There is an another option to use bit 32 of mm->flags for 64 bit kernel only for ever_used_uffd, but considering the 2 reasons below, I will send a v2 of the series (in a few days incase any more comments come up) with shared zeropage in all circumstances (and addressing the comments in the other patches).
- "ever used uffd" is not a 100% safe, i.e. someone might not check uffd support before using it and do the same sequence of events as qemu precopy + postcopy (+ some bitmap to track and check whether to request a page from uffd handler). Its very unlikely that anyone else does this, but we have to cater for all current and future usecases.

Possible, but IMHO unlikely :) I'd be willing to take a risk here if we 
make sure CRIU is also fine.

- If THP shrinker is splitting and pointing pages to a shared zeropage, then the page was considered "unused" and unlikely to have a write fault at some point in the near future, hence the probability of incurring that TLB flush on write fault is low.

Surely evidence might help. But it's easy to find cases where this might 
not be the true: populated a THP for a VM and the VM only used some of 
the memory. Then we split, then the VM wants to use the remaining pieces.

Reading your prior email I just wanted to say that mm->flags is unsigned 
long and you should just make the shrinker 64bit-specific and use bit 33 
in mm->flags.

But starting with "all shared zeropage" as a first step also works.

--
Cheers,

David / dhildenb