Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG

Nadav Amit <nadav.amit@xxxxxxxxx> · Wed, 15 Jun 2022 12:42:13 -0700

On Jun 15, 2022, at 11:39 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:

> On Wed, Jun 15, 2022 at 09:58:27AM -0700, Nadav Amit wrote:
>> On Jun 15, 2022, at 8:43 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>> 
>>> On Wed, Jun 15, 2022 at 10:26:21AM +0300, Mike Rapoport wrote:
>>>> On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
>>>>> On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@xxxxxxxxxx> wrote:
>>>>> 
>>>>>> On 6/14/22 11:56, Mike Rapoport wrote:
>>>>>>>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>>>>>>>> crazy. I would like to collect all the possible arguments that are used for
>>>>>>>> uffd operation into some “struct uffd_op”.
>>>>>>> Squashing boolean parameters into int flags will also reduce the insane
>>>>>>> amount of parameters. No strong feelings though.
>>>>>> 
>>>>>> Just a quick drive-by comment about boolean arguments: they ruin the
>>>>>> readability of the call sites. In practice, sometimes a single boolean
>>>>>> argument can be OK-ish (still poor to read at the call site, but easier
>>>>>> to code initially), but once you get past one boolean argument in the
>>>>>> function, readability is hopeless:
>>>>>> 
>>>>>>  foo(ptr, true, false, a == b);
>>>>>> 
>>>>>> So if you have a choice, I implore you to prefer flags and/or enums. :)
>>>>> 
>>>>> Thanks for the feedback - I am aware it is very confusing to have booleans
>>>>> and especially multiple ones in a func call.
>>>>> 
>>>>> Just not sure how it maps to what I proposed. I thought of passing as an
>>>>> argument reference (pointer) to something similar to the following struct,
>>>>> which I think is very self-descriptive:
>>>>> 
>>>>> struct uffd_op {
>>>>> 	/* various fields */
>>>>> 	struct vm_area_struct *dst_vma;
>>>>> 	unsigned long len;
>>>>> 	atomic_t *mmap_changing;
>>>>> 
>>>>> 	...
>>>>> 	
>>>>> 	/* ... and some flags */
>>>>> 	int wp: 1;
>>>>> 	int zero: 1;
>>>>> 	int read_likely: 1;
>>>>> 
>>>>> 	...
>>>>> };
>>>>> 
>>>>> I think that fits what you were asking for. The only thing I am not sure of,
>>>>> is whether to include in uffd_op fields that are internal to mm/userfaultfd
>>>>> such as “page” and “newly_allocated”. I guess not.
>>>> 
>>>> mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
>>>> not entirely internal to mm/userfaultfd.c.
>>>> 
>>>> Another thing is that with all the parameters packed into a struct, the
>>>> call sites could become really hairy, so maybe the best way would be to
>>>> pack some of the parameters and leave the others.
>>>> 
>>>> But you'll never know until you try :)
>>> 
>>> Yeh.  Axel packed some booleans in f619147104c8e into mcopy_atomic_mode.
>>> The other option (besides uffd_ops) could be making mcopy_atomic_mode a
>>> bitmask and keep the rest, the mode itself only took 2 bits.
>>> 
>>> uffd_ops sounds good too if the final outcome looks clean, since we do pass
>>> quite a few things over and over deep into the stack.
>> 
>> Thanks.
>> 
>> I see 3 options:
>> 
>> 1. Pack only fs/mm flags: WP, read-likely, write-likely.
>> 2. (1) + as part of the flags internally include Axel’s copy_atomic_mode.
>> 3. The uffd_op approach: include all relevant fields.
>> 
>> For the time being I’m going with (1) since I do not have too much time to
>> finish all of that and upstream the rest of my work (Broadcom is knocking).
>> 
>> (3) also has the downside of stack-protector that would be added due to
>> stack-protector strong, which is not-that-bad, but I hate it.
> 
> Any short (but further) explanations?

Sure, but it might not be short.

So one time Mathew Wilcox tried to convert some function arguments into a
struct that would hold the arguments and transfer it as a single argument,
and he found out that the binary size actually increased. Since I like to
waste my time, I analyzed why.

IIRC, there were two main reasons.

The first one is that the kernel by default uses the “strong”
stack-protector. It means that not all functions would have a stack
protector, and actually only quite few would. One main reason that you have
a stack protector is that you provide dereference a local variable. So if
you have a local struct that hold the arguments - you get a stack protector,
and it does introduce slight overhead (~10ns IIRC). There may be some pragma
to prevent the stack protector, but clearly I will be shot if I used it.

The second reason is that the compiler either reloads data from the struct
you use to hold the arguments or might spill it to the stack if you try to
cache it.

Consider the following two scenarios:

A. You access an argument multiple times:

	local1 = args->arg1;
	another_fn();	// Or some store to the heap
	local2 = args->arg1;

	// You use local1 and local2

In this case the compiler would reload args->arg1 from memory, even if there
is a register that holds the value. The compiler is concerned that another_fn()
might have overwritten args->arg1 or - in the case of a store - that the value
was overwritten. The reload might prevent further optimizations of the compiler.

B. You cache the argument locally (aka, you decided to be “smart”):

	arg1 = args->arg1;
	local1 = arg1;
	another_fn();
	local2 = arg1;

You may think that this prevents the reload. But this might even be worse.
The compiler might run out of registers spill arg1 to the stack and then
access it from the stack. Or it might need to spill something else, or
shuffle registers around.

So what can you do? You can mark another_fn() as pure if it is so, which
does help in very certain cases. There are various limitations though. IIRC,
gcc (or is it clang?) ignores it for inline functions. So if you have an
inline function which does some write that you don’t care about you cannot
force the compiler to ignore it.

Note that at least gcc (IIRC) regards inline assembly as something that might
write to arbitrary memory address. So having BUG_ON() would require a reload
of the argument from the struct.

You may say: “what about the restrict keyword?” Well, this is nice in
theory, but it does not always help and the implementation of gcc and clang
are not even the same in this regard. IIRC, in one of them (I forgot which),
the restrict keyword will prevent the reload if instead another_fn() there
was a store, but once you call a different function, that compiler says “all
bets are off, I ignore the restrict” and would reload/spill arg1 (depending
on A or B).

>> Three more points for consideration in future cleanups:
>> 
>> 1. This __always_inline thingy is crazy IMHO. The size of the compilation
>> unit is almost double because of it, and I saw no explanation for its use in
>> the commit log (unless I missed it). The overheads in userfaultfd are mostly
>> due to memory copying, scheduling, IPIs.
> 
> Do you meant __mcopy_atomic()?  I agree it's suspicious.  That's a huge
> function with three callers.  Even if to mark it inline I think it's more
> proper to mark it at the three callers, or I'd think we can leave that for
> the compilers.

Among others. But yes, __mcopy_atomic() is the craziest one.

> 
>> 2. I think it makes more sense to strive not to have more than 6 arguments
>> for each function (as supported in registers on x86). For that it is possible
>> to get rid of dst_mm when it can be retrieved from dst_vma. Anyhow we access
>> dst_vma->vm_flags which share a cache-line with dst_vma->vm_mm.
>> 
>> 3. These BUG_ON()s all around are also ... excessive. I guess they were
>> introduced before the age in which Linus got angry on each BUG_ON(). Is
>> there any good reason not to change them into VM_BUG_ON()?
> 
> I'm not sure how that was handled, I'd personally think it's fine to keep
> them if they're there for ages.  They're gone already if they can be
> triggered in any bad ways anyway..  But for sure we'd keep an eye on new
> ones when reviewing.

Whatever. They might just cause slightly less optimized code to be
generated.