Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly

Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> · Mon, 31 Oct 2022 09:58:41 -0700

On Sat, Oct 29, 2022 at 10:23:02AM -0400, Mathieu Desnoyers wrote:
> On 2022-10-28 18:35, Beau Belgrave wrote:
> > On Fri, Oct 28, 2022 at 06:07:34PM -0400, Mathieu Desnoyers wrote:
> > > On 2022-10-27 18:40, Beau Belgrave wrote:
> 
> [...]
> 
> > 
> > > I'm still slightly unsure about using "uint32_t" for enable check, or going
> > > for "unsigned long". The core question there is whether a 32-bit word test
> > > would cause partial register stalls on 64-bit architectures. Going for
> > > unsigned long would require that user events receives information about the
> > > bitness of the word as input from userspace. (bit=63 rather than 31).
> > > Perhaps this is something the user events ABI should accommodate by
> > > reserving more than 5 bits to express the target bit ?
> > > 
> > 
> > Yeah, I thought about this. From the user side you can actually use any
> > arbitrary bits you want by passing a 32-bit aligned value. So you could
> > take the lower or top half of a 64-bit value. The reason I limit to 0-31
> > bits is to ensure no page straddling occurs. Futex also does this, check
> > out get_futex_key() in kernel/futex/core.c.
> 
> We can ensure no page straddling by checking the address alignment compared
> to the size of the integer (4 or 8 bytes) also received as input.
> 

Ok.

> > 
> > I would recommend trying out uint64 but give the upper half to
> > user_events ABI and checking if that works for you if you want say 63
> > bits to user space. You'd tell the ABI bit 31, but in user space it
> > would be bit 63.
> 
> That's tricky because Linux supports both big and little endian, which will
> make the ABI tricky to use if we try to be too clever about this.
> 
> Also, just stating a "pointer" and "bits offset" is not enough if we want to
> support 32-bit and 64-bit integer types, because the bit offset does not
> consider endianness.
> 
> I would recommend to get the following information through the user events
> registration:
> 
> - pointer address,
> - size of integer type (4 or 8 bytes),
> - bit offset (currently 31 or 63).
> 
> This way we have all the information we need to set the right bit in both
> little and big endian. We also have all the information we need to validate
> natural alignment, e.g.:
> 
> /* Check alignment */
> if (addr & (type_size - 1))
> 	return -EINVAL;
> /* Check bit offset range */
> if (bit_offset > (type_size * CHAR_BIT) - 1)
> 	return -EINVAL;
> switch (type_size) {
> case 4:
> 	/* Update 32-bit integer */
> 	break;
> #if BITS_PER_LONG >= 64
> case 8:
> 	/* Update 64-bit integer */
> 	break;
> #endif
> default:
> 	return -EINVAL;
> }
> 

Sounds good, I'll update this.

> 
> > 
> > > > 
> > > > The enable address is disabled while the async fault-in occurs. This
> > > > ensures that we don't attempt fault-in more than is necessary. Once the
> > > > page has been faulted in, the address write is attempted again. If the
> > > > page couldn't fault-in, then we wait until the next time the event is
> > > > enabled to prevent any potential infinite loops.
> > > 
> > > So if the page is removed from the page cache between the point where it is
> > > faulted in and the moment the write is re-attempted, that will not trigger
> > > another attempt at paging in the page, am I correct ?
> > > 
> > 
> > If we fault and the fixup user fault fails still with retries, then we
> > give up until the next enablement of that site.
> > 
> > However, if we fault and the fixup user fault with retries works, and
> > then we fault again trying to write, that will retry.
> > 
> > > I would think this is unexpected. I would expect that failing to fault in
> > > the page would stop any further attempts, but simply failing to pin the page
> > > after faulting it in should simply try again.
> > > 
> > 
> > The issue I repro is mmap() register the enable at that location, then
> > unmap(). All the faulting errors just tell you -EFAULT in this case,
> > even though it was maliciously removed. In this case you could get the
> > kernel to infinite loop trying to page it in.
> > 
> > > Thoughts ?
> > > 
> > 
> > We need to have some sanity toward giving up faulting in data that never
> > will make it. The code currently assigns that line to if
> > fixup_user_fault with retries fails, we give up. pin_user_pages_remote
> > will not stop it from being attempted again.
> 
> What are the legitimate cases that can make fixup_user_fault fail ? If there
> are none, and the only case that can make this fail is userspace unmapping
> memory, then we should very well kill the offending process.
> 

I believe only VM_FAULT_ERROR:
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS |	\
			VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON |	\
			VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)

I don't believe in any of the above cases we should retry indefinitely
for.

> An example here is futex fault_in_user_writeable(). When it fails, it
> appears that the futex_wake_op simply returns -EFAULT to the caller.
> 

fault_in_user_writeable() calls fixup_user_fault(), so it should be the
same scenario / failures as above.

> Thanks,
> 
> Mathieu
> 
> > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > 
> > Thanks,
> > -Beau
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com

Thanks,
-Beau