Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Wed, 2 Nov 2022 09:46:31 -0400

On 2022-10-31 12:53, Beau Belgrave wrote:

On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote:

On 2022-10-28 18:17, Beau Belgrave wrote:

On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:

On 2022-10-27 18:40, Beau Belgrave wrote:

[...]

NOTE:
User programs that wish to have the enable bit shared across forks
either need to use a MAP_SHARED allocated address or register a new
address and file descriptor. If MAP_SHARED cannot be used or new
registrations cannot be done, then it's allowable to use MAP_PRIVATE
as long as the forked children never update the page themselves. Once
the page has been updated, the page from the parent will be copied over
to the child. This new copy-on-write page will not receive updates from
the kernel until another registration has been performed with this new
address.
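
The fork behaviour described in this note can be reproduced from plain
userspace, without the user_events ABI. The sketch below is only an
illustration under stated assumptions: the helper name
child_sees_parent_update() is hypothetical, and the parent's store to the
page stands in for the kernel updating the enable bit through the
registering process's mapping. The child first writes to the page itself
(which triggers COW for a private mapping), then checks whether it still
observes a later update made on the parent side.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of any ABI.
 * flags is MAP_SHARED or MAP_PRIVATE.
 * Returns 1 if the forked child still observes an update made through the
 * parent's mapping after the child wrote to the page, 0 if it does not,
 * and -1 on error.
 */
int child_sees_parent_update(int flags)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int to_child[2], to_parent[2], status, ret = -1;
	volatile unsigned char *page;
	char token = 0;
	pid_t pid;

	page = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    flags | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	if (pipe(to_child) < 0 || pipe(to_parent) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: dirty the page. With MAP_PRIVATE this triggers COW. */
		page[1] = 1;
		if (write(to_parent[1], &token, 1) != 1)
			_exit(2);
		/* Wait until the parent side has set the "enable" byte. */
		if (read(to_child[0], &token, 1) != 1)
			_exit(2);
		_exit(page[0] == 1 ? 1 : 0);
	}

	/* Parent: wait for the child's write, then emulate the kernel update. */
	if (read(to_parent[0], &token, 1) == 1) {
		page[0] = 1;
		if (write(to_child[1], &token, 1) == 1 &&
		    waitpid(pid, &status, 0) == pid &&
		    WIFEXITED(status) && WEXITSTATUS(status) <= 1)
			ret = WEXITSTATUS(status);
	}

	close(to_child[0]);
	close(to_child[1]);
	close(to_parent[0]);
	close(to_parent[1]);
	munmap((void *)page, len);
	return ret;
}
```

Called with MAP_SHARED the helper should return 1, and with MAP_PRIVATE it
should return 0, since the child is then reading its own copy of the page.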

This seems rather odd. I would expect that if a parent process registers
some instrumentation using private mappings for enabled state through the
user events ioctl, and then forks, the child process would seamlessly be
traced by the user events ABI while being able to also change the enabled
state from the userspace tracer libraries (which would trigger COW).
Requiring the child to re-register to user events is rather odd.

It's the COW that is the problem, see below.

What is preventing us from tracing the child without re-registration in this
scenario ?

Largely knowing when the COW occurs on a specific page. We don't make
the mappings, so I'm unsure if we can ask to be notified easily during
these times or not. If we could, that would solve this. I'm glad you are
thinking about this. The note here was exactly to trigger this
discussion :)

I believe this is the same as a Futex, I'll take another look at that
code to see if they've come up with anything regarding this.

Any ideas?

Based on your description of the symptoms, AFAIU, upon registration of a
given user event associated with a mm_struct, the user events ioctl appears
to translate the virtual address into a page pointer immediately, and keeps
track of that page afterwards. This means it loses track of the page when
COW occurs.

No, we keep the memory descriptor and virtual address so we can properly
resolve to the page per-process.

Why not keep track of the registered virtual address and mm_struct
associated with the event rather than the page ? Whenever a state change is
needed, the virtual-address-to-page translation will be performed again. If
it follows a COW, it will get the new copied page. If it happens that no COW
was done, it should map to the original page. If the mapping is shared, the
kernel would update that shared page. If the mapping is private, then the
kernel would COW the page before updating it.

Thoughts ?

I think you are forgetting about page table entries. My understanding is
the process will have the VMAs copied on fork, but the page table
entries will be marked read-only. Then when the write access occurs, the
COW is created (since the PTE says readonly, but the VMA says writable).
However, that COW page is now only mapped within that forked process
page table.

This requires tracking the child memory descriptors in addition to the
parent. The most straightforward way I see this happening is requiring
user side to mmap the user_event_data fd that is used for write. This
way when fork occurs in dup_mm() / dup_mmap() that mmap'd
user_event_data will get open() / close() called per-fork. I could then
copy the enablers from the parent but with the child's memory descriptor
to allow proper lookup.

This is like fork before COW, it's a bummer I cannot see a way to do
this per-page. Doing the above would work, but it requires copying all
the enablers, not just the one that changed after the fork.

This brings an overall design concern I have with user-events: AFAIU,
the lifetime of the user event registration appears to be linked to the
lifetime of a file descriptor.

What happens when that file descriptor is duplicated and sent over to
another process through unix socket ancillary data (SCM_RIGHTS) ? Does it
mean that the kernel has a handle on the wrong process to update the
"enabled" state ?

Also, what happens on an execve system call if the file descriptor
representing the user event is not marked as close-on-exec ? Does it
mean the kernel can corrupt user-space memory of the after-exec loaded
binary when it attempts to update the "enabled" state ?
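
For reference, this is how userspace marks a descriptor close-on-exec with
fcntl(2). It is a minimal sketch, and the helper names set_cloexec() and
is_cloexec() are hypothetical. It only shows the flag the question refers
to; it makes no claim about how user_events behaves across execve, and it
does not change the underlying lifetime concern.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark fd close-on-exec. Returns 0 or -1 on error. */
int set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Hypothetical helper: 1 if fd is close-on-exec, 0 if not, -1 on error. */
int is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor opened with O_CLOEXEC in the open(2) flags gets the same
property atomically, which avoids a window between open and fcntl in
multithreaded programs.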

If I get this right, I suspect we might want to move the lifetime of the
user event registration to the memory space (mm_struct).

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com