On Fri, Oct 22, 2021 at 10:38:11PM +0900, Masami Hiramatsu wrote: > On Mon, 18 Oct 2021 16:09:57 -0700 > Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote: > > of the ioctl indicates the byte in the mmap to use for status and the index > > to use for writing. Both are 16-bits in size and are accessible via the > > INDEX_STATUS and INDEX_WRITE macros within user_events.h. > > Hmm, if we uses the ioctl(), the return value must be int because it can > return -1 and kernel return value is stored into the errno if there is > any error. I think you can introduce a data structure to return the > id and offset. (Or make id 15-bits :)) > Yeah, I think I will introduce a struct to aid with this and some others things like flags / print_fmt being specified (See below for those). > > int data_fd = open("user_events_data", O_RDWR); > > long result = ioctl(data_fd, DIAG_IOCSREG, "test"); > > int status_id = INDEX_STATUS(result); > > int write_id = INDEX_WRITE(result); > > > > struct iovec io[2]; > > io[0].iov_base = &write_id; > > io[0].iov_len = sizeof(write_id); > > io[1].iov_base = payload; > > io[1].iov_len = sizeof(payload); > > > > if (page_data[status_id]) > > writev(data_fd, io, 2); > > OK, so you switch to the 1:n model. Using writev() is a good idea. > (BTW, what happen if I use write()?) > The first 4 bytes are used regardless if writev() or write() is used. Both turn into an iterator to make the code simpler instead of handling each case differently. > > > > Users can describe the trace event format via the following format: > > name [field1;field2] > > Does this mean it always need 2 fields? > No, they need 0-N fields, perhaps I should leave off the second field in the description. > > > > Each field has the following format: > > type name size > > Why do you need the "size"? Since "type" must have the size information, > it is not needed. (or, user can define 1 byte "int" type...) > The idea is to allow user mode processes to describe their own data types if required. For that we need offset or size to be passed. If we require all user mode to only describe in types the kernel understands then I can remove the size field. > > > > Example for char array with a size of 20 named msg: > > echo 'u:detailed char[20] msg 20' >> dynamic_events > > cat dynamic_events > > u:detailed char[20] msg > > > > Data offsets are based on the data written out via write() and will be > > updated to reflect the correct offset in the trace_event fields. __data_loc > > types must be aware of the size of trace_entry/common properties to ensure > > proper decoding. > > As I said that the you have to add __rel_loc type (yes, you can do it) > so that the user don't need to consider the data location. That new type > decodeer code may need to be implemented afterwards in user-space tools > but anyway, that can be postponed. > Got it, does that mean I should just reference it here as a new type? I'm not sure what it means to add __rel_loc type since it appears anyone can pass any type down to trace_event and it will accept it. > > --- > > include/uapi/linux/user_events.h | 57 ++ > > kernel/trace/Kconfig | 15 + > > kernel/trace/Makefile | 1 + > > kernel/trace/trace_events_user.c | 1040 ++++++++++++++++++++++++++++++ > > 4 files changed, 1113 insertions(+) > > create mode 100644 include/uapi/linux/user_events.h > > create mode 100644 kernel/trace/trace_events_user.c > > BTW, if you can split this patch into several patches, it is easier to us > to review it. E.g. basic support for ftrace, add perf support, add bpf support, > documentation, optimize performance etc. > Sorry, are you asking for several isolated patches? Or a single patch with multiple commits to the same file? Typically after addition I am running a set of tests and benchmarks against the code, so having full functioning code is required to ensure I haven't broken anything. We are actively using these patches for various projects, which require ftrace, perf and eBPF to be functional. > > + > > +/* > > + * Parses the values of a field within the description > > + * Format: type name size [future] > > + */ > > As I said that the size is non-sense. And this should finally unified > with synthetic event field definition. > I have this for custom user types the kernel may not know the size of. Another reason is 32-bit process on a 64-bit kernel. The size will always be the kernel size of that type. If we don't care about either, then I can remove this. For custom types we are fine with using __rel_loc instead. > > +static int user_event_parse_field(char *field, struct user_event *user, > > + u32 *offset) > > +{ > > + char *part, *type, *name; > > + u32 size = 0, depth = 0, saved_offset = *offset; > > + > > + field = skip_spaces(field); > > + > > + if (strstr(field, "unsigned ") == field) { > > + type = field; > > + field = strpbrk(field + sizeof("unsigned"), " "); > > + > > + if (field == NULL) > > + return -EINVAL; > > + > > + *field++ = 0; > > + depth++; > > + } else if (strstr(field, "print_fmt ") == field) { > > + user->call.print_fmt = field + sizeof("print_fmt"); > > + return 0; > > What is this ? > We have several events that describe the print format to use instead of using the default, like we allow in the kernel. This is currently how it is passed, it could be described in the new struct to the reg ioctl instead. > > + } else if (strstr(field, "flag ") == field) { > > + field += sizeof("flag"); > > + > > + if (!strcmp(field, "bpf_iter")) > > + user->flags |= FLAG_BPF_ITER; > > + > > What is this flag? > We want to enable certain sensitive events the ability to mark that there should never be a buffer copy. When FLAG_BPF_ITER is used the raw iovecs are exposed out to eBPF instead of any sort of copy to reduce latency. We run user_events in some highly performant code and want to monitor things with the least amount of overhead possible. > > + name = kstrdup(raw_command, GFP_KERNEL); > > + > > + if (!name) > > + return -ENOMEM; > > + > > + mutex_lock(&event_mutex); > > + ret = user_event_parse_cmd(name, &user); > > I recommend you to split "parse" and "register" function, and > locks event_mutex only in the "register" function, since parsing > doesn't need to hold other events. > Sure > > +static bool user_event_match(const char *system, const char *event, > > + int argc, const char **argv, struct dyn_event *ev) > > +{ > > + struct user_event *user = container_of(ev, struct user_event, devent); > > + > > + return strcmp(EVENT_NAME(user), event) == 0 && > > + (!system || strcmp(system, USER_EVENTS_SYSTEM) == 0); > > For the minimal implementation, this is OK. > > Note that it is better to check the rest of arguments with the event > parameters. E.g. if you have "myuevent int var", user cant try to remove > "-:myuevent char var". > Ok, the reason we do this is because the events are going to be versioned. They might change their arguments and having a way to clear without knowing the exact arguments is not a bad thing in our cases. No one can remove an event while it's being used and the name cannot conflict since all user events go to the user_events sub-system and cannot be overriden. > > +/* > > + * Validates the user payload and writes via iterator. > > + */ > > +static ssize_t user_events_write_core(struct file *file, struct iov_iter *i) > > +{ > > + struct user_event_refs *refs; > > + struct user_event *user = NULL; > > + struct tracepoint *tp; > > + ssize_t ret = i->count; > > + int idx; > > + > > + preempt_disable(); > > Why do you disable preemption here? > My understanding was that preemption is to be disabled when using rcu protected by sched, which tracepoints and the refs are. We also want the correct processor to log the event (as near as we can). Did I misunderstand rcu_dereference_sched_check()? I may be off in the weeds. I also ran with CONFIG_PROVE_RCU and didn't see anything show up in dmesg. > > + > > + if (unlikely(!(cpu_online(raw_smp_processor_id())))) > > + goto out; > > This seems meaningless because this function is called in the > user process context. If the cpu is going to offline, the process > itself will be migrated before reaching here. > Sure, everywhere else had this and wasn't sure if this was possible to happen or not. I'll remove it. > > + > > + if (unlikely(copy_from_iter(&idx, sizeof(idx), i) != sizeof(idx))) { > > + ret = -EFAULT; > > + goto out; > > + } > > Isn't this copy_from_iter() sleepable function? > Yes, if we need to fault in data we want the data to come in. This is in the user process context of write() and writev() which I believe is the expected target for copy_from_iter(). During the probe copies we are not holding a lock or have interrupts disabled, under what circumstances would a page fault cause an issue in these paths? What is recommended here if we want to have page faults be allowed (Hoping they are rare)? > > + /* > > + * Probes advance the iterator so we > > + * need to have a copy for each probe. > > + */ > > + copy = *i; > > + > > + probe_func = probe_func_ptr->func; > > + tpdata = probe_func_ptr->data; > > + probe_func(user, ©, tpdata); > > You seems to try to copy in from user space in each probe func, but > please copy it here to the temporary buffer and pass it to the > each probe function. Such performacne optimization can postpone. > Start with simple implementation. > Yes, this avoids double copying of the data in the normal paths. Moving to a temp buffer only really changes 1 line in the probe functions (copy_from_iter to copy_from_user). If I were to create a temp buffer for simplicity I guess I would have to kmalloc on each call or move to a per-cpu buffer or use stack memory and limit how much data can be copied. These would likely add more lines than they remove from the code. > > + refs = rcu_dereference_sched(file->private_data); > > + > > + if (refs) { > > + count = refs->count; > > + > > + for (i = 0; i < count; ++i) > > + if (refs->events[i] == user) > > + return i; > > + } > > + > > + size = sizeof(*refs) + (sizeof(struct user_event *) * (count + 1)); > > + > > + new_refs = kzalloc(size, GFP_KERNEL); > > + > > + if (!new_refs) > > + return -ENOMEM; > > + > > + new_refs->count = count + 1; > > + > > + for (i = 0; i < count; ++i) > > + new_refs->events[i] = refs->events[i]; > > + > > + new_refs->events[i] = user; > > + > > + atomic_inc(&user->refs); > > + > > + rcu_assign_pointer(file->private_data, new_refs); > > + > > + if (refs) > > + kfree_rcu(refs, rcu); > > It seems you introduced non-needed complexity with the events lists. > Why not just adding new events on the current refs? > Please make this simple at the first step. Forget about performance, > but just start with simple code. > Callers are free to call the reg ioctl while a write is occurring on the same file. This can cause the kernel to access garbage. Either RCU or locking is required in these paths or the ABI needs a way to lock the file into a mode where no more events can be added (IE: Change to write mode). We want RCU because we have places where multiple processors will be emitting data out to the same file. We don't want the data/cache contention across processors to slow things down, which we have seen in other telemetry data. > > + case DIAG_IOCSREG: > > + /* Register/lookup on behalf of user process */ > > + name = strndup_user(ubuf, MAX_EVENT_DESC); > > + > > + if (IS_ERR(name)) { > > + ret = PTR_ERR(name); > > + goto out; > > + } > > + > > + mutex_lock(&event_mutex); > > + > > + ret = user_event_parse_cmd(name, &user); > > + > > + if (ret < 0) > > + goto reg_out; > > + > > + preempt_disable(); > > Why do you need to disable preemption here? It is enough to > use a mutex. > Because user_events_ref_add use rcu protected by sched. It appears that preemption is supposed to be disabled when using that from my read of the code. Is this not correct or not required? > > +/* > > + * Handles the final close of the file from user mode. > > + */ > > +static int user_events_release(struct inode *node, struct file *file) > > +{ > > + struct user_event_refs *refs; > > + struct user_event *user; > > + int i; > > + > > + preempt_disable(); > > I'm not sure why do you stick on using the RCU here. > Since the user-events are only used in the user-context, > which is not a critical section. > For example, if you use kprobes, it may be called from anywhere > in the kernel including critical section, and the handler will > run in the interrupt handler. Thus I should use RCU. But here, > the user events are written by syscalls, no critical section > at all... > > Can you try to remove RCU from this code? > If you are talking about this specific call site (user_events_release) then yeah I believe I can remove it. If you talking in general, we use RCU to protect against some scenarios where events are getting registered while previous events are getting written out. We chose RCU over readwrite locking due to potential for shared cache contention related slowdowns. > Thank you, > -- > Masami Hiramatsu <mhiramat@xxxxxxxxxx> Thank you for taking the time to review, appreciate it! Thanks, -Beau
![]() |