----- On Jul 11, 2020, at 11:54 AM, Christian Brauner christian.brauner@xxxxxxxxxx wrote: > > The registration is a thread-group property I'll assume, right, i.e. all > threads will have rseq TLS or no thread will have it? No, rseq registration is a per-thread property, but it would arguably be a bit weird for a thread-group to observe different registration states for different threads. > Some things I seem to be able to assume (correct me if I'm wrong): > - rseq registration is expected to be done at thread creation time True. > - rseq registration _should_ only be done once, i.e. if a caller detects > that rseq is already registered for a thread, then they could > technically re-register rseq - risking a feature mismatch if doing so > - but they shouldn't re-register but simply assume that someone else > is in control of rseq. If they violate that assumption than you're > hosed anyway. Right. > So it seems as long as callers leave rseq registration alone whenever > they detect that it is already registered then one can assume that rseq > is under consistent control of a single library/program. If that's the > case it should safe to assume that the library will use the same rseq > registration for all threads bounded by the available kernel features or > by the set of features it is aware of. The rseq registration is per-thread. But yes, as soon as one user registers rseq, other users for that thread are expected to piggy-back on that registration. > I proposed that specific scheme because I was under the impression that > you are in need of a mechanism for a caller (at thread creation I > assume) to check what feature set is supported by rseq _without_ > issung a system call. If you were to record the requested flags in > struct rseq or in some other way, then another library which tries to > register rseq for a thread but detects it has already been registered > can look at e.g. whether the reliable cpu feature is around and then > adjust it's behavior accordingly. > Another reason why this seems worthwhile is because of how rseq works in > general. Since it registers a piece of userspace memory which userspace > can trivially access enforcing that userspace issue a syscall to get at > the feature list seems odd when you can just record it in the struct. > But that's a matter of style, I guess. Good points. > > Also, I'm thinking about the case of adding one or two new features that > introduce mutually exclusive behavior for rseq, i.e. if you register > rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and > FEAT2 would lead to incompatible behavior for an aspect or all of rseq. > Even if you had a way to query the kernel for FEAT1 and FEAT2 in the > rseq syscall it still be a problem since a caller wouldn't know at rseq > registration time whether the library registered rseq with FEAT1 or > FEAT2. If you record the behavior somewhere - kernel_flags or whatever - > then the caller could check at registration time "ah, rseq is registered > with this behavior" I need to adjust my behavior. I think what we want here is to be able to extend the feature set, but not "pick and choose" different incompatible features. [...] >> >> One additional thing to keep in mind: the application can itself choose >> to define the __rseq_abi TLS, which AFAIU (please let me know if I am >> wrong) would take precedence over glibc's copy. So extending the > > You mean either that an application could simply choose to ignore that e.g. > glibc has already registered rseq and e.g. unregister it and register > it's own or that it registers it's own rseq before glibc comes into the > play? No quite. I mean that an application binary or a preloaded library is allowed to interpose with glibc and expose a __rseq_abi symbol with a size smaller than glibc's __rseq_abi. The issue is that glibc (or other library responsible for rseq registration) is unaware of that symbol's size. This means that extending __rseq_abi cannot be done by means of additional flags or parameters passed directly from the registration owner to the rseq system call. > I mean, if I interpreted what you're trying to say correctly, I think > one needs to work under the assumption that if someone else has already > registered rseq than it becomes the single source of truth. I think that > basically needs to be the contract. Same way you expect a user of > pthreads to not suddenly go and call clone3() with CLONE_THREAD | > CLONE_VM | [...] and assume that this won't mess with glibc's internal > state. Right. The issue is not about which library owns the registration, but rather making sure everyone agree on the size of __rseq_abi, including interposition scenarios. [...] >> >> For both approaches, we could either pass them as parameters with rseq >> registration, and make rseq registration success conditional on the >> kernel implementing those feature/fix-version, or validate the flag/version >> separately from registration. >> >> If this is done on registration, it means glibc will eventually have to >> handle this. This prevents user libraries with specific needs to query >> whether their features are available. Doing the feature/version validation >> separately from registration allows each user library to make its own >> queries and take advantage of new kernel features before glibc is >> upgraded to be made aware of them. > > Why isn't there a "dual scheme"? I.e. you record the features somewhere > in struct rseq or some other place so userspace can query an already > registered thread for the features it was registered with and have a way > to query the kernel for the supported features through the system call > (See what I suggested above with the feature checking flags.). This discussion got my head into gears over the weekend, and I think I came up with something that could elegantly solve all the "rseq extensibility" problem. More below. [...] > I really think this is not an rseq specific problem. This seems to be a > generic problem any interface has when it e.g. makes use of an extended > struct and e.g. decides to make its own syscalls without going through > the glibc wrappers (If there are any...). That's the reality right now > and will likely always be that way short of us blocking non-libc > syscalls similar to some bsds at which point we need a 1:1 kernel + libc > development. :) That's not going to happen. The problem ranges from > struct statx to struct clone_args and struct open_how and so on. AFAIU the only "special" thing about rseq is that its __rseq_abi is a TLS symbol shared between application/libraries, and interposition is allowed. > > But one question. Musn't the assumption be that all threads in a > thread-group if they have registered rseq then the same > application/library has done that registration? No, __rseq_abi is a TLS, and the registration is per-thread. > And if that's the case > then the application/library doing the registration is what defines the > layout for that thread-group and becomes the one source of truth. > Basically, if an application uses it's own rseq then glibc must be out > of the picture. If that's not part of the contract then it feels to me > that rseq cannot be extended (for now). Indeed, the new scheme I have in mind for rseq extensibility would allow new features to be used between new application/library and kernel even with an older glibc which would contain the rseq registration code, but be unaware of those new features. [...] > > Wouldn't the non-updated application just access fields and methods of > the smaller struct? The way struct extensions work is that we only > extend them after the last member and always correctly 64 bit aligned. > And as long as you only extend the struct at the end wouldn't that be > ok? An application with a non-updated/smaller struct rseq would just > access fields that the larger struct supports, no? The issue is symbol interposition, as discussed above. So here is the idea which emerged through the weekend as I was thinking about your email: * Current technical constraints - struct rseq __rseq_abi can be interposed by preloaded libraries and application, without knowledge from the registration "owner" (typically glibc). * Objectives: - Allow extending the size of struct rseq to add additional features, such as node_id field, signal-disabling fields, and so on. - Allow extending this size without requiring an upgrade of the library performing rseq registration. Simply upgrading the rseq "user" application or library and the kernel should suffice. * Proposed ABI - Reserve a bit in the field (struct rseq *)->flags of the TLS __rseq_abi, named e.g.: RSEQ_CS_FLAG_SIZE = (1U << 3). - A definition wishing to extend struct rseq would be required to initialize __rseq_abi with this bit set in the flags field. - When RSEQ_CS_FLAG_SIZE is set, struct rseq is guaranteed to have two new fields after the flags field: a __u32 user_size and a __u32 kernel_size field. - The user_size field is meant to be initialized to sizeof(struct rseq) by the __rseq_abi definition. In an interposition scenario, a kernel supporting this size feature will know about the size of the symbol by checking both the RSEQ_CS_FLAG_SIZE flag and the user_size field. - On registration, if the __rseq_abi.flags RSEQ_CS_FLAG_SIZE flag is set (and this flag is supported by the kernel), the kernel updates the kernel_size field to min(sizeof(struct rseq), __rseq_abi.user_size), effectively the subset of size supported by both the user-space __rseq_abi definition and by the kernel. - Users wishing to use additional fields beyond __rseq_abi.flags would need to check that __rseq_abi->flags & RSEQ_CS_FLAG_SIZE is true, and that __rseq_abi.kernel_size >= offsetof(struct rseq, feature_field) + sizeof(__rseq_abi.feature_field) This would ensure the fields are only used if the symbol is large enough to hold them *and* the kernel supports them. With this kind of scheme, we could then easily extend struct rseq to cover additional use-cases such as: - adding a new "node_id" field to speed up getcpu(3), user-space locking on NUMA, and possibly memory allocators, - adding fields allowing to quickly disable/enable signals, - adding a "__u64 features" field, which would contain for instance RSEQ_FEATURE_RELIABLE_CPU_ID. I'm not sure why I did not think of this earlier, but it all seems to fit nicely. Any comments ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com