On 2022-10-10 09:04, Florian Weimer wrote:
* Mathieu Desnoyers:
Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
The NUMA node ID field allows implementing a faster getcpu(2) in libc.
The virtual cpu id allows ideal scaling (down or up) of user-space
per-cpu data structures. The virtual cpu ids allocated within a memory
space are tracked by the scheduler, which accounts for the number of
concurrently running threads, and thus implicitly for the cpu affinity
and cpusets applying to those threads, and for the number of logical
cores on the system.
Do you have some code that shows how the userspace application handshake
is supposed to work with the existing three __rseq_* symbols? Maybe I'm
missing something.
see https://lore.kernel.org/lkml/20220922105941.237830-5-mathieu.desnoyers@xxxxxxxxxxxx/
+static
+unsigned int get_rseq_feature_size(void)
+{
+ unsigned long auxv_rseq_feature_size, auxv_rseq_align;
+
+ auxv_rseq_align = getauxval(AT_RSEQ_ALIGN);
+ assert(!auxv_rseq_align || auxv_rseq_align <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+
+ auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
+ assert(!auxv_rseq_feature_size || auxv_rseq_feature_size <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+ if (auxv_rseq_feature_size)
+ return auxv_rseq_feature_size;
+ else
+ return ORIG_RSEQ_FEATURE_SIZE;
+}
then in rseq_init():
+ rseq_feature_size = get_rseq_feature_size();
+ if (rseq_feature_size == ORIG_RSEQ_FEATURE_SIZE)
+ rseq_size = ORIG_RSEQ_ALLOC_SIZE;
+ else
+ rseq_size = RSEQ_THREAD_AREA_ALLOC_SIZE;
Then using it for e.g. node_id:
https://lore.kernel.org/lkml/20220922105941.237830-6-mathieu.desnoyers@xxxxxxxxxxxx/
+#ifndef rseq_sizeof_field
+#define rseq_sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+#ifndef rseq_offsetofend
+#define rseq_offsetofend(TYPE, MEMBER) \
+ (offsetof(TYPE, MEMBER) + rseq_sizeof_field(TYPE, MEMBER))
+#endif
+static inline bool rseq_node_id_available(void)
+{
+ return (int) rseq_feature_size >= rseq_offsetofend(struct rseq_abi, node_id);
+}
+
+/*
+ * Current NUMA node number.
+ */
+static inline uint32_t rseq_current_node_id(void)
+{
+ assert(rseq_node_id_available());
+ return RSEQ_ACCESS_ONCE(rseq_get_abi()->node_id);
+}
From an application perspective, it would be best to add 8 more shared
bytes in use, to push the new feature size over 32. This would be
clearly visible in __rseq_size, helping applications a lot.
[ I guess you meant 12 bytes ]
The foolproof approach here would be to skip the 12 bytes of padding
currently at the end of struct rseq.
Maybe this is something we should do in order to make sure the userspace
check is regular for all fields.
Alternatively, we could sacrifice a bit to indicate that this round
of extensions is present. But we'll need another bit to indicate that
the last remaining 4 bytes are in use, for consistency. Or come up with
something to put there today. The TID seems like an obvious choice.
Whatever we add into those bits would need to be "special" and use
something like a flag check to validate whether the field is populated
or not. Perhaps keeping things simpler and skipping those 12 bytes
entirely is preferable.
If we want to go the 8-more-bytes route, TID and PID should be
uncontroversial? The PID cache is clearly something that userspace
likes, not just as a defeat device for the old BYTE benchmark.
I agree that having the PID and TID there might be relevant, but I
would prefer that all fields use a check that is regular from the
point of view of userspace. This minimizes the risk of user errors.
Thoughts ?
Thanks,
Mathieu
Thanks,
Florian
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com