Re: [RFC PATCH for 4.17 10/21] cpu_opv: Provide cpu_opv system call (v6)

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Wed, 28 Mar 2018 13:54:41 -0400 (EDT)

----- On Mar 28, 2018, at 11:22 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Tue, Mar 27, 2018 at 12:05:31PM -0400, Mathieu Desnoyers wrote:
> 
>> 1) Allow algorithms to perform per-cpu data migration without relying on
>>    sched_setaffinity()
>> 
>> The use-cases are migrating memory between per-cpu memory free-lists, or
>> stealing tasks from other per-cpu work queues: each requires that
>> accesses to remote per-cpu data structures are performed.
> 
> I think that one completely reduces to the per-cpu (spin)lock case,
> right? Because, as per the below, your logging case (8) can 'easily' be
> done without the cpu_opv monstrosity.
> 
> And if you can construct a per-cpu lock, that can be used to construct
> arbitrary logic.

The per-cpu spinlock does not have the same performance characteristics
as lock-free alternatives for various operations. For instance, an rseq
compare-and-store is faster than an rseq spinlock for linked-list
operations.
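
To illustrate why the lock-free variant is cheaper, here is a minimal
sketch of a list push done as a single compare-and-store on the list
head. It is not the rseq implementation: a C11 atomic compare-exchange
is used as a portable stand-in for the rseq critical section, and the
names (struct node, struct percpu_list, list_push) are hypothetical.
A spinlock-based push would need a lock acquire, the same pointer
updates, and a lock release around them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list types, for illustration only. */
struct node {
	struct node *next;
	int value;
};

struct percpu_list {
	_Atomic(struct node *) head;
};

/*
 * Push 'n' at the head of 'l' with one compare-and-store.
 * Assumption: the C11 compare-exchange stands in for an rseq
 * compare-and-store, which would use a plain compare and a plain
 * store and rely on the kernel restarting the sequence on
 * preemption or migration.
 */
static void list_push(struct percpu_list *l, struct node *n)
{
	struct node *old;

	assert(l != NULL && n != NULL);
	old = atomic_load_explicit(&l->head, memory_order_relaxed);
	do {
		/* On failure, 'old' is reloaded with the current head. */
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&l->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
}
```

The retry loop plays the role of the rseq abort handler in this
stand-in: if the head changed under us, we recompute n->next and try
again.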

> 
> And the difficult case for the per-cpu lock is the remote acquire; all
> the other cases are (relatively) trivial.
> 
> I've not really managed to get anything sensible to work, I've tried
> several variations of split lock, but you invariably end up with
> barriers in the fast (local) path, which sucks.
> 
> But I feel this should be solvable without cpu_opv. As in, I really hate
> that thing ;-)

I have not developed cpu_opv out of any kind of love for that solution.
I just realized that it did solve all my issues after failing for quite
some time to implement acceptable solutions for the remote access
problem, and for ensuring progress of single-stepping with current
debuggers that don't know about the rseq_table section.

> 
>> 8) Allow libraries with multi-part algorithms to work on same per-cpu
>>    data without affecting the allowed cpu mask
>> 
>> The lttng-ust tracer presents an interesting use-case for per-cpu
>> buffers: the algorithm needs to update a "reserve" counter, serialize
>> data into the buffer, and then update a "commit" counter _on the same
>> per-cpu buffer_. Using rseq for both reserve and commit can bring
>> significant performance benefits.
>> 
>> Clearly, if rseq reserve fails, the algorithm can retry on a different
>> per-cpu buffer. However, it's not that easy for the commit. It needs to
>> be performed on the same per-cpu buffer as the reserve.
>> 
>> The cpu_opv system call solves that problem by receiving the cpu number
>> on which the operation needs to be performed as argument. It can push
>> the task to the right CPU if needed, and perform the operations there
>> with preemption disabled.
>> 
>> Changing the allowed cpu mask for the current thread is not an
>> acceptable alternative for a tracing library, because the application
>> being traced does not expect that mask to be changed by libraries.
> 
> We talked about this use-case, and it can be solved without cpu_opv if
> you keep a dual commit counter, one local and one (atomic) remote.

Right.

> 
> We retain the cpu_id from the first rseq, and the second part will, when
> it (unlikely) finds it runs remotely, do an atomic increment on the
> remote counter. The consumer of the counter will then have to sum both
> the local and remote counter parts.

Yes, I did a prototype of this specific case with split-counters a while
ago. However, if we need cpu_opv as fallback for other reasons (e.g. remote
accesses), then the split-counters are not needed, and there is no need to
change the layout of user-space data to accommodate the extra per-cpu
counter.
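
For reference, the split-counter scheme discussed above could look
roughly like the sketch below. This is not the prototype code: the
names (struct percpu_buf, commit, commit_total, NR_CPUS) are
hypothetical, the current cpu is passed as a plain argument instead of
being read within an rseq critical section, and the local update is
modelled as a non-atomic load/store pair, which is only safe if rseq
restarts it on preemption or migration. It does show the extra per-cpu
counter that the user-space data layout would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4	/* hypothetical fixed cpu count for this sketch */

/* Hypothetical per-cpu buffer state with a split commit counter. */
struct percpu_buf {
	_Atomic uint64_t commit_local;	/* written only on the owning cpu */
	_Atomic uint64_t commit_remote;	/* written from any other cpu */
};

static struct percpu_buf bufs[NR_CPUS];

/*
 * Commit 'len' bytes to the buffer reserved on 'reserve_cpu'.
 * 'cur_cpu' stands in for the cpu_id observed by the second rseq.
 */
static void commit(int reserve_cpu, int cur_cpu, uint64_t len)
{
	struct percpu_buf *b;
	uint64_t v;

	assert(reserve_cpu >= 0 && reserve_cpu < NR_CPUS);
	b = &bufs[reserve_cpu];

	if (cur_cpu == reserve_cpu) {
		/* Fast path: plain add, assumed to be rseq-protected. */
		v = atomic_load_explicit(&b->commit_local,
					 memory_order_relaxed);
		atomic_store_explicit(&b->commit_local, v + len,
				      memory_order_release);
	} else {
		/* Unlikely path: migrated since reserve, atomic add. */
		atomic_fetch_add_explicit(&b->commit_remote, len,
					  memory_order_release);
	}
}

/* The consumer has to sum both parts of the counter. */
static uint64_t commit_total(int cpu)
{
	struct percpu_buf *b;

	assert(cpu >= 0 && cpu < NR_CPUS);
	b = &bufs[cpu];
	return atomic_load_explicit(&b->commit_local, memory_order_acquire) +
	       atomic_load_explicit(&b->commit_remote, memory_order_acquire);
}
```

With cpu_opv as the fallback, the commit_remote field and the
summation in the consumer both go away.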

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com