Re: s/Mutex/ceph::mutex/, lockdep

Hi Piotr,

On Wed, Mar 14, 2018 at 2:01 PM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> wrote:
> This was mentioned on perf meeting two (or three?) weeks ago and it turns
> out that it's not Mutex, but RWLock. See:

This is about Mutex. I returned there to experiment with some ideas
that came after taking a look at glibc's implementation details (spinning,
adaptive mutexes, lock elision with Intel TSX). RWLock is also an interesting
point (it does the ref counting by default, as you pointed out) but still, due
to its popularity across the code base, Mutex might give more chances for
a bulky micro-optimization IMO.

> Incrementing or decrementing simple int field isn't going to slow anything
> down by 5% unless it's some absurdly perverse workload, specifically
> tailored to exercise this thing.

In local testing I've observed 22.8, 22.0, 22.0 vs 23.5, 23.3, 23.3 kIOPS
in a highly contended scenario (unbuffered 4 KiB randreads on NVMe;
16 tp_osd_tp threads and my mobile, TSX-readiness-claiming Skylake) [1].

Sure, a non-atomic increment carried out by many threads in parallel isn't
usually an issue. As always, the keyword is "usually". In contrast to its
atomic counterpart, the CPU gets a chance to hide the latency by decoupling
the execution from the L1 data cache via the store queue (an entirely private
structure; other cores have no access to it) and store-to-load forwarding
(the case of while(true) val++).
However, the cache line bounce still happens. AFAIK, when it comes to
a write, the MESI family of cache coherence protocols fundamentally
requires the CPU to be the line owner, which may lead to issuing a Request
For Ownership, and thus invalidations. There are also situations where
the optimization gets knocked out and the full cost is exposed.
I'm thinking about a memory barrier: on x86, LOCK {INC, DEC, CMPXCHG, ...}
acts like one, requiring the store and load queues to be drained. The cost of
an atomic operation is thus variable and depends not only on the number
of cores involved, but also on the other memory accesses around it.
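
To make the LOCK-prefix part concrete, here is a standalone sketch (not code
from our tree; single-threaded on purpose) contrasting a plain increment,
which the store queue and store-to-load forwarding can hide, with an
atomic fetch_add, which compiles to a LOCK-prefixed RMW on x86:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
  constexpr std::uint64_t iters = 100'000'000;

  volatile std::uint64_t plain = 0;      // plain RMW; latency hidden by the store queue
  std::atomic<std::uint64_t> locked{0};  // LOCK INC/XADD; drains the store queue

  auto t0 = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iters; ++i)
    plain = plain + 1;
  auto t1 = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iters; ++i)
    locked.fetch_add(1, std::memory_order_seq_cst);
  auto t2 = std::chrono::steady_clock::now();

  std::cout << "plain:  " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "atomic: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
  return 0;
}

Even without any sharing between cores the LOCK-prefixed loop is several
times slower here, purely because of the ordering it has to enforce.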

And we really do have an atomic in our wrapper: it's the one inside
pthread_mutex_t that is used underneath the futex mechanism. Additionally,
Mutex is split across 2 cache lines. On my machine sizeof(Mutex::name) is 32
while pthread_mutex_t takes 40 bytes.
The assert-related stuff landed far from the atomic, almost at the end
of the structure, potentially being subject to ping-pong on 2 lines.
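
A rough layout sketch of what I mean (field names are illustrative, not the
exact Mutex members): a 32-byte name, a 40-byte pthread_mutex_t and the
debug bookkeeping together exceed 64 bytes, so the futex word and the
assert-related fields can end up on different cache lines:

#include <pthread.h>
#include <cstdio>
#include <string>

struct MutexSketch {
  std::string name;        // 32 bytes with libstdc++'s SSO string
  pthread_mutex_t _m;      // 40 bytes on x86-64 glibc; the futex word lives here
  int nlock = 0;           // debug/assert bookkeeping, touched on every
  pthread_t locked_by{};   // lock/unlock when the checks are enabled
};

int main() {
  MutexSketch m;
  auto base = reinterpret_cast<const char*>(&m);
  std::printf("mutex at byte %td, nlock at byte %td, sizeof=%zu\n",
              reinterpret_cast<const char*>(&m._m) - base,
              reinterpret_cast<const char*>(&m.nlock) - base,
              sizeof(MutexSketch));
  return 0;
}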

A bit of a side note, but really interesting, is the impact on lock elision
with TSX. Any conflicting store (to the same cache line) during a memory
transaction translates into an abort, possibly retries and a fall-back to
the non-elided atomic RMW (and in turn even a syscall). Glibc has a facility
to convert plain mutexes into HW-accelerated ones depending on the distro
vendor's choice. (I will elaborate more on the elision in a separate post.)
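
To illustrate the abort mechanics (a hand-rolled sketch, not the glibc code
path; needs -mrtm and TSX-capable hardware): the transaction only reads the
lock word, so two elided critical sections can run concurrently -- unless one
of them writes a cache line the other touches, e.g. a per-mutex debug
counter, in which case the conflicting store aborts the transaction and we
fall back to the real, LOCK-prefixed acquisition.

#include <immintrin.h>
#include <atomic>
#include <cstdio>

std::atomic<int> lock_word{0};   // 0 = free; hypothetical futex-style word
int debug_counter = 0;           // stand-in for the assert bookkeeping

bool elided_lock_and_bump() {
  unsigned status = _xbegin();
  if (status == _XBEGIN_STARTED) {
    if (lock_word.load(std::memory_order_relaxed) != 0)
      _xabort(0xff);             // lock really held: abort, take the slow path
    ++debug_counter;             // conflicting store: another core writing this
                                 // line aborts both transactions
    _xend();                     // commit: the critical section appears atomic
    return true;
  }
  return false;                  // aborted: caller falls back to the real lock
}

int main() {
  std::printf("elided: %d, counter: %d\n", elided_lock_and_bump(), debug_counter);
  return 0;
}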

That's the possibility I see. Though, the amount of complexity involved
is so high that it gives me a headache. To be honest, I would love to not
care about it at all and just go with the shared-nothing approach. SeaStar!

> There's nrlock and nwlock, both atomics and his findings make much more
> sense in that context. But then, there's a commit
> https://github.com/ceph/ceph/commit/8feb27d821c76d00f76064fa04559e493be7ac32
> which lets user disable usage of *both* nrlock and nwlock.

Looks interesting. What about making it the developer's choice at
compile time? I haven't poked at RWLock, but during the Mutex
pacification there was no Mutex usage that overrides the default-
valued parameters in a way that is truly run-time dependent.
This also gives a chance to differentiate policies based on build
type (production, development). As somebody already pointed out,
C++17 improves things with if constexpr. A sketch follows below.
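
Here is what such a compile-time choice could look like (illustrative only,
not a proposed API; the names are made up): the tracking policy becomes a
template parameter fixed per build type, and if constexpr removes the
bookkeeping entirely from release builds instead of testing a run-time flag:

#include <cassert>
#include <mutex>
#include <thread>

template <bool TrackLocking>   // e.g. true for debug builds, false for release
class BasicMutex {
  std::mutex m;
  std::thread::id owner{};     // only maintained when TrackLocking is true

public:
  void lock() {
    m.lock();
    if constexpr (TrackLocking)
      owner = std::this_thread::get_id();
  }
  void unlock() {
    if constexpr (TrackLocking) {
      assert(owner == std::this_thread::get_id());
      owner = {};
    }
    m.unlock();
  }
};

#ifdef NDEBUG
using SketchMutex = BasicMutex<false>;   // release: no extra stores near the futex word
#else
using SketchMutex = BasicMutex<true>;    // development: keep the locked-by-me checks
#endif

int main() {
  SketchMutex m;
  m.lock();
  m.unlock();
  return 0;
}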

Regards,
Radosław Zarzyński

[1] https://gist.github.com/rzarzynski/f9387b308633d298f48454e97b2b152f


> Instead of going with full re-work that might or might not work, maybe let's
> add global option to disable/enable tracking on user demand, not just
> developer?
>
> Still, I don't mind getting rid of our wrapper in favor of std::mutex.
>
> --
> Piotr Dałek
> piotr.dalek@xxxxxxxxxxxx
> https://www.ovh.com/us/
>