On Wed, Apr 21, 2021 at 1:00 AM Kefu Chai <kchai@xxxxxxxxxx> wrote:
>
> hi folks,
>
> while looking at https://github.com/ceph/ceph/pull/32422, i think a probably safer approach is to make the monitor more efficient. currently, the monitor is essentially a single-threaded application. quite a few critical code paths of the monitor are protected by Monitor::lock, among other things:
>
> - the periodic tasks performed by tick(), which is in turn called by SafeTimer. the "safety" of the SafeTimer is ensured by Monitor::lock.
> - Monitor::_ms_dispatch is also called with Monitor::lock acquired. in the case of https://github.com/ceph/ceph/pull/32422, one or more kcephfs clients can even slow down the whole cluster by asking for the latest osdmap while holding an ancient one, if the cluster is able to rebalance/recover quickly and accumulate lots of osdmaps in a short time.
>
> a typical scary use case is:
>
> 1. an all-flash cluster just completed a rebalance/recovery. the rebalance finished quickly, leaving the cluster with a ton of osdmaps before some of the clients had a chance to pick up the updated maps.
> 2. (kcephfs) clients with ancient osdmaps in hand wake up at random times, and they all want the latest osdmap!
> 3. the monitors are occupied with loading the maps from rocksdb and encoding them in very large batches (when discussing this with the author of https://github.com/ceph/ceph/pull/32422, he mentioned that the total size of the incremental osdmaps could be up to 200~300 MiB).
> 4. the cluster is basically unresponsive.
>
> so, does it sound like the right way to improve the monitor's performance under this CPU-intensive workload: dissect the data dependencies in the monitor and explore the possibility of making it more multi-threaded?

I know I'm a bit behind on this thread, but I wanted to chime in briefly.
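To make the coarse-locking problem Kefu describes concrete, here is a minimal, hypothetical model of the pattern (the names MiniMonitor, handle_osdmap_request, etc. are invented for illustration; the real Monitor, SafeTimer, and _ms_dispatch code are far more involved). The point is just that timer callbacks and message dispatch contend on one mutex, so a single expensive request stalls everything behind it:

```cpp
// Simplified sketch of the monitor's coarse locking; hypothetical names,
// not the actual Ceph classes.
#include <cassert>
#include <functional>
#include <mutex>

struct MiniMonitor {
  std::mutex lock;        // stands in for Monitor::lock
  int ticks = 0;
  int maps_encoded = 0;

  // SafeTimer-style callback: its "safety" comes from taking the same
  // big lock, so it cannot run while a dispatch is in progress.
  void tick() {
    std::lock_guard<std::mutex> l(lock);
    ++ticks;
  }

  // _ms_dispatch-style entry point: also runs under the one big lock,
  // so an expensive handler (e.g. encoding hundreds of MiB of
  // incremental osdmaps) blocks ticks and all other messages.
  void ms_dispatch(const std::function<void()>& handler) {
    std::lock_guard<std::mutex> l(lock);
    handler();
  }

  void handle_osdmap_request(int first_epoch, int latest_epoch) {
    // While this loop runs under the lock, nothing else makes progress.
    for (int e = first_epoch; e <= latest_epoch; ++e)
      ++maps_encoded;  // stand-in for load-from-rocksdb + encode
  }
};
```

A client far behind on epochs drives one long ms_dispatch call, and every tick() queued behind it waits on the same mutex; that is the "cluster is basically unresponsive" failure mode in miniature.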
The ideas around lighter-weight OSDMaps for clients have some merit and I look forward to seeing where that goes, but I'm scared when multi-threaded monitors come up. The basic problem is that the monitors, by their nature as a paxos system, need to linearize operations into a single data structure, and that data structure must be exactly ordered whenever updates happen. The way we handle operations with preprocess_*() and prepare_*() makes any attempt at multithreading those updates really, really scary: if, for instance, we are reading from an out-of-date version in preprocess, we might inadvertently reject an update which would apply to the real pending value.

There are options around doing things like RCU models for handling subscriptions that would solve a lot of the problems we see, and those would be worth exploring. But any effort that tries to do fine-grained locking and identify which locks are needed for which commands is just really scary, and I'd like to avoid going down that road.
-Greg

>
> thoughts?
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
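The stale-read hazard Greg describes in preprocess can be sketched as follows (MiniPaxosService and its methods are invented names, not the real PaxosService API; this is a toy model under the assumption that preprocess validates against committed state while a separate proposal is already pending):

```cpp
// Hypothetical illustration of the preprocess_*/prepare_*() hazard:
// preprocess validates a request against the last *committed* state,
// ignoring a pending proposal, and so can wrongly treat a real update
// as a no-op. Names invented; not the actual Ceph PaxosService code.
#include <cassert>
#include <map>
#include <string>

struct MiniPaxosService {
  std::map<std::string, int> committed;  // last committed state
  std::map<std::string, int> pending;    // proposal being built

  // preprocess: answer the request without a proposal if possible.
  // Reading only `committed` is the stale-read bug when run
  // concurrently with a pending update.
  bool preprocess_set(const std::string& k, int v) {
    auto it = committed.find(k);
    if (it != committed.end() && it->second == v)
      return true;   // "value already set, nothing to do" -- possibly wrong!
    return false;    // fall through to prepare_set()
  }

  bool prepare_set(const std::string& k, int v) {
    pending[k] = v;
    return true;
  }

  void commit() { committed = pending; }
};
```

If committed has max=5 and a pending proposal is about to commit max=7, a concurrent request to set max=5 gets swallowed by preprocess as a no-op; after the commit the client's value is silently lost, which is exactly the kind of ordering violation a single Monitor::lock currently prevents.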