Re: Question/idea about performance problems with a few overloaded OSDs


 



On 10/21/2014 01:06 PM, Lionel Bouton wrote:
Hi Gregory,

On 21/10/2014 19:39, Gregory Farnum wrote:
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton <lionel+ceph@xxxxxxxxxxx> wrote:
[...]
Any thought? Is it based on wrong assumptions? Would it prove to be a
can of worms if someone tried to implement it?
Yeah, there's one big thing you're missing: we strictly order reads
and writes to an object, and the primary is the serialization point.

Of course... I should have anticipated this. As you explain later
(thanks for the detailed explanation, by the way), implementing
redirects would need a whole new way of coordinating accesses. I'm not
yet familiar with Ceph internals, but I suspect this would mutate Ceph
into another beast entirely...
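
To make the ordering point concrete, here is a toy model of the primary as the serialization point (the class and method names are my own invention, nothing from the actual OSD code): every read and write on an object funnels through a single per-object queue, so there is exactly one agreed-upon order.

from collections import deque

class PrimaryOSD:
    """Toy model: the primary owns a FIFO per object, so ops cannot reorder."""
    def __init__(self):
        self.object_queues = {}   # object name -> FIFO of pending ops

    def submit(self, obj, op):
        # Ops on the same object are appended to one queue and therefore
        # execute strictly in submission order.
        self.object_queues.setdefault(obj, deque()).append(op)

    def process(self, obj):
        # Drain the queue one op at a time; a read queued after a write
        # can never observe the pre-write state.
        results = []
        q = self.object_queues.get(obj, deque())
        while q:
            results.append(q.popleft())   # execute / replicate here
        return results

primary = PrimaryOSD()
primary.submit("obj1", ("write", "v1"))
primary.submit("obj1", ("read", None))   # must see v1
print(primary.process("obj1"))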


If we were to proxy reads to another replica it would be easy enough
for the primary to continue handling the ordering, but if it were just
a redirect it wouldn't be able to do so (the primary doesn't know when
the read is completed, allowing it to start a write). Setting up the
proxy of course requires a lot more code, but more importantly it's
more resource-intensive on the primary, so I'm not sure if it's worth
it. :/

Difficult to know without real-life testing. It's a non-trivial
CPU/network/disk capacity trade-off...
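
A rough sketch of the distinction described above, with invented names (not the actual OSD code path): when the primary proxies a read it eventually sees the completion and can release a queued write; with a pure redirect it never gets that signal, so it cannot safely order a subsequent write.

class ProxyingPrimary:
    def __init__(self):
        self.inflight_reads = {}   # object -> count of proxied reads
        self.blocked_writes = {}   # object -> writes waiting on those reads

    def proxy_read(self, obj, replica):
        # The primary stays in the loop: it knows a read is outstanding.
        self.inflight_reads[obj] = self.inflight_reads.get(obj, 0) + 1
        return f"forward read({obj}) to {replica}"

    def read_completed(self, obj):
        # Completion comes back through the primary; once the last proxied
        # read drains, any queued writes may proceed.
        self.inflight_reads[obj] -= 1
        if self.inflight_reads[obj] == 0:
            return self.blocked_writes.pop(obj, [])
        return []

    def write(self, obj, data):
        if self.inflight_reads.get(obj, 0) > 0:
            self.blocked_writes.setdefault(obj, []).append(data)
            return "write queued behind proxied reads"
        return f"write({obj}) applied and replicated"

p = ProxyingPrimary()
print(p.proxy_read("obj1", "osd.7"))
print(p.write("obj1", "v2"))       # held back until the read returns
print(p.read_completed("obj1"))    # -> ["v2"], now safe to apply

With a redirect the read_completed() call simply never happens at the primary, which is exactly why it cannot keep handling the ordering.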

The "primary affinity" value we recently introduced is designed to
help alleviate persistent balancing problems around this by letting
you reduce how many PGs an OSD is primary for without changing the
location of the actual data in the cluster. But dynamic updates to
that aren't really feasible either (it's a map change and requires
repeering). [...]

I forgot about this. Thanks for the reminder: this definitely would help
in some of my use cases where the load is predictable over a relatively
long period.
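
For what it's worth, this is how I picture the mechanism (my own simplification, not the actual CRUSH/OSDMap code): an OSD with primary affinity below 1.0 is skipped as primary with the complementary probability, deterministically per PG, while the data itself stays where CRUSH placed it. The value is set per OSD with `ceph osd primary-affinity <osd-id> <weight>`.

import hashlib

def choose_primary(pg_id, acting_set, affinity):
    """acting_set: ordered list of OSD ids; affinity: osd id -> 0.0..1.0"""
    for osd in acting_set:
        a = affinity.get(osd, 1.0)
        if a >= 1.0:
            return osd
        # Deterministic pseudo-random draw from the pg id and osd id, so
        # every client computes the same answer from the same map.
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        draw = int.from_bytes(h[:4], "big") / 2**32
        if draw < a:
            return osd
    return acting_set[0]   # fall back to the first OSD in the set

affinity = {3: 0.25}   # osd.3 should end up primary for roughly 25% of its PGs
print(choose_primary("4.1f", [3, 7, 12], affinity))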

I'll have to dig into the sources one day; I haven't stopped wondering
about various aspects of the internals since I began using Ceph (I've
worked on the code of distributed systems on several occasions and have
always been hooked easily)...

At some point I'd like to experiment with a datastore proxy layer that sits below the OSDs and does something similar: track latency statistics and direct writes to different stores. The idea would be to keep the benefits of CRUSH and deterministic placement (at least as far as getting the data to some OSD on a node), but allow some flexibility in avoiding hotspots on specific disks (heavy reads, seek contention, vibration, leveldb compaction stalls, etc.). This unfortunately reintroduces something like a lookup table, but perhaps at the node level it could be made fast enough that it wouldn't be as much of a problem.

I don't know if this would actually work in practice, but I think it would be a very interesting project to explore.
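
If I were to prototype it, the core loop might look something like this (all names invented, and obviously not working Ceph code): keep a smoothed latency per backing store on the node and steer incoming writes to whichever store currently looks healthiest.

import time

class DatastoreProxy:
    def __init__(self, backends, alpha=0.2):
        self.backends = backends                       # name -> callable(obj, data)
        self.latency = {name: 0.0 for name in backends}
        self.alpha = alpha                             # EWMA smoothing factor

    def write(self, obj, data):
        # Pick the backend with the lowest smoothed latency, to route around
        # hotspots (heavy reads, seek contention, compaction stalls, ...).
        target = min(self.latency, key=self.latency.get)
        start = time.monotonic()
        self.backends[target](obj, data)
        elapsed = time.monotonic() - start
        # Exponentially weighted moving average of observed write latency.
        self.latency[target] = (1 - self.alpha) * self.latency[target] + self.alpha * elapsed
        return target   # the node-local lookup table would record obj -> target

proxy = DatastoreProxy({"store-1": lambda o, d: None, "store-2": lambda o, d: None})
print(proxy.write("obj1", b"data"))

The object-to-store mapping returned at the end is exactly the lookup table mentioned above; making that cheap at the node level is the open question.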

Mark


Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

