Re: Question/idea about performance problems with a few overloaded OSDs


 



On 10/21/2014 01:06 PM, Lionel Bouton wrote:
Hi Gregory,

On 21/10/2014 19:39, Gregory Farnum wrote:
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton <lionel+ceph@xxxxxxxxxxx> wrote:
[...]
Any thought? Is it based on wrong assumptions? Would it prove to be a
can of worms if someone tried to implement it?
Yeah, there's one big thing you're missing: we strictly order reads
and writes to an object, and the primary is the serialization point.

Of course... I should have anticipated this. As you explain later
(thanks for the detailed explanation, by the way), implementing
redirects would need a whole new way of coordinating accesses. I'm not
yet familiar with Ceph internals, but I suspect this would mutate Ceph
into another beast entirely...
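
To make the ordering point concrete, here is a toy model of the primary as the serialization point (the class and method names are my own invention, nothing from the actual OSD code): every read and write on an object funnels through a single per-object queue, so there is exactly one agreed-upon order.

from collections import deque

class PrimaryOSD:
    """Toy model: the primary owns a FIFO per object, so ops cannot reorder."""
    def __init__(self):
        self.object_queues = {}   # object name -> FIFO of pending ops

    def submit(self, obj, op):
        # Ops on the same object are appended to one queue and therefore
        # execute strictly in submission order.
        self.object_queues.setdefault(obj, deque()).append(op)

    def process(self, obj):
        # Drain the queue one op at a time; a read queued after a write
        # can never observe the pre-write state.
        results = []
        q = self.object_queues.get(obj, deque())
        while q:
            results.append(q.popleft())   # execute / replicate here
        return results

primary = PrimaryOSD()
primary.submit("obj1", ("write", "v1"))
primary.submit("obj1", ("read", None))   # must see v1
print(primary.process("obj1"))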


If we were to proxy reads to another replica it would be easy enough
for the primary to continue handling the ordering, but if it were just
a redirect it wouldn't be able to do so (the primary doesn't know when
the read is completed, allowing it to start a write). Setting up the
proxy of course requires a lot more code, but more importantly it's
more resource-intensive on the primary, so I'm not sure if it's worth
it. :/

Difficult to know without real-life testing. It's a non-trivial
CPU/network/disk capacity trade-off...
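
A rough sketch of the distinction described above, with invented names (not the actual OSD code path): when the primary proxies a read it eventually sees the completion and can release a queued write; with a pure redirect it never gets that signal, so it cannot safely order a subsequent write.

class ProxyingPrimary:
    def __init__(self):
        self.inflight_reads = {}   # object -> count of proxied reads
        self.blocked_writes = {}   # object -> writes waiting on those reads

    def proxy_read(self, obj, replica):
        # The primary stays in the loop: it knows a read is outstanding.
        self.inflight_reads[obj] = self.inflight_reads.get(obj, 0) + 1
        return f"forward read({obj}) to {replica}"

    def read_completed(self, obj):
        # Completion comes back through the primary; once the last proxied
        # read drains, any queued writes may proceed.
        self.inflight_reads[obj] -= 1
        if self.inflight_reads[obj] == 0:
            return self.blocked_writes.pop(obj, [])
        return []

    def write(self, obj, data):
        if self.inflight_reads.get(obj, 0) > 0:
            self.blocked_writes.setdefault(obj, []).append(data)
            return "write queued behind proxied reads"
        return f"write({obj}) applied and replicated"

p = ProxyingPrimary()
print(p.proxy_read("obj1", "osd.7"))
print(p.write("obj1", "v2"))       # held back until the read returns
print(p.read_completed("obj1"))    # -> ["v2"], now safe to apply

With a redirect the read_completed() call simply never happens at the primary, which is exactly why it cannot keep handling the ordering.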

The "primary affinity" value we recently introduced is designed to
help alleviate persistent balancing problems around this by letting
you reduce how many PGs an OSD is primary for without changing the
location of the actual data in the cluster. But dynamic updates to
that aren't really feasible either (it's a map change and requires
repeering). [...]

I forgot about this. Thanks for the reminder: this definitely would help
in some of my use cases where the load is predictable over a relatively
long period.
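
For what it's worth, this is how I picture the mechanism (my own simplification, not the actual CRUSH/OSDMap code): an OSD with primary affinity below 1.0 is skipped as primary with the complementary probability, deterministically per PG, while the data itself stays where CRUSH placed it. The value is set per OSD with `ceph osd primary-affinity <osd-id> <weight>`.

import hashlib

def choose_primary(pg_id, acting_set, affinity):
    """acting_set: ordered list of OSD ids; affinity: osd id -> 0.0..1.0"""
    for osd in acting_set:
        a = affinity.get(osd, 1.0)
        if a >= 1.0:
            return osd
        # Deterministic pseudo-random draw from the pg id and osd id, so
        # every client computes the same answer from the same map.
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        draw = int.from_bytes(h[:4], "big") / 2**32
        if draw < a:
            return osd
    return acting_set[0]   # fall back to the first OSD in the set

affinity = {3: 0.25}   # osd.3 should end up primary for roughly 25% of its PGs
print(choose_primary("4.1f", [3, 7, 12], affinity))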

I'll have to dig into the sources one day; I haven't stopped wondering
about various aspects of the internals since I began using Ceph (I've
worked on the code of distributed systems on several occasions and have
always been hooked easily)...

At some point I'd like to experiment with a datastore proxy layer that sits below the OSDs and does something similar: track latency statistics and direct writes to different stores. The idea would be to keep the benefits of CRUSH and deterministic placement (at least as far as getting the data to some OSD on a node), but allow some flexibility in avoiding hotspots on specific disks (heavy reads, seek contention, vibration, leveldb compaction stalls, etc.). This unfortunately reintroduces something like a lookup table, but perhaps at the node level it could be made fast enough that it wouldn't be as much of a problem.

I don't know if this would actually work in practice, but I think it would be a very interesting project to explore.
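
If I were to prototype it, the core loop might look something like this (all names invented, and obviously not working Ceph code): keep a smoothed latency per backing store on the node and steer incoming writes to whichever store currently looks healthiest.

import time

class DatastoreProxy:
    def __init__(self, backends, alpha=0.2):
        self.backends = backends                       # name -> callable(obj, data)
        self.latency = {name: 0.0 for name in backends}
        self.alpha = alpha                             # EWMA smoothing factor

    def write(self, obj, data):
        # Pick the backend with the lowest smoothed latency, to route around
        # hotspots (heavy reads, seek contention, compaction stalls, ...).
        target = min(self.latency, key=self.latency.get)
        start = time.monotonic()
        self.backends[target](obj, data)
        elapsed = time.monotonic() - start
        # Exponentially weighted moving average of observed write latency.
        self.latency[target] = (1 - self.alpha) * self.latency[target] + self.alpha * elapsed
        return target   # the node-local lookup table would record obj -> target

proxy = DatastoreProxy({"store-1": lambda o, d: None, "store-2": lambda o, d: None})
print(proxy.write("obj1", b"data"))

The object-to-store mapping returned at the end is exactly the lookup table mentioned above; making that cheap at the node level is the open question.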

Mark


Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

