Hi,

I've yet to install 0.80.7 on one node to confirm its stability and to use the new IO priority tuning parameters that enable prioritized access to data for client requests. In the meantime, faced with large slowdowns caused by resync or external IO load (external IO load is not usually expected, but it can happen during migrations from other storage solutions, as in our recent experience), I've got an idea related to the underlying problem (IO load concurrent with client requests, or even concentrated client requests). It might already be implemented, or not be of much value, so I'll write it down to get feedback.

When IO load is not balanced correctly across OSDs, the most loaded OSD becomes a bottleneck for both write and read requests, and for many (most?) workloads it becomes a bottleneck for the whole storage cluster as seen by the client. This has happened to us on numerous occasions (low filesystem performance, OSD restarts triggering backfills or recoveries).

For read requests, would it be beneficial for OSDs to exchange their recent IO mean/median/... service times with their peers, so that an OSD can proxy requests to less loaded nodes when it is substantially more loaded than its peers? If the additional network load generated by proxying requests proves detrimental to overall performance, maybe an update to librados to accept a hint redirecting read requests for a given PG for a given period might be a solution.

I understand that even if this is possible for read requests, it doesn't apply to write requests, because writes are synchronized across all replicas. That said, reducing the read load on one OSD without modifying write behavior will obviously help that OSD process write requests faster.

Even if the general idea isn't bad or already obsoleted by another, it's obviously not trivial. For example, it can create unstable feedback loops, so if I were to try and implement it I'd probably start with a "selective" proxy/redirect, where the probability of proxying/redirecting is computed from the respective loads of all the OSDs storing a given PG, to avoid "ping-pong" situations where read requests overload one OSD, then another, and come round again.

Any thoughts? Is it based on wrong assumptions? Would it prove to be a can of worms if someone tried to implement it?

Best regards,

Lionel Bouton
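
PS: to make the "selective" redirect a bit more concrete, here is a minimal sketch of what I have in mind. It is purely illustrative, not existing Ceph code; the structure names, the per-OSD mean service time metric, and the 0.1 ms floor are all assumptions. The idea is simply to pick the replica serving a read with a probability inversely proportional to its recently reported load.

#include <algorithm>  // std::max
#include <cstdio>
#include <random>     // std::mt19937, std::discrete_distribution
#include <vector>

struct ReplicaLoad {
  int osd_id;              // hypothetical OSD identifier
  double mean_service_ms;  // recent mean read service time reported by that OSD
};

// Pick the OSD that serves the next read for this PG, weighting each
// replica by the inverse of its reported load so that the least loaded
// OSDs get most (but not all) of the reads.
int pick_read_replica(const std::vector<ReplicaLoad>& replicas,
                      std::mt19937& rng) {
  std::vector<double> weights;
  weights.reserve(replicas.size());
  for (const auto& r : replicas) {
    // Lower service time -> higher weight; the 0.1 ms floor avoids a
    // division by zero and keeps one very fast OSD from taking everything.
    weights.push_back(1.0 / std::max(r.mean_service_ms, 0.1));
  }
  std::discrete_distribution<> dist(weights.begin(), weights.end());
  return replicas[dist(rng)].osd_id;
}

int main() {
  std::mt19937 rng(std::random_device{}());
  // Example: the primary (osd.0) is heavily loaded, its two peers are not.
  std::vector<ReplicaLoad> replicas = {{0, 20.0}, {3, 4.0}, {7, 5.0}};
  // Reads end up spread roughly 1/20 : 1/4 : 1/5 across osd.0, osd.3, osd.7.
  std::printf("next read goes to osd.%d\n", pick_read_replica(replicas, rng));
  return 0;
}

Because the redistribution is proportional rather than all-or-nothing, an OSD that suddenly looks cheap only attracts a share of the reads instead of all of them, which should dampen the ping-pong effect described above.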