On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton <lionel+ceph@xxxxxxxxxxx> wrote:
> Hi,
>
> I've yet to install 0.80.7 on one node to confirm its stability and use
> the new IO priority tuning parameters enabling prioritized access to
> data for client requests.
>
> In the meantime, faced with large slowdowns caused by resync or
> external IO load (although external IO load is not expected, it can
> happen during migrations from other storage solutions, as in our recent
> experience), I've had an idea related to the underlying problem (IO
> load concurrent with client requests, or even concentrated client
> requests). It might already be implemented (or not be of much value),
> so I'll write it down to get feedback.
>
> When IO load is not balanced correctly across OSDs, the most loaded
> OSD becomes a bottleneck for both write and read requests, and for
> many (most?) workloads it will become a bottleneck for the whole
> storage network as seen by the client. This has happened to us on
> numerous occasions (low filesystem performance, OSD restarts
> triggering backfills or recoveries).
> For read requests, would it be beneficial for OSDs to exchange their
> recent mean/median/... IO service times with their peers, so that an
> OSD could proxy requests to less loaded nodes when it is substantially
> more loaded than its peers?
> If the additional network load generated by proxying requests proves
> detrimental to overall performance, maybe an update to librados to
> accept a hint redirecting read requests for a given PG for a given
> period might be a solution.
>
> I understand that even if this is possible for read requests, it
> doesn't apply to write requests, because those are synchronized across
> all replicas. That said, reducing the read load on an OSD without
> modifying write behavior will obviously help that OSD process write
> requests faster.
> Even if the general idea isn't bad or already obsoleted by another
> one, it's obviously not trivial. For example, it could create unstable
> feedback loops, so if I were to try to implement it I'd probably start
> with a "selective" proxy/redirect, with the probability of
> proxying/redirecting computed from the respective loads of all OSDs
> storing a given PG, to avoid "ping-pong" situations where read
> requests overload one OSD, then move on to overload another, and come
> round again.
>
> Any thoughts? Is it based on wrong assumptions? Would it prove to be a
> can of worms if someone tried to implement it?

Yeah, there's one big thing you're missing: we strictly order reads and
writes to an object, and the primary is the serialization point. If we
were to proxy reads to another replica, it would be easy enough for the
primary to continue handling the ordering, but with a plain redirect it
couldn't (the primary wouldn't know when the read had completed and so
when it could safely start a write). Setting up the proxy of course
requires a lot more code, but more importantly it's more
resource-intensive on the primary, so I'm not sure it's worth it. :/

The "primary affinity" value we recently introduced is designed to help
alleviate persistent balancing problems around this: it lets you reduce
how many PGs an OSD is primary for without changing the location of the
actual data in the cluster. But dynamic updates to it aren't really
feasible either (it's a map change and requires repeering).
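For concreteness, the "selective" weighting sketched in the quoted mail
could be as simple as drawing the serving replica from a distribution
weighted by inverse service time, so a loaded primary sheds only part of
its reads instead of flapping between replicas. A standalone toy (all
numbers made up; nothing like this exists in Ceph):

    // Purely illustrative: a toy of the "selective" redirect weighting
    // described above. Each replica serves reads in inverse proportion
    // to its recent service time, so a loaded primary sheds only part
    // of its traffic instead of dumping all of it on one peer.
    #include <iostream>
    #include <random>
    #include <vector>

    // Pick the replica to serve a read, weighted by 1/service_time.
    size_t pick_replica(const std::vector<double>& service_time_ms,
                        std::mt19937& rng) {
        std::vector<double> weights;
        weights.reserve(service_time_ms.size());
        for (double t : service_time_ms)
            weights.push_back(1.0 / t);  // less loaded => more likely
        std::discrete_distribution<size_t> pick(weights.begin(),
                                                weights.end());
        return pick(rng);
    }

    int main() {
        // osd.0 is the primary and currently 4x slower than its peers.
        std::vector<double> service_time_ms = {20.0, 5.0, 5.0};
        std::mt19937 rng(std::random_device{}());
        std::vector<int> served(service_time_ms.size(), 0);
        for (int i = 0; i < 10000; ++i)
            ++served[pick_replica(service_time_ms, rng)];
        for (size_t i = 0; i < served.size(); ++i)
            std::cout << "osd." << i << " served " << served[i]
                      << " reads\n";
        // Expect roughly 1/9, 4/9, 4/9 of the reads: the primary keeps
        // some traffic, and shares shift gradually as service times
        // change instead of stampeding to whichever OSD looked fastest.
    }

Either way, the weighting itself does nothing about the hard part above:
keeping the primary's read/write ordering intact.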
There are also relaxed consistency mechanisms that let clients read from
a replica (randomly, or the one "closest" to them, etc.), but with these
there's no good way to get load data from the OSDs to the clients. So
redirects of some kind sound like a good feature, but I'm not sure how
one could go about implementing them reasonably. I think the actual
proxy is probably the best bet, but that's an awful lot of code in
critical places with lots of dependencies, and I'm a little dubious of
its performance/balancing benefits. :/
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
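P.S. For anyone who wants to experiment with the replica reads mentioned
above, the librados C++ API exposes them as per-op flags. A minimal
sketch, assuming librados::OPERATION_BALANCE_READS (OPERATION_LOCALIZE_READS
is the "closest" variant); the pool and object names are made up and
error handling is abbreviated — check librados.hpp for the exact
overloads:

    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
        librados::Rados cluster;
        cluster.init(NULL);            // default client.admin identity
        cluster.conf_read_file(NULL);  // default ceph.conf locations
        if (cluster.connect() < 0)
            return 1;

        librados::IoCtx ioctx;
        if (cluster.ioctx_create("rbd", ioctx) < 0)  // pool name made up
            return 1;

        // Build a read op and let a (possibly non-primary) replica
        // serve it. This is the relaxed-consistency path: the read no
        // longer passes through the primary's read/write ordering.
        librados::ObjectReadOperation op;
        librados::bufferlist bl;
        int prval = 0;
        op.read(0, 4096, &bl, &prval);

        librados::AioCompletion *c =
            librados::Rados::aio_create_completion();
        ioctx.aio_operate("some-object", c, &op,
                          librados::OPERATION_BALANCE_READS, NULL);
        c->wait_for_complete();
        std::cout << "read returned " << c->get_return_value()
                  << ", " << bl.length() << " bytes\n";
        c->release();

        ioctx.close();
        cluster.shutdown();
    }

Note that balanced reads trade away the strict ordering described above
in exchange for spreading load, which is why they only make sense for
read-mostly or immutable data.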