On Fri, Jul 31, 2015 at 7:21 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:

> I remember reading that ScaleIO (I think?) does something like this by regularly sending reports to a multicast group, so any node with issues (or just overload) is reweighted or avoided automatically on the client. The OSD map is the Ceph equivalent, I guess. It makes sense to gather metrics and prioritize better-performing OSDs over those with e.g. worse latencies, but it needs to update fast.
>
> But I believe that _network_ monitoring itself ought to be part of… a network monitoring system you should already have :-) and not a storage system that just happens to use the network. I don't remember seeing anything but a simple ping/traceroute/DNS test in any SAN interface.
>
> If an OSD has issues it might be anything from a failing drive to a swapping OS, and a number like "commit latency" (= average response time from the clients' perspective) is maybe the ultimate metric for this purpose, irrespective of the root cause.

Like a lot of system monitoring stuff, this is the kind of thing that in an ideal world we wouldn't have to worry about, but the experience in practice is that people deploy big distributed storage systems without having really good monitoring in place. We (the people providing storage) have an especially good motivation to provide basic network issue detection, because without it we can be blamed for network issues ("The storage is slow!" ... one week later ... "No, your InfiniBand cable is kinked").

That said, the fact that we're motivated to write it doesn't mean it has to be physically built into things like the OSD and the mon; it makes sense to keep things like this a bit separate.

> A nice option would be to read data from all replicas at once. This would of course increase load and cause all sorts of issues if abused, but if you have an app that absolutely-always-without-fail-must-get-data-ASAP then you could enable this in the client (and I think that would be an easy option to add). This is actually used in some systems. The harder part is to fail nicely when writing (like waiting only for the remote network buffers on 2 nodes to get the data instead of waiting for commit on all 3 replicas…)

Parallel reads have been talked about:
https://wiki.ceph.com/Planning/Blueprints/Hammer/librados%3A_support_parallel_reads
(no idea if anyone has a working version of it yet).

John
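
P.S. For illustration, the "read from every replica and keep the first answer" behaviour Jan describes can be sketched in a few lines of client-side code. This is only a sketch of the pattern, not the librados API (the blueprint above is just a proposal); read_from_replica() and the replica list are hypothetical stand-ins for whatever a real client implementation would expose.

    # Sketch only: hedged/parallel read across replicas, first answer wins.
    import concurrent.futures

    def read_from_replica(osd_id, obj_name):
        """Hypothetical helper: read one object from one specific replica."""
        raise NotImplementedError("stand-in for a real per-replica read call")

    def parallel_read(obj_name, replica_osds):
        """Issue the same read to all replicas and return the first result.

        Trades extra cluster load (N reads instead of 1) for the lowest
        possible latency, which is the tradeoff described above.
        """
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_osds))
        futures = [pool.submit(read_from_replica, osd, obj_name)
                   for osd in replica_osds]
        try:
            errors = []
            for fut in concurrent.futures.as_completed(futures):
                try:
                    return fut.result()        # first replica to answer wins
                except Exception as exc:       # one bad replica shouldn't sink the read
                    errors.append(exc)
            raise errors[0]                    # every replica failed
        finally:
            # Don't wait for the slower replicas; their results are discarded.
            pool.shutdown(wait=False)

As Jan notes, the write side is the harder half, since acknowledging before all replicas have committed changes the guarantees you get back.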