I remember reading that ScaleIO (I think?) does something like this by regularly sending reports to a multicast group, so any node with issues (or just overload) is reweighted or avoided automatically on the client. The OSD map is the Ceph equivalent, I guess. It makes sense to gather metrics and prioritize better-performing OSDs over those with e.g. worse latencies, but it needs to update fast.

But I believe that _network_ monitoring itself ought to be part of… a network monitoring system you should already have :-) and not a storage system that just happens to use the network. I don’t remember seeing anything but a simple ping/traceroute/DNS test in any SAN interface.

If an OSD has issues it might be anything from a failing drive to a swapping OS, and a number like “commit latency” (= average response time from the clients’ perspective) is maybe the ultimate metric of all for this purpose, irrespective of the root cause.

A nice option would be to read data from all replicas at once. This would of course increase load and cause all sorts of issues if abused, but if you have an app that absolutely-always-without-fail-must-get-data-ASAP then you could enable this in the client (and I think that would be an easy option to add). This is actually used in some systems. The harder part is to fail nicely when writing (like waiting only for the remote network buffers on 2 nodes to get the data instead of waiting for commit on all 3 replicas…)

Jan

> On 31 Jul 2015, at 19:45, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Even just a ping at max MTU set with nodefrag could tell a lot about
> connectivity issues and latency without a lot of traffic. Using Ceph
> messenger would be even better to check firewall ports. I like the
> idea of incorporating simple network checks into Ceph. The monitor can
> correlate failures and help determine if the problem is related to one
> host from the CRUSH map.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Jul 30, 2015 at 11:27 PM, Stijn De Weirdt wrote:
>> wouldn't it be nice that ceph does something like this in background (some
>> sort of network-scrub). debugging network like this is not that easy (can't
>> expect admins to install e.g. perfsonar on all nodes and/or clients)
>>
>> something like: every X min, each service X pick a service Y on another host
>> (assuming X and Y will exchange some communication at some point; like osd
>> with other osd), send 1MB of data, and make the timing data available so we
>> can monitor it and detect underperforming links over time.
>>
>> ideally clients also do this, but not sure where they should report/store
>> the data.
>>
>> interpreting the data can be a bit tricky, but extreme outliers will be
>> spotted easily, and the main issue with this sort of debugging is collecting
>> the data.
>>
>> simply reporting / keeping track of ongoing communications is already a big
>> step forward, but then we need to have the size of the exchanged data to
>> allow interpretation (and the timing should be about the network part, not
>> e.g. flush data to disk in case of an osd). (and obviously sampling is
>> enough, no need to have details of every bit sent).
>>
>>
>>
>> stijn
>>
>>
>> On 07/30/2015 08:04 PM, Mark Nelson wrote:
>>>
>>> Thanks for posting this! We see issues like this more often than you'd
>>> think. It's really important too because if you don't figure it out the
>>> natural inclination is to blame Ceph! :)
>>>
>>> Mark
>>>
>>> On 07/30/2015 12:50 PM, Quentin Hartman wrote:
>>>>
>>>> Just wanted to drop a note to the group that I had my cluster go
>>>> sideways yesterday, and the root of the problem was networking again.
>>>> Using iperf I discovered that one of my nodes was only moving data at
>>>> 1.7Mb / s.
>>>> Moving that node to a different switch port with a different
>>>> cable has resolved the problem. It took a while to track down because
>>>> none of the server-side error metrics for disk or network showed
>>>> anything was amiss, and I didn't think to test network performance (as
>>>> suggested in another thread) until well into the process.
>>>>
>>>> Check networking first!
>>>>
>>>> QH
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v0.13.1
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJVu7QoCRDmVDuy+mK58QAAcpAQAKbv6xPRxMMJ8NWrXym0
> NAtZFIYywvStKfTG2pL1xjb2p/xDM+6Z5mnYJTBHb+0dkGIO6qe0jF9t4XEE
> ppH+55eIpkCZrKMdfN1L0vUe9ldFnJS2jsAlGkvzyRLJale++q1evymIAaWb
> JnEZgV3pGrPTCRaVKNrT3NaGZVDLm6ygnsT6PYJaiXM8Av3equ00Uls2/i6v
> vZhlIBz5TbKsNag/W7cRJVvjj7YDsgU+dplDl62mmDJ6o+cWvILlf9WPINdV
> MrmIeg+7fqUEp8nuEzTMm+BDHQ3c/5cxrYr8bksiVoBTXV7m9fO0Je9Exn6N
> iWTa5eDUBtR6Ha8WaVUib/cvFj6j94QRNWYmXHl9lG50p+XZ0L5bZ1G8v9Nb
> gGxRoYgAncp9M1J+7Pvm5z8wZgxXAs/veUtrf+6SkUbGyCRnUSn/VS7C8syJ
> 4WW2aWP/A0nxSDe1u+TGpkkPmhk7UDrJEfMQaZrFwS9FkFLfgLH7PxMcAZjJ
> hlN129vldPh3QxLviLidlJmzUTvKtb+XrSkA0MjhFMJS2M79DR16j+XWe7Ub
> wPnKpZcZ8WsQzOlTHtDEHQvhE3ilcm+4oALSiuqEAZKNKk8lUTtvfzJ2BKyu
> Tv46c+Wf3LbwrdMnkGiMHLuIlqhQT2FzauM2Pi+Pt7QJ7L9xXfWW4vzdemxj
> bBQD
> =rPC0
> -----END PGP SIGNATURE-----
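
For anyone who wants to try the "network-scrub" idea from the quoted thread outside of Ceph, here is a rough sketch in Python. It only illustrates the technique (send 1 MB to a peer, time the network part, flag extreme outliers); it is not Ceph code and does not use the Ceph messenger. The names `serve_once`, `probe` and `slow_links`, the one-byte ack, and the "below half the median" threshold are all assumptions made up for this example.

```python
import socket
import statistics
import threading
import time

PAYLOAD_SIZE = 1024 * 1024  # 1 MB, the size suggested in the thread


def serve_once(listener):
    """Hypothetical probe responder: drain one payload, then acknowledge it."""
    conn, _ = listener.accept()
    with conn:
        remaining = PAYLOAD_SIZE
        while remaining > 0:
            chunk = conn.recv(min(65536, remaining))
            if not chunk:
                return  # sender went away early, nothing to acknowledge
            remaining -= len(chunk)
        # One-byte ack so the sender times the whole transfer, not just
        # the copy into its local socket buffer.
        conn.sendall(b"k")


def probe(host, port, timeout=10.0):
    """Send PAYLOAD_SIZE bytes to a responder and return throughput in MB/s."""
    payload = b"\0" * PAYLOAD_SIZE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        start = time.perf_counter()
        sock.sendall(payload)
        ack = sock.recv(1)
        elapsed = time.perf_counter() - start
    if ack != b"k":
        raise ConnectionError("peer closed before acknowledging the payload")
    # Guard against a zero reading on very coarse clocks.
    return (PAYLOAD_SIZE / 1e6) / max(elapsed, 1e-9)


def slow_links(throughput_by_peer, fraction=0.5):
    """Return peers whose throughput is below `fraction` of the median."""
    median = statistics.median(throughput_by_peer.values())
    return sorted(
        peer for peer, mbps in throughput_by_peer.items()
        if mbps < fraction * median
    )


if __name__ == "__main__":
    # Loopback demo: start a responder on a free port and probe it once.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_once, args=(listener,), daemon=True).start()
    print(f"loopback: {probe('127.0.0.1', port):.1f} MB/s")
```

In a real setup each node would run the responder permanently and probe a randomly chosen peer every few minutes, feeding the results into whatever monitoring system is already in place. Comparing against the median is just one simple way to spot the "extreme outliers" mentioned in the thread; a link like the one described at the bottom of the thread would stand out immediately.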
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com