Even just a ping at the maximum MTU with the don't-fragment (nodefrag) flag set could tell a lot about connectivity issues and latency without generating much traffic. Using the Ceph messenger would be even better, since it would also verify that the firewall ports are open. I like the idea of incorporating simple network checks into Ceph. The monitor could correlate failures and, with the CRUSH map, help determine whether the problem is tied to a single host.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
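
For illustration, a rough sketch (not part of the original mail) of the kind of check described above, assuming Linux iputils ping and Python 3.7+; the default host name and 9000-byte MTU are placeholders:

#!/usr/bin/env python3
# Hypothetical sketch: ping a peer at the full MTU with the don't-fragment
# bit set, assuming Linux iputils ping. A payload of MTU - 28 bytes fills
# an IPv4 packet (20-byte IP header + 8-byte ICMP header).
import subprocess
import sys

def mtu_ping(host, mtu=9000, count=5):
    payload = mtu - 28  # IPv4 header (20) + ICMP header (8)
    cmd = ["ping", "-M", "do",          # prohibit fragmentation
           "-c", str(count),
           "-s", str(payload),
           host]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A non-zero exit code or a fragmentation-needed error in the output
    # points at an MTU mismatch somewhere on the path; unusually high rtt
    # values point at latency problems.
    print(result.stdout)
    print(result.stderr, file=sys.stderr)
    return result.returncode == 0

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "osd-host-1"  # placeholder host
    sys.exit(0 if mtu_ping(host) else 1)

Running this pairwise between OSD hosts is cheap enough to script across the whole cluster.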
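
And a similarly hypothetical sketch of the "network scrub" idea in Stijn's mail quoted below: every few minutes, push 1MB of data to a randomly chosen peer over TCP and record how long the transfer took, so persistent outliers stand out. The peer list, port, and interval are made-up values for illustration, not anything Ceph provides today:

#!/usr/bin/env python3
# Hypothetical sketch of a periodic peer-to-peer timing probe.
import random
import socket
import threading
import time

PEERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # placeholder OSD hosts
PORT = 50017                                     # placeholder probe port
PAYLOAD = b"\0" * (1024 * 1024)                  # 1MB test payload
PROBE_INTERVAL = 300                             # seconds between probes

def serve():
    """Sink that just drains whatever a probing peer sends."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        while conn.recv(65536):
            pass
        conn.close()

def probe_once():
    """Send 1MB to one random peer and report MB/s; outliers flag bad links."""
    peer = random.choice(PEERS)
    start = time.time()
    try:
        with socket.create_connection((peer, PORT), timeout=10) as sock:
            sock.sendall(PAYLOAD)
            sock.shutdown(socket.SHUT_WR)  # signal end of data to the sink
            sock.recv(1)                   # wait for the peer to close
    except OSError as err:
        print("%s probe failed: %s" % (peer, err))
        return
    elapsed = time.time() - start
    print("%s: 1MB in %.3fs (%.1f MB/s)" % (peer, elapsed, 1.0 / max(elapsed, 1e-6)))

if __name__ == "__main__":
    threading.Thread(target=serve, daemon=True).start()
    while True:
        probe_once()
        time.sleep(PROBE_INTERVAL)

In a real implementation the results would be reported to the monitor (or some central store) rather than printed, which is exactly the open question about where clients should report their data.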
On Thu, Jul 30, 2015 at 11:27 PM, Stijn De Weirdt wrote:
> Wouldn't it be nice if Ceph did something like this in the background (some
> sort of network scrub)? Debugging the network like this is not that easy
> (you can't expect admins to install e.g. perfSONAR on all nodes and/or
> clients).
>
> Something like: every X minutes, each service picks a service on another
> host (assuming the two will exchange some communication at some point, like
> one OSD with another OSD), sends 1MB of data, and makes the timing data
> available so we can monitor it and detect underperforming links over time.
>
> Ideally clients would also do this, but I'm not sure where they should
> report/store the data.
>
> Interpreting the data can be a bit tricky, but extreme outliers will be
> spotted easily, and the main issue with this sort of debugging is
> collecting the data.
>
> Simply reporting / keeping track of ongoing communications is already a big
> step forward, but then we need to know the size of the exchanged data to
> allow interpretation (and the timing should cover only the network part,
> not e.g. flushing data to disk in the case of an OSD). And obviously
> sampling is enough; there is no need for details of every bit sent.
>
>
> stijn
>
>
> On 07/30/2015 08:04 PM, Mark Nelson wrote:
>>
>> Thanks for posting this! We see issues like this more often than you'd
>> think. It's really important too, because if you don't figure it out, the
>> natural inclination is to blame Ceph! :)
>>
>> Mark
>>
>> On 07/30/2015 12:50 PM, Quentin Hartman wrote:
>>>
>>> Just wanted to drop a note to the group that I had my cluster go
>>> sideways yesterday, and the root of the problem was networking again.
>>> Using iperf I discovered that one of my nodes was only moving data at
>>> 1.7 Mb/s. Moving that node to a different switch port with a different
>>> cable has resolved the problem. It took a while to track down because
>>> none of the server-side error metrics for disk or network showed that
>>> anything was amiss, and I didn't think to test network performance (as
>>> suggested in another thread) until well into the process.
>>>
>>> Check networking first!
>>>
>>> QH

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com