Re: Check networking first?

Ben Hines <bhines@xxxxxxxxx> · Fri, 31 Jul 2015 17:19:24 -0700

I encountered a similar problem. Incoming firewall ports were blocked
on one host. So the other OSDs kept marking that OSD as down. But, it
could talk out, so it kept saying 'hey, i'm up, mark me up' so then
the other OSDs started trying to send it data again, causing backed up
requests.. Which goes on, ad infinitum. I had to figure out the
connectivity problem myself by looking in the OSD logs.

After a while, the cluster should just say 'no, you're not reachable,
stop putting yourself back into the cluster'.

-Ben

On Fri, Jul 31, 2015 at 11:21 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> I remember reading that ScaleIO (I think?) does something like this by regularly sending reports to a multicast group, thus any node with issues (or just overload) is reweighted or avoided automatically on the client. OSD map is the Ceph equivalent I guess. It makes sense to gather metrics and prioritize better performing OSDs over those with e.g. worse latencies, but it needs to update fast. But I believe that _network_ monitoring itself ought to be part of… a network monitoring system you should already have :-) and not a storage system that just happens to use network. I don’t remember seeing anything but a simple ping/traceroute/dns test in any SAN interface. If an OSD has issues it might be anything from a failing drive to a swapping OS and a number like “commit latency” (= response time average from the clients’ perspective) is maybe the ultimate metric of all for this purpose, irrespective of the root cause.
>
> Nice option would be to read data from all replicas at once - this would of course increase load and cause all sorts of issues if abused, but if you have an app that absolutely-always-without-fail-must-get-data-ASAP then you could enable this in the client (and I think that would be an easy option to add). This is actually used in some systems. Harder part is to fail nicely when writing (like waiting only for the remote network buffers on 2 nodes to get the data instead of waiting for commit on all 3 replicas…)
>
> Jan
>
>> On 31 Jul 2015, at 19:45, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Even just a ping at max MTU set with nodefrag could tell a lot about
>> connectivity issues and latency without a lot of traffic. Using Ceph
>> messenger would be even better to check firewall ports. I like the
>> idea of incorporating simple network checks into Ceph. The monitor can
>> correlate failures and help determine if the problem is related to one
>> host from the CRUSH map.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Jul 30, 2015 at 11:27 PM, Stijn De Weirdt  wrote:
>>> wouldn't it be nice that ceph does something like this in background (some
>>> sort of network-scrub). debugging network like this is not that easy (can't
>>> expect admins to install e.g. perfsonar on all nodes and/or clients)
>>>
>>> something like: every X min, each service X pick a service Y on another host
>>> (assuming X and Y will exchange some communication at some point; like osd
>>> with other osd), send 1MB of data, and make the timing data available so we
>>> can monitor it and detect underperforming links over time.
>>>
>>> ideally clients also do this, but not sure where they should report/store
>>> the data.
>>>
>>> interpreting the data can be a bit tricky, but extreme outliers will be
>>> spotted easily, and the main issue with this sort of debugging is collecting
>>> the data.
>>>
>>> simply reporting / keeping track of ongoing communications is already a big
>>> step forward, but then we need to have the size of the exchanged data to
>>> allow interpretation (and the timing should be about the network part, not
>>> e.g. flush data to disk in case of an osd). (and obviously sampling is
>>> enough, no need to have details of every bit send).
>>>
>>>
>>>
>>> stijn
>>>
>>>
>>> On 07/30/2015 08:04 PM, Mark Nelson wrote:
>>>>
>>>> Thanks for posting this!  We see issues like this more often than you'd
>>>> think.  It's really important too because if you don't figure it out the
>>>> natural inclination is to blame Ceph! :)
>>>>
>>>> Mark
>>>>
>>>> On 07/30/2015 12:50 PM, Quentin Hartman wrote:
>>>>>
>>>>> Just wanted to drop a note to the group that I had my cluster go
>>>>> sideways yesterday, and the root of the problem was networking again.
>>>>> Using iperf I discovered that one of my nodes was only moving data at
>>>>> 1.7Mb / s. Moving that node to a different switch port with a different
>>>>> cable has resolved the problem. It took awhile to track down because
>>>>> none of the server-side error metrics for disk or network showed
>>>>> anything was amiss, and I didn't think to test network performance (as
>>>>> suggested in another thread) until well into the process.
>>>>>
>>>>> Check networking first!
>>>>>
>>>>> QH
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v0.13.1
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJVu7QoCRDmVDuy+mK58QAAcpAQAKbv6xPRxMMJ8NWrXym0
>> NAtZFIYywvStKfTG2pL1xjb2p/xDM+6Z5mnYJTBHb+0dkGIO6qe0jF9t4XEE
>> ppH+55eIpkCZrKMdfN1L0vUe9ldFnJS2jsAlGkvzyRLJale++q1evymIAaWb
>> JnEZgV3pGrPTCRaVKNrT3NaGZVDLm6ygnsT6PYJaiXM8Av3equ00Uls2/i6v
>> vZhlIBz5TbKsNag/W7cRJVvjj7YDsgU+dplDl62mmDJ6o+cWvILlf9WPINdV
>> MrmIeg+7fqUEp8nuEzTMm+BDHQ3c/5cxrYr8bksiVoBTXV7m9fO0Je9Exn6N
>> iWTa5eDUBtR6Ha8WaVUib/cvFj6j94QRNWYmXHl9lG50p+XZ0L5bZ1G8v9Nb
>> gGxRoYgAncp9M1J+7Pvm5z8wZgxXAs/veUtrf+6SkUbGyCRnUSn/VS7C8syJ
>> 4WW2aWP/A0nxSDe1u+TGpkkPmhk7UDrJEfMQaZrFwS9FkFLfgLH7PxMcAZjJ
>> hlN129vldPh3QxLviLidlJmzUTvKtb+XrSkA0MjhFMJS2M79DR16j+XWe7Ub
>> wPnKpZcZ8WsQzOlTHtDEHQvhE3ilcm+4oALSiuqEAZKNKk8lUTtvfzJ2BKyu
>> Tv46c+Wf3LbwrdMnkGiMHLuIlqhQT2FzauM2Pi+Pt7QJ7L9xXfWW4vzdemxj
>> bBQD
>> =rPC0
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com