Re: Missing heartbeats, OSD spending time reconnecting - possible bug?

----- On 11 Nov 2016 at 14:35, Wido den Hollander wido@xxxxxxxx wrote:
>> On 11 November 2016 at 14:23, Trygve Vea <trygve.vea@xxxxxxxxxxxxxxxxxx> wrote:
>> 
>> 
>> Hi,
>> 
>> We recently experienced a problem with a single OSD.  This occurred twice.
>> 
>> The problem manifested itself thus:
>> 
>> - 8 placement groups stuck peering, all of which had the problematic OSD as one
>> of the acting OSDs in the set.
>> - The OSD had a lot of active placement groups
>> - The OSD was blocking IO on placement groups that were active (waiting for
>> subops logged on the monitors)
>> - The OSD logged that a single other OSD didn't respond to heartbeats.  This OSD
>> was not involved in any of the PGs stuck in peering.
>> 
>> 2016-11-10 21:40:30.373352 7fad7fdc4700 -1 osd.29 32033 heartbeat_check: no
>> reply from osd.14 since back 2016-11-10 21:40:03.465758 front 2016-11-10
>> 21:40:03.465758 (cutoff 2016-11-10 21:40:10.373339)
>> 
>> This was logged until the OSD was restarted.  osd.14, in turn, logged a few
>> instances of this:
>> 
>> 2016-11-10 21:40:30.697238 7f1a8f9cb700  0 -- 10.20.9.21:6808/18024412 >>
>> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38625 s=2 pgs=9449 cs=1 l=0
>> c=0x7f1b499a4a80).fault, initiating reconnect
>> 2016-11-10 21:40:30.697860 7f1a8a16f700  0 -- 10.20.9.21:6808/18024412 >>
>> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38627 s=1 pgs=9449 cs=2 l=0
>> c=0x7f1b499a4a80).connect got RESETSESSION
>> 
>> 
>> There was no real IO problem between the OSD and its disk.  It was using more
>> CPU than usual, but there was no indication that the bottleneck was the drive,
>> and the drive is healthy.
>> 
>> 
>> Killing off osd.29 unblocked the traffic.  The OSD was then started again,
>> things recovered nicely, and everything worked fine throughout the night.
>> 
>> The next morning, the same behaviour described above recurred on osd.29.
>> Fewer PGs were stuck in peering, but it was still blocking IO.  The OSD was
>> then killed off, and it has not been started since.  I'm leaving it as it is
>> in case there is any possibility of using the OSD partition for forensics
>> (ordinary XFS filesystem, journal on SSD).
>> 
>> 
>> I'm not an expert on the low-level behaviour of Ceph, but the logged
>> reconnection attempts from osd.14, combined with osd.29 complaining about
>> missing heartbeats, sound to me like a bug.
>> 
>> Has anyone else seen this behaviour?
>> 
> 
> Yes, but that usually indicates that there is something wrong with the network
> or the machine.
> 
> Is osd.29 alone on that machine? Did you verify that the network is OK? Any
> firewalls present?

There are four OSDs on each machine, and the machines are dedicated to running OSDs.

We've addressed a bottleneck where the buffers on the attached switch filled up and we occasionally dropped packets, which I suspect contributed to this issue.
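
In case it's useful to anyone else: here is a minimal Python sketch, assuming Linux sysfs counters, for watching host-side drop counters on an OSD node.  The interface name "eth0" is a placeholder; note that drops inside the switch buffers themselves won't show up here and need the switch's own counters.

import time

IFACE = "eth0"  # placeholder -- adjust per host
COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")

def read_counters(iface):
    # Each counter is a single integer in a sysfs file.
    stats = {}
    for name in COUNTERS:
        path = "/sys/class/net/%s/statistics/%s" % (iface, name)
        with open(path) as f:
            stats[name] = int(f.read())
    return stats

prev = read_counters(IFACE)
while True:
    time.sleep(10)
    cur = read_counters(IFACE)
    deltas = {k: cur[k] - prev[k] for k in COUNTERS if cur[k] != prev[k]}
    if deltas:
        # Any non-zero delta is a host-side drop/error in the last 10 seconds.
        print(time.strftime("%F %T"), deltas)
    prev = cur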

There are no firewalls present.

However, we still see the occasional heartbeat_map: reset_timeout message (2-3 per day, per OSD, at varying times).  So there is still something funky going on here.
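
To check whether these events cluster in time, here is a minimal Python sketch that tallies the heartbeat-related messages per hour.  It assumes the default /var/log/ceph/ceph-osd.*.log locations and the Jewel-era message texts quoted above; adjust paths and patterns to taste.

import glob
import re
from collections import Counter

# Message patterns as they appear in the logs quoted in this thread.
PATTERNS = ("heartbeat_map.*reset_timeout", "heartbeat_check: no reply")
# OSD log lines start with a timestamp like "2016-11-10 21:40:30.373352";
# capture up to the hour so we can bucket per (logfile, hour).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):")

hits = Counter()
for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):
    with open(path, errors="replace") as f:
        for line in f:
            if any(re.search(p, line) for p in PATTERNS):
                m = TS.match(line)
                if m:
                    hits[(path, m.group(1))] += 1

for (path, hour), count in sorted(hits.items()):
    print("%s  %s:00  %d" % (path, hour, count))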


We are also seeing a significantly increased CPU footprint, which likewise started after we upgraded to Jewel: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013959.html
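
To put numbers on that CPU increase rather than eyeballing top, here is a minimal sketch using the third-party psutil package (pip install psutil); it assumes the daemons show up with the process name "ceph-osd".

import time
import psutil

# Find the running OSD daemons by process name.
osds = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "ceph-osd"]

for p in osds:
    p.cpu_percent(None)  # prime the per-process counters; first call returns 0.0

while True:
    time.sleep(10)
    for p in osds:
        try:
            # CPU% averaged since the previous call, per OSD process.
            print("pid %d: %5.1f%% CPU" % (p.pid, p.cpu_percent(None)))
        except psutil.NoSuchProcess:
            pass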



-- 
Trygve
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


