----- On 11 Nov 2016 at 14:35, Wido den Hollander wido@xxxxxxxx wrote:
>> On 11 November 2016 at 14:23, Trygve Vea <trygve.vea@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>
>> Hi,
>>
>> We recently experienced a problem with a single OSD. This occurred twice.
>>
>> The problem manifested itself as follows:
>>
>> - 8 placement groups were stuck peering, all of which had the problematic OSD
>>   as one of the acting OSDs in the set.
>> - The OSD had a lot of active placement groups.
>> - The OSD was blocking IO on placement groups that were active ("waiting for
>>   subops" logged on the monitors).
>> - The OSD logged that a single other OSD didn't respond to heartbeats. That
>>   OSD was not involved in any of the PGs stuck in peering.
>>
>> 2016-11-10 21:40:30.373352 7fad7fdc4700 -1 osd.29 32033 heartbeat_check: no reply from osd.14 since back 2016-11-10 21:40:03.465758 front 2016-11-10 21:40:03.465758 (cutoff 2016-11-10 21:40:10.373339)
>>
>> This was logged until the OSD was restarted. osd.14, in its turn, logged a few
>> instances of this:
>>
>> 2016-11-10 21:40:30.697238 7f1a8f9cb700 0 -- 10.20.9.21:6808/18024412 >> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38625 s=2 pgs=9449 cs=1 l=0 c=0x7f1b499a4a80).fault, initiating reconnect
>> 2016-11-10 21:40:30.697860 7f1a8a16f700 0 -- 10.20.9.21:6808/18024412 >> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38627 s=1 pgs=9449 cs=2 l=0 c=0x7f1b499a4a80).connect got RESETSESSION
>>
>>
>> There was no real IO problem between the OSD and the disk. It was using more
>> CPU than usual, but nothing indicated that the bottleneck was the drive, and
>> the drive is healthy.
>>
>>
>> Killing off osd.29 unblocked the traffic. The OSD was then started again,
>> things recovered nicely, and everything worked fine throughout the night.
>>
>> The next morning, the same behaviour as described above reoccurred on osd.29.
>> Fewer PGs were stuck in peering, but it was still blocking IO. The OSD was
>> then killed off and has not been started since. I'm leaving it as it is in
>> case there is any possibility of using the OSD partition for forensics
>> (ordinary xfs filesystem, journal on SSD).
>>
>>
>> I'm not an expert on the low-level behaviour of Ceph, but the logged
>> reconnection attempts from osd.14, and the complaints about missing
>> heartbeats on osd.29, sound to me like a bug.
>>
>> Has anyone else seen this behaviour?
>>
>
> Yes, but that usually indicates that there is something wrong with the network
> or the machine.
>
> Is osd.29 alone on that machine? Did you verify that the network is OK? Any
> firewalls present?

There are four OSDs on each machine, and the machines are dedicated to OSDs.

We've addressed a bottleneck where the buffers on the attached switch filled up
and we occasionally dropped packets, which I suspect contributed to this issue.
There are no firewalls present.

However, we still see the occasional heartbeat_map reset_timeout message (2-3
per day, per OSD, at varying times), so there is still something funky going on
here.

We also see a significantly increased CPU footprint, which also started after we
upgraded to Jewel;
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013959.html

-- 
Trygve

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com