Missing heartbeats, OSD spending time reconnecting - possible bug?

We recently experienced a problem with a single OSD.  This occurred twice.

The problem manifested itself thus:

- 8 placement groups stuck peering, all of which had the problematic OSD as one of the acting OSDs in the set.
- The OSD had a lot of active placement groups
- The OSD was blocking IO on placement groups that were active (waiting for subops logged on the monitors)
- The OSD logged that a single other OSD didn't respond to heartbeats.  This other OSD was not involved in any of the PGs stuck in peering.

2016-11-10 21:40:30.373352 7fad7fdc4700 -1 osd.29 32033 heartbeat_check: no reply from osd.14 since back 2016-11-10 21:40:03.465758 front 2016-11-10 21:40:03.465758 (cutoff 2016-11-10 21:40:10.373339)

This was logged until osd.29 was restarted.  osd.14 in turn logged a few instances of this:

2016-11-10 21:40:30.697238 7f1a8f9cb700  0 -- >> pipe(0x7f1b15af8800 sd=20 :38625 s=2 pgs=9449 cs=1 l=0 c=0x7f1b499a4a80).fault, initiating reconnect
2016-11-10 21:40:30.697860 7f1a8a16f700  0 -- >> pipe(0x7f1b15af8800 sd=20 :38627 s=1 pgs=9449 cs=2 l=0 c=0x7f1b499a4a80).connect got RESETSESSION
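
For anyone who wants to compare the timestamps in the heartbeat_check line from osd.29 mechanically, here is a minimal Python sketch.  It assumes the log format is exactly as shown above; the regex and the function name parse_heartbeat_check are illustrative and not part of Ceph.  It reports how long the peer had been silent when the line was written, and how far past the cutoff the last reply was.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumed layout, taken from the single heartbeat_check line quoted above.
HEARTBEAT_RE = re.compile(
    rf"^(?P<logged>{TS}) \S+ +-?\d+ (?P<reporter>osd\.\d+) \d+ "
    rf"heartbeat_check: no reply from (?P<peer>osd\.\d+) "
    rf"since back (?P<back>{TS}) front (?P<front>{TS}) "
    rf"\(cutoff (?P<cutoff>{TS})\)"
)


def _ts(text):
    """Parse a log timestamp such as '2016-11-10 21:40:30.373352'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_heartbeat_check(line):
    """Return reporter, peer and timing gaps (seconds), or None if no match."""
    m = HEARTBEAT_RE.match(line.strip())
    if m is None:
        return None
    logged, cutoff = _ts(m["logged"]), _ts(m["cutoff"])
    # Use the older of the back/front timestamps as the last reply seen.
    last_reply = min(_ts(m["back"]), _ts(m["front"]))
    return {
        "reporter": m["reporter"],
        "peer": m["peer"],
        # How long the peer had been silent when the line was logged.
        "silent_for": (logged - last_reply).total_seconds(),
        # How far the last reply lies before the cutoff.
        "past_cutoff": (cutoff - last_reply).total_seconds(),
        # Distance between log time and cutoff (the effective grace window).
        "grace": (logged - cutoff).total_seconds(),
    }


if __name__ == "__main__":
    sample = (
        "2016-11-10 21:40:30.373352 7fad7fdc4700 -1 osd.29 32033 "
        "heartbeat_check: no reply from osd.14 since back "
        "2016-11-10 21:40:03.465758 front 2016-11-10 21:40:03.465758 "
        "(cutoff 2016-11-10 21:40:10.373339)"
    )
    print(parse_heartbeat_check(sample))
```

For the line above, the gap between the log time and the cutoff works out to roughly 20 seconds, and the last reply from osd.14 was about 27 seconds old when the line was written.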

There was no real IO problem between the OSD and its disk.  The OSD was using more CPU than usual, but there was no indication that the drive was the bottleneck, and the drive is healthy.

Killing osd.29 unblocked the traffic.  The OSD was then started again, and things recovered nicely and worked fine throughout the night.

The next morning, the same behaviour as described above recurred on osd.29.  Fewer PGs were stuck in peering, but IO was still being blocked.  The OSD was then killed off, and has not been started since.  I'm leaving it as it is in case there is any possibility of using the OSD partition for forensics (ordinary xfs filesystem, journal on ssd).

I'm not an expert on the low-level behaviour of Ceph, but the reconnection attempts logged by osd.14, together with the complaints about missing heartbeats on osd.29, sound to me like a bug.

Has anyone else seen this behaviour?

Trygve Vea
ceph-users mailing list
