On Thu, Oct 5, 2017 at 12:01 PM, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
> On Thursday, 05 October 2017 at 11:47 +0200, Ilya Dryomov wrote:
>> The stable pages bug manifests as multiple sporadic connection resets,
>> because in that case CRCs computed by the kernel don't always match the
>> data that gets sent out.  When the mismatch is detected on the OSD
>> side, OSDs reset the connection and you'd see messages like
>>
>> libceph: osd1 1.2.3.4:6800 socket closed (con state OPEN)
>> libceph: osd2 1.2.3.4:6804 socket error on write
>>
>> This is a different issue.  Josy, Adrian, Olivier, do you also see
>> messages of the "libceph: read_partial_message ..." type or is it just
>> "libceph: ... bad crc/signature" errors?
>
> I have "read_partial_message" too, for example:
>
> Oct 5 09:00:47 lorunde kernel: [65575.969322] libceph: read_partial_message ffff88027c231500 data crc 181941039 != exp. 115232978
> Oct 5 09:00:47 lorunde kernel: [65575.969953] libceph: osd122 10.0.0.31:6800 bad crc/signature
> Oct 5 09:04:30 lorunde kernel: [65798.958344] libceph: read_partial_message ffff880254a25c00 data crc 443114996 != exp. 2014723213
> Oct 5 09:04:30 lorunde kernel: [65798.959044] libceph: osd18 10.0.0.22:6802 bad crc/signature
> Oct 5 09:14:28 lorunde kernel: [66396.788272] libceph: read_partial_message ffff880238636200 data crc 1797729588 != exp. 2550563968
> Oct 5 09:14:28 lorunde kernel: [66396.788984] libceph: osd43 10.0.0.9:6804 bad crc/signature
> Oct 5 10:09:36 lorunde kernel: [69704.211672] libceph: read_partial_message ffff8802712dff00 data crc 2241944833 != exp. 762990605
> Oct 5 10:09:36 lorunde kernel: [69704.212422] libceph: osd103 10.0.0.28:6804 bad crc/signature
> Oct 5 10:25:41 lorunde kernel: [70669.203596] libceph: read_partial_message ffff880257521400 data crc 3655331946 != exp. 2796991675
> Oct 5 10:25:41 lorunde kernel: [70669.204462] libceph: osd16 10.0.0.21:6806 bad crc/signature
> Oct 5 10:25:52 lorunde kernel: [70680.255943] libceph: read_partial_message ffff880245e3d600 data crc 3787567693 != exp. 725251636
> Oct 5 10:25:52 lorunde kernel: [70680.257066] libceph: osd60 10.0.0.23:6800 bad crc/signature

OK, so both your and Josy's cases are actually the reverse: the kernel
detects the mismatch, so it's definitely not stable pages related.

When did you start seeing these errors?  Can you correlate that to
a ceph or kernel upgrade?  If not, and if you don't see other issues,
I'd write it off as faulty hardware.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
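
To make the distinction drawn in this thread easier to check against a longer log, here is a minimal Python sketch that sorts libceph kernel messages into the two groups discussed: mismatches the kernel itself detects ("read_partial_message ... data crc" and "bad crc/signature") versus connection resets of the "socket closed" / "socket error on write" kind. The function names, category labels, and regular expressions are my own and are based only on the message formats quoted above; other kernel versions may word these lines differently, and a reset line on its own does not prove the stable pages bug.

```python
import re
from collections import Counter

# Kernel-side detection: the client found a CRC mismatch in a message it read.
# Patterns are derived from the log lines quoted in this thread (assumption:
# other kernels may format these messages differently).
READ_CRC = re.compile(
    r"libceph: read_partial_message \S+ data crc (\d+) != exp\. (\d+)"
)
BAD_CRC = re.compile(r"libceph: (osd\d+) (\S+) bad crc/signature")

# Connection resets, as seen when the OSD side rejects what the kernel sent.
RESET = re.compile(r"libceph: (osd\d+) (\S+) socket (?:closed|error on write)")


def classify(line):
    """Return a hypothetical category label for one log line, or None."""
    if READ_CRC.search(line) or BAD_CRC.search(line):
        return "kernel_detected"
    if RESET.search(line):
        return "connection_reset"
    return None


def summarize(lines):
    """Count lines per category and 'bad crc/signature' hits per OSD."""
    kinds = Counter()
    per_osd = Counter()
    for line in lines:
        kind = classify(line)
        if kind is None:
            continue
        kinds[kind] += 1
        match = BAD_CRC.search(line)
        if match:
            per_osd[match.group(1)] += 1
    return kinds, per_osd
```

Feeding it the lines of a saved kernel log (for example `summarize(open(path))`, with `path` being wherever your syslog lives) gives a quick view of which category dominates and whether the errors cluster on a few OSDs or are spread across many, which may help when trying to correlate them with an upgrade or with a specific piece of hardware.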