On Mon, Feb 8, 2016 at 3:25 AM, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> Is there an equivalent of 'ceph health' but for OSDs?
>
> Like a warning about slowness or trouble with communication between OSDs?
>
> I spent a good amount of time debugging what looked like only stuck PGs,
> but it turned out to be a bad NIC, and that only became apparent once I
> saw some OSD logs like
>
> 2016-02-08 03:42:27.810289 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.14 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:27.810288)
> 2016-02-08 03:42:27.810297 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.15 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:27.810288)
> 2016-02-08 03:42:28.311125 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.14 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:28.311124)
>
> (turned out to be a bad NIC, fuck Emulex)
>
> Is there anything that could dump things like "failed heartbeats in
> the last 10 minutes" or similar stats?

I don't think that's exposed anywhere. If it happens often enough, the OSD will get killed. We could maybe add some tracking structures and an admin socket command to dump them from the OSD; you should create a feature request at tracker.ceph.com. :)
-Greg

>
> --
> Mariusz Gronczewski, Administrator
>
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczewski@xxxxxxxxxxxx
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
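
Until something like that admin socket command exists, one possible stopgap is to count the heartbeat_check lines straight from the OSD logs. The following is only a rough sketch, not a Ceph feature: the function name failed_heartbeats and the regular expression are made up for illustration, and the pattern assumes the log format matches the three lines quoted in this thread (it may differ between Ceph releases).

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed line layout, taken from the log lines quoted in this thread:
#   <date> <time> <thread id> <level> osd.N <epoch> heartbeat_check: no reply from osd.M ...
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+-?\d+\s+"
    r"(?P<reporter>osd\.\d+)\s+\d+\s+heartbeat_check: no reply from (?P<peer>osd\.\d+)"
)


def failed_heartbeats(lines, now, window=timedelta(minutes=10)):
    """Count heartbeat_check failures per (reporting OSD, silent peer) in the window ending at `now`."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.match(line)
        if match is None:
            continue  # not a heartbeat_check line
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        if now - window <= stamp <= now:
            counts[(match.group("reporter"), match.group("peer"))] += 1
    return counts


# The three log lines from the original message, used as sample input.
SAMPLE = """\
2016-02-08 03:42:27.810289 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.14 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:27.810288)
2016-02-08 03:42:27.810297 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.15 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:27.810288)
2016-02-08 03:42:28.311125 7fc9b8bff700 -1 osd.9 146800 heartbeat_check: no reply from osd.14 ever on either front or back, first ping sent 2016-02-08 03:39:24.860852 (cutoff 2016-02-08 03:39:28.311124)
"""

# Pretend "now" is 03:45 on the same day, so all three lines fall in the window.
summary = failed_heartbeats(SAMPLE.splitlines(), now=datetime(2016, 2, 8, 3, 45))
for (reporter, peer), count in sorted(summary.items()):
    print(f"{reporter} -> {peer}: {count} failed heartbeat checks")
```

In practice the lines would come from the OSD log files on each host rather than from a string, and a peer that shows up as silent in the logs of many different OSDs is a reasonable hint to look at that host's network first.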