On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>
> On 12/10/2014 17:48, Gregory Farnum wrote:
>> On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> On a 0.80.6 cluster the command
>>>
>>> ceph tell osd.6 version
>>>
>>> hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see
>>>
>>> https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991
>>>
>>> in the logs. All other OSDs answer to the same "version" command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine.
>>>
>>> Another OSD running on the same machine is behaving well.
>>>
>>> Any idea what that behaviour relates to ?
>>
>> What commands have you run? The admin socket commands don't require
>> nearly as many locks, nor do they go through the same event loops that
>> messages do. You might have found a deadlock or something. (In which
>> case just restarting the OSD would probably fix it, but you should
>> grab a core dump first.)
>
> # /etc/init.d/ceph stop osd.6
> === osd.6 ===
> Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done
> root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6
> === osd.6 ===
> Starting Ceph osd.6 on g3...
> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal
> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
> { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>
> and now it blocks. It looks like a deadlock happens shortly after it boots.

Is this the same cluster you're reporting on in the tracker? Anyway, apparently it's a disk state issue.
I have no idea what kind of bug in Ceph could cause this, so my guess is that a syscall is going out to lunch, although that should get caught by the internal heartbeat check-in code. Like I said, grab a core dump and look for deadlocks or blocked syscalls in the filestore.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
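
The following is a minimal sketch, not something posted in the thread and not a Ceph tool, of one way to look for threads stuck in a blocking syscall before taking the core dump. It assumes a Linux host, where each thread's state is exposed in /proc/<pid>/task/<tid>/status; state D (uninterruptible sleep) usually means the thread is waiting inside the kernel, typically on disk I/O. The helper names parse_status, thread_states and blocked_threads are hypothetical, and the PID of the ceph-osd process has to be supplied by hand.

```python
import os
from pathlib import Path


def parse_status(text):
    """Return (name, state_letter) from the text of a /proc status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # The State field looks like "D (disk sleep)"; keep only the first letter.
    state = fields.get("State", "?")[:1] or "?"
    return fields.get("Name", "?"), state


def thread_states(pid):
    """List (tid, name, state) for every thread of pid, read from /proc."""
    result = []
    tasks = sorted(Path(f"/proc/{pid}/task").iterdir(), key=lambda p: int(p.name))
    for task in tasks:
        try:
            name, state = parse_status((task / "status").read_text())
        except OSError:
            continue  # the thread exited while we were scanning
        result.append((int(task.name), name, state))
    return result


def blocked_threads(pid):
    """Threads in state D (uninterruptible sleep), i.e. waiting in the kernel."""
    return [t for t in thread_states(pid) if t[2] == "D"]
```

Calling blocked_threads(pid) a few times, a second or so apart, with the PID of the affected OSD shows whether the same thread stays in D. A thread that never leaves D is a candidate for the blocked syscall; a userspace deadlock would not show up this way, so the core dump is still needed for that.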