On Sun, Oct 12, 2014 at 9:29 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>
> On 12/10/2014 18:22, Gregory Farnum wrote:
>> On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>
>>> On 12/10/2014 17:48, Gregory Farnum wrote:
>>>> On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> On a 0.80.6 cluster the command
>>>>>
>>>>> ceph tell osd.6 version
>>>>>
>>>>> hangs forever. I checked that it establishes a TCP connection to the OSD and raised the OSD debug level to 20, but I do not see
>>>>>
>>>>> https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991
>>>>>
>>>>> in the logs. All other OSDs answer the same "version" command as they should, and "ceph daemon osd.6 version" on the machine running OSD 6 responds as it should. There is also an ever-growing number of slow requests on this OSD, but no errors in the logs. In other words, except for taking forever to answer any kind of request, the OSD looks fine.
>>>>>
>>>>> Another OSD running on the same machine is behaving well.
>>>>>
>>>>> Any idea what that behaviour relates to?
>>>>
>>>> What commands have you run? The admin socket commands don't require
>>>> nearly as many locks, nor do they go through the same event loops that
>>>> messages do. You might have found a deadlock or something. (In which
>>>> case just restarting the OSD would probably fix it, but you should
>>>> grab a core dump first.)
>>>
>>> # /etc/init.d/ceph stop osd.6
>>> === osd.6 ===
>>> Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6
>>> === osd.6 ===
>>> Starting Ceph osd.6 on g3...
>>> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>> { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>>
>>> and now it blocks. It looks like a deadlock happens shortly after it boots.
>>
>> Is this the same cluster you're reporting on in the tracker?
>
> Yes, it is the same cluster as http://tracker.ceph.com/issues/9750. Although I can't imagine how the two could be related, they probably are.
>
>> Anyway, apparently it's a disk state issue. I have no idea what kind
>> of bug in Ceph could cause this, so my guess is that a syscall is
>> going out to lunch, although that should get caught up in the
>> internal heartbeat check-in code. Like I said, grab a core dump and
>> look for deadlocks or blocked syscalls in the filestore.
>
> I created http://tracker.ceph.com/issues/9751 and attached the log with debug_filestore = 20. There are many slow requests but I can't relate them to any kind of error.
>
> It does not core dump; should I kill it to get a core dump and then examine it? I've never tried that ;-)

That's what I was thinking; you send it a SIGQUIT signal and it'll
dump. Or apparently you can use "gcore" instead, which won't quit it.
The log doesn't have anything glaringly obvious; was it already "hung"
when you packaged that? If so, it must be some kind of deadlock and
the backtraces from the core dump will probably tell us what happened.
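If it helps, roughly what that could look like on g3 (the pid-file path and output locations below are just placeholders, gcore and gdb need to be installed, and the debug symbols package makes the backtraces much more readable):

# PID of osd.6's ceph-osd (assuming the init script wrote a pid file; otherwise take it from ps)
OSD_PID=$(cat /var/run/ceph/osd.6.pid)

# take a core without stopping the daemon; this writes /tmp/ceph-osd.6.$OSD_PID
gcore -o /tmp/ceph-osd.6 $OSD_PID

# or: SIGQUIT dumps core and exits, provided the core size ulimit allows it
# kill -QUIT $OSD_PID

# dump every thread's backtrace and look for threads stuck on a lock or in a syscall
gdb -batch -ex 'thread apply all bt' /usr/bin/ceph-osd /tmp/ceph-osd.6.$OSD_PID > /tmp/osd.6-backtraces.txt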
> One way or the other the problem will be fixed soon (tonight). I'd like to take advantage of the broken state we have to figure it out. Resurrecting the OSD may unblock http://tracker.ceph.com/issues/9751 and may also unblock http://tracker.ceph.com/issues/9750, but we'll lose a chance to diagnose this rare condition.
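Since the admin socket still answers even while the messenger path is stuck, it may also be worth capturing what osd.6 thinks it is working on before restarting it. Roughly (run on g3; the output paths are just placeholders):

# ops the OSD is currently processing -- the slow requests should show up here
ceph daemon osd.6 dump_ops_in_flight > /tmp/osd.6-ops_in_flight.json

# recently completed slow ops, with per-stage timestamps
ceph daemon osd.6 dump_historic_ops > /tmp/osd.6-historic_ops.json

# internal counters (queue lengths, journal and filestore latencies, ...)
ceph daemon osd.6 perf dump > /tmp/osd.6-perf.json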