Thanks for the update. In the ops dump provided, the objecter is saying
that OSD 46 hasn't responded to the deletion request for object
rbd_data.e10ca56b8b4567.000000000000311c. Perhaps run "ceph daemon
osd.46 dump_ops_in_flight" or "... dump_historic_ops" to see if that op
is in the list? You can also run "ceph osd map <pool name>
rbd_data.e10ca56b8b4567.000000000000311c" to verify that OSD 46 is the
primary OSD for that object's PG. A rough sequence of these checks is
sketched at the end of this message.

On Tue, May 16, 2017 at 3:14 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> I'm happy to tell you that I currently have one VM where I can
> reproduce the problem.
>
>> The best option would be to run "gcore" against the running VM whose
>> IO is stuck, compress the dump, and use "ceph-post-file" to
>> provide the dump. I could then look at all the Ceph data structures to
>> hopefully find the issue.
>
> I've saved the dump, but it will contain sensitive information, so I
> won't upload it to a public server. I'll send you a private email with
> a private server to download the core dump. Thanks!
>
>> Enabling debug logs after the IO has stuck will most likely be of
>> little value since it won't include the details of which IOs are
>> outstanding. You could attempt to use "ceph --admin-daemon
>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>> stuck waiting on an OSD to respond.
>
> This is the output:
> # ceph --admin-daemon
> /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
> {
>     "ops": [
>         {
>             "tid": 384632,
>             "pg": "5.bd9616ad",
>             "osd": 46,
>             "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>             "object_locator": "@5",
>             "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>             "target_object_locator": "@5",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "2.28554e+06s",
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "a07c2=[]",
>             "mtime": "2017-05-16 21:03:22.0.196102s",
>             "osd_ops": [
>                 "delete"
>             ]
>         }
>     ],
>     "linger_ops": [
>         {
>             "linger_id": 1,
>             "pg": "5.5f3bd635",
>             "osd": 17,
>             "object_id": "rbd_header.e10ca56b8b4567",
>             "object_locator": "@5",
>             "target_object_id": "rbd_header.e10ca56b8b4567",
>             "target_object_locator": "@5",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "snapid": "head",
>             "registered": "1"
>         }
>     ],
>     "pool_ops": [],
>     "pool_stat_ops": [],
>     "statfs_ops": [],
>     "command_ops": []
> }
>
> Greets,
> Stefan
>
> On 16.05.2017 at 15:44, Jason Dillaman wrote:
>> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> 3.) It still happens on pre-Jewel images even when they get
>>> restarted / killed and reinitialized. In that case they have the
>>> asok socket available for now. Should I issue any command to the
>>> socket to get a log out of the hanging VM? QEMU is still responding;
>>> just Ceph / disk I/O gets stalled.
>>
>> The best option would be to run "gcore" against the running VM whose
>> IO is stuck, compress the dump, and use "ceph-post-file" to
>> provide the dump. I could then look at all the Ceph data structures to
>> hopefully find the issue.
>>
>> Enabling debug logs after the IO has stuck will most likely be of
>> little value since it won't include the details of which IOs are
>> outstanding. You could attempt to use "ceph --admin-daemon
>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>> stuck waiting on an OSD to respond.
>> --
>> Jason
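
[Editor's note: a minimal shell sketch of the checks suggested at the top
of this message, in order. The "<pool name>" placeholder is kept from the
thread and must be replaced with the pool backing the image; the OSD id,
object name, and admin socket path are the ones reported above.]

# 1. Confirm that OSD 46 is the primary for the PG holding the object:
ceph osd map <pool name> rbd_data.e10ca56b8b4567.000000000000311c

# 2. On the node hosting osd.46, check whether the delete op is still in
#    flight, or shows up among the recently completed ops:
ceph daemon osd.46 dump_ops_in_flight
ceph daemon osd.46 dump_historic_ops

# 3. Re-dump the stuck client's outstanding requests from its admin socket
#    to see whether tid 384632 is still waiting on osd.46:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests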