Hello Jason,

I'm happy to tell you that I currently have one VM where I can reproduce the problem.

> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.

I've saved the dump, but it will contain sensitive information, so I won't upload it to a public server. I'll send you a private email with a private server from which you can download the core dump. Thanks!

> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.

This is the output:

# ceph --admin-daemon /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
{
    "ops": [
        {
            "tid": 384632,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "object_locator": "@5",
            "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "2.28554e+06s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "a07c2=[]",
            "mtime": "2017-05-16 21:03:22.0.196102s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "5.5f3bd635",
            "osd": 17,
            "object_id": "rbd_header.e10ca56b8b4567",
            "object_locator": "@5",
            "target_object_id": "rbd_header.e10ca56b8b4567",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}

Greets,
Stefan

On 16.05.2017 at 15:44, Jason Dillaman wrote:
> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> 3.) it still happens on pre jewel images even when they got restarted /
>> killed and reinitialized. In that case they've the asok socket available
>> for now. Should i issue any command to the socket to get log out of the
>> hanging vm? Qemu is still responding just ceph / disk i/O gets stalled.
>
> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.
>
> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com