Perfect librbd log capture. I can see that a remove request for object
rbd_data.e10ca56b8b4567.000000000000311c was issued but never completed,
which results in the hung discard and flush requests. Assuming that object
is still mapped to OSD 46, I think something is happening with either that
OSD or its backing disk that is preventing the remove for that object from
completing.

On Wed, May 17, 2017 at 6:04 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> attached is an rbd debug 20 log from the situation when the deadlock
> occurs. I hope this helps.
>
> The VM was running until 2017-05-17 12:10, but there was no I/O for
> 10 minutes.
>
> Greets,
> Stefan
>
> On 16.05.2017 at 22:54, Jason Dillaman wrote:
>> It looks like it's just a ping message in that capture.
>>
>> Are you saying that you restarted OSD 46 and the problem persisted?
>>
>> On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> while reproducing the problem, objecter_requests looks like this:
>>>
>>> {
>>>     "ops": [
>>>         {
>>>             "tid": 42029,
>>>             "pg": "5.bd9616ad",
>>>             "osd": 46,
>>>             "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>>>             "object_locator": "@5",
>>>             "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>>>             "target_object_locator": "@5",
>>>             "paused": 0,
>>>             "used_replica": 0,
>>>             "precalc_pgid": 0,
>>>             "last_sent": "2.28854e+06s",
>>>             "attempts": 1,
>>>             "snapid": "head",
>>>             "snap_context": "a07c2=[]",
>>>             "mtime": "2017-05-16 21:53:22.0.069541s",
>>>             "osd_ops": [
>>>                 "delete"
>>>             ]
>>>         }
>>>     ],
>>>     "linger_ops": [
>>>         {
>>>             "linger_id": 1,
>>>             "pg": "5.5f3bd635",
>>>             "osd": 17,
>>>             "object_id": "rbd_header.e10ca56b8b4567",
>>>             "object_locator": "@5",
>>>             "target_object_id": "rbd_header.e10ca56b8b4567",
>>>             "target_object_locator": "@5",
>>>             "paused": 0,
>>>             "used_replica": 0,
>>>             "precalc_pgid": 0,
>>>             "snapid": "head",
>>>             "registered": "1"
>>>         }
>>>     ],
>>>     "pool_ops": [],
>>>     "pool_stat_ops": [],
>>>     "statfs_ops": [],
>>>     "command_ops": []
>>> }
>>>
>>> Yes, they have an established TCP connection, Qemu <=> osd.46.
>>> Attached is a pcap file of the traffic between them when it got stuck.
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 16.05.2017 at 21:45, Jason Dillaman wrote:
>>>> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> We've enabled the op tracker for performance reasons while using SSD
>>>>> only storage ;-(
>>>>
>>>> Disabled, you mean?
>>>>
>>>>> Can I enable the op tracker using ceph osd tell, then reproduce the
>>>>> problem and check what is stuck again? Or should I generate an rbd
>>>>> log from the client?
>>>>
>>>> From a super-quick glance at the code, it looks like that isn't a
>>>> dynamic setting. Of course, it's possible that if you restart OSD 46
>>>> to enable the op tracker, the stuck op will clear itself and the VM
>>>> will resume. You could attempt to generate a gcore of OSD 46 to see
>>>> if information on that op can be extracted via the debugger, but no
>>>> guarantees.
>>>>
>>>> You might want to verify that the stuck client and OSD 46 have an
>>>> actual established TCP connection before taking any further action.
>>>>
>>
>>
>>

--
Jason
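
For reference, a rough sketch of the commands that would gather and
cross-check the state discussed above. The admin socket path, pool name,
and osd.46 address below are illustrative placeholders, not values taken
from this thread; adjust them to the actual environment.

    # Dump in-flight objecter requests from the librbd client (assumes the
    # qemu process has an admin socket configured; the path is an example)
    ceph --admin-daemon /var/run/ceph/ceph-client.<id>.asok objecter_requests

    # Confirm which PG and OSDs the stuck object currently maps to
    # (<pool> is the name of pool id 5 in this cluster)
    ceph osd map <pool> rbd_data.e10ca56b8b4567.000000000000311c

    # Find osd.46's address, then verify on the hypervisor that the client
    # still holds an established TCP connection to it
    ceph osd find 46
    ss -tnp | grep <osd46-addr:port>

    # Run on the node hosting osd.46; with the op tracker disabled this
    # returns nothing useful, but after restarting the OSD with the tracker
    # enabled it would show any op that is still stuck
    ceph daemon osd.46 dump_ops_in_flight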