Re: corrupted rbd filesystems since jewel

It looks like it's just a ping message in that capture.

Are you saying that you restarted OSD 46 and the problem persisted?
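
If the op really is still outstanding after a restart, and osd.46 is now
running with the op tracker re-enabled (an assumption about your config,
since you mentioned it had been turned off), its side of the story should be
visible via the admin socket, e.g.:

  ceph daemon osd.46 dump_ops_in_flight
  ceph daemon osd.46 dump_historic_ops

Run those on the host carrying osd.46; with osd_enable_op_tracker = false
they won't report anything useful.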

On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello,
>
> While reproducing the problem, objecter_requests looks like this:
>
> {
>     "ops": [
>         {
>             "tid": 42029,
>             "pg": "5.bd9616ad",
>             "osd": 46,
>             "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>             "object_locator": "@5",
>             "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
>             "target_object_locator": "@5",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "last_sent": "2.28854e+06s",
>             "attempts": 1,
>             "snapid": "head",
>             "snap_context": "a07c2=[]",
>             "mtime": "2017-05-16 21:53:22.0.069541s",
>             "osd_ops": [
>                 "delete"
>             ]
>         }
>     ],
>     "linger_ops": [
>         {
>             "linger_id": 1,
>             "pg": "5.5f3bd635",
>             "osd": 17,
>             "object_id": "rbd_header.e10ca56b8b4567",
>             "object_locator": "@5",
>             "target_object_id": "rbd_header.e10ca56b8b4567",
>             "target_object_locator": "@5",
>             "paused": 0,
>             "used_replica": 0,
>             "precalc_pgid": 0,
>             "snapid": "head",
>             "registered": "1"
>         }
>     ],
>     "pool_ops": [],
>     "pool_stat_ops": [],
>     "statfs_ops": [],
>     "command_ops": []
> }
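>
> (That is the output of the objecter_requests admin socket command; assuming
> the usual client socket path, it can be reproduced with something like
>
>   ceph --admin-daemon /var/run/ceph/ceph-client.<id>.asok objecter_requests
>
> provided "admin socket" is configured in the client section of ceph.conf.)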
>
> Yes, they have an established TCP connection (qemu <=> osd.46). Attached
> is a pcap file of the traffic between them from when it got stuck.
>
> Greets,
> Stefan
>
> On 16.05.2017 at 21:45, Jason Dillaman wrote:
>> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> We've enabled the op tracker for performance reasons while using SSD
>>> only storage ;-(
>>
>> Disabled you mean?
>>
>>> Can I enable the op tracker using ceph osd tell, then reproduce the
>>> problem and check what got stuck again? Or should I generate an rbd
>>> log from the client?
>>
>> From a super-quick glance at the code, it looks like that isn't a
>> dynamic setting. Of course, it's possible that if you restart OSD 46
>> to enable the op tracker, the stuck op will clear itself and the VM
>> will resume. You could attempt to generate a gcore of OSD 46 to see if
>> information on that op could be extracted via the debugger, but no
>> guarantees.
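>>
>> If you go the gcore route, something along these lines should do it
>> (assuming gdb/gcore is installed on the OSD host and you substitute the
>> right PID):
>>
>>   gcore -o /tmp/osd.46.core <pid of the osd.46 ceph-osd process>
>>
>> pidof ceph-osd will list the candidates if more than one OSD runs on
>> that host.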
>>
>> You might want to verify that the stuck client and OSD 46 have an
>> actual established TCP connection as well before doing any further
>> actions.
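>>
>> A quick way to check that from the client host (assuming ss is available
>> and you fill in the right address) is something like
>>
>>   ss -tnp | grep <osd.46 address>
>>
>> and the mirror image on the OSD host; an ESTABLISHED entry with empty
>> send/recv queues on both sides would point away from the network itself.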
>>



-- 
Jason