Re: corrupted rbd filesystems since jewel

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Tue, 16 May 2017 22:02:59 +0200

Hello,

while I was reproducing the problem, objecter_requests looked like this:

{
    "ops": [
        {
            "tid": 42029,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "object_locator": "@5",
            "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "2.28854e+06s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "a07c2=[]",
            "mtime": "2017-05-16 21:53:22.0.069541s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "5.5f3bd635",
            "osd": 17,
            "object_id": "rbd_header.e10ca56b8b4567",
            "object_locator": "@5",
            "target_object_id": "rbd_header.e10ca56b8b4567",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
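
For comparing several such dumps taken while the op is stuck, a small script can help. The following is only a sketch: it assumes the objecter_requests output has been saved as a string, the helper names are made up for illustration, and SAMPLE is a trimmed copy of the dump above. The field names are the ones shown in the dump.

```python
import json


def summarize_inflight_ops(dump_text):
    """Return one small dict per in-flight op found in an objecter_requests dump."""
    dump = json.loads(dump_text)
    summary = []
    for op in dump.get("ops", []):
        summary.append({
            "tid": op["tid"],
            "osd": op["osd"],
            "pg": op["pg"],
            "object_id": op["object_id"],
            "attempts": op["attempts"],
            "osd_ops": list(op.get("osd_ops", [])),
        })
    return summary


def ops_per_osd(dump_text):
    """Count in-flight ops per target OSD id."""
    counts = {}
    for op in summarize_inflight_ops(dump_text):
        counts[op["osd"]] = counts.get(op["osd"], 0) + 1
    return counts


# Trimmed copy of the dump above (only the fields the helpers read).
SAMPLE = """
{
    "ops": [
        {
            "tid": 42029,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "attempts": 1,
            "osd_ops": ["delete"]
        }
    ],
    "linger_ops": []
}
"""

if __name__ == "__main__":
    for op in summarize_inflight_ops(SAMPLE):
        print(op)
    print(ops_per_osd(SAMPLE))
```

If the same tid shows up against the same OSD in consecutive dumps, that is the op to chase.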

Yes, they have an established TCP connection (Qemu <=> osd.46). Attached
is a pcap file of the traffic between them, captured while the op was stuck.
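
One way to double-check the connection state on the client host is to read /proc/net/tcp, where state 01 means ESTABLISHED. The sketch below is a rough illustration under several assumptions: IPv4 only, a little-endian Linux host, and made-up addresses and port in SAMPLE_PROC_NET_TCP (they are not the real addresses of the Qemu host or osd.46). The function names are hypothetical.

```python
import socket
import struct

TCP_ESTABLISHED = 0x01  # value of the "st" column for an established socket


def decode_addr(hex_addr):
    """Decode an 'IPHEX:PORTHEX' field from /proc/net/tcp (little-endian host assumed)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def established_to(proc_net_tcp_text, remote_ip, remote_port):
    """Return (local, remote) address pairs of established connections to the given peer."""
    matches = []
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; this skips the header and blank lines.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        if int(fields[3], 16) != TCP_ESTABLISHED:
            continue
        remote = decode_addr(fields[2])
        if remote == (remote_ip, remote_port):
            matches.append((decode_addr(fields[1]), remote))
    return matches


# Hypothetical sample: one listening socket and one established connection
# from 10.0.0.10:50000 to 10.0.0.20:6800 (addresses invented for illustration).
SAMPLE_PROC_NET_TCP = """
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11111
   1: 0A00000A:C350 1400000A:1A90 01 00000000:00000000 00:00000000 00000000     0        0 22222
"""

if __name__ == "__main__":
    print(established_to(SAMPLE_PROC_NET_TCP, "10.0.0.20", 6800))
```

On a real host the text would come from reading /proc/net/tcp, with the OSD's actual address and port substituted.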

Greets,
Stefan

On 16.05.2017 at 21:45, Jason Dillaman wrote:
> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> We've enabled the op tracker for performance reasons while using SSD
>> only storage ;-(
> 
> Disabled you mean?
> 
>> Can I enable the op tracker using ceph osd tell? Then reproduce the
>> problem and check what got stuck again? Or should I generate an rbd log
>> from the client?
> 
> From a super-quick glance at the code, it looks like that isn't a
> dynamic setting. Of course, it's possible that if you restart OSD 46
> to enable the op tracker, the stuck op will clear itself and the VM
> will resume. You could attempt to generate a gcore of OSD 46 to see if
> information on that op could be extracted via the debugger, but no
> guarantees.
> 
> You might want to verify that the stuck client and OSD 46 have an
> actual established TCP connection as well before doing any further
> actions.
> 
Attachment: osd.46_qemu_2.pcap.gz (GNU Zip compressed data)