Re: corrupted rbd filesystems since jewel

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Wed, 17 May 2017 07:17:06 +0200

Something I could do/test to find the bug?

Stefan
Excuse my typo sent from my mobile phone.

Am 16.05.2017 um 22:54 schrieb Jason Dillaman <jdillama@xxxxxxxxxx>:

It looks like it's just a ping message in that capture.

Are you saying that you restarted OSD 46 and the problem persisted?

On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
Hello,

while reproducing the problem, objecter_requests looks like this:

{
    "ops": [
        {
            "tid": 42029,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "object_locator": "@5",
            "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "2.28854e+06s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "a07c2=[]",
            "mtime": "2017-05-16 21:53:22.0.069541s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "5.5f3bd635",
            "osd": 17,
            "object_id": "rbd_header.e10ca56b8b4567",
            "object_locator": "@5",
            "target_object_id": "rbd_header.e10ca56b8b4567",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}

Yes they've an established TCP connection. Qemu <=> osd.46. Attached is
a pcap file of the traffic between them when it got stuck.

Greets,
Stefan

Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
We've enabled the op tracker for performance reasons while using SSD
only storage ;-(

Disabled you mean?

Can enable the op tracker using ceph osd tell? Than reproduce the
problem. Check what has stucked again? Or should i generate an rbd log
from the client?

From a super-quick glance at the code, it looks like that isn't a
dynamic setting. Of course, it's possible that if you restart OSD 46
to enable the op tracker, the stuck op will clear itself and the VM
will resume. You could attempt to generate a gcore of OSD 46 to see if
information on that op could be extracted via the debugger, but no
guarantees.

You might want to verify that the stuck client and OSD 46 have an
actual established TCP connection as well before doing any further
actions.

-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com