Re: corrupted rbd filesystems since jewel

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Tue, 16 May 2017 21:14:45 +0200

Hello Jason,

I'm happy to tell you that I currently have one VM where I can reproduce
the problem.

> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.

I've saved the dump, but it contains sensitive information, so I won't
upload it to a public server. I'll send you a private email with a
private server from which you can download the core dump. Thanks!

> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.

This is the output:
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
{
    "ops": [
        {
            "tid": 384632,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "object_locator": "@5",
            "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "2.28554e+06s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "a07c2=[]",
            "mtime": "2017-05-16 21:03:22.0.196102s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "5.5f3bd635",
            "osd": 17,
            "object_id": "rbd_header.e10ca56b8b4567",
            "object_locator": "@5",
            "target_object_id": "rbd_header.e10ca56b8b4567",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
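
A dump like this can be condensed into one line per outstanding op, which makes it easier to see which OSD each request is waiting on. The following is only a minimal Python sketch, assuming the JSON layout shown above; the function names are hypothetical, and the "rbd_data.<image id>.<hex object number>" naming is an assumption read off the object_id field here.

```python
import json


def summarize_ops(dump_text):
    """Return one (tid, osd, object_id, osd_ops) tuple per in-flight op."""
    dump = json.loads(dump_text)
    return [
        (op["tid"], op["osd"], op["object_id"], op["osd_ops"])
        for op in dump.get("ops", [])
    ]


def rbd_object_number(object_id):
    """Split 'rbd_data.<image id>.<hex object number>' into (image id, number).

    Assumption: the object name has exactly this three-part form.
    """
    prefix, image_id, hex_no = object_id.split(".")
    if prefix != "rbd_data":
        raise ValueError("not an RBD data object: " + object_id)
    return image_id, int(hex_no, 16)


# Trimmed copy of the objecter_requests output above (only the fields used here).
sample = """
{"ops": [{"tid": 384632,
          "osd": 46,
          "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
          "osd_ops": ["delete"]}],
 "linger_ops": []}
"""

for tid, osd, object_id, osd_ops in summarize_ops(sample):
    image_id, number = rbd_object_number(object_id)
    print("tid %d: %s on %s waiting on osd.%d (image id %s, object #%d)"
          % (tid, ",".join(osd_ops), object_id, osd, image_id, number))
```

In practice the text would come from the saved output of the "objecter_requests" command rather than from an inline string.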

Greets,
Stefan

On 16.05.2017 at 15:44, Jason Dillaman wrote:
> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> 3.) It still happens on pre-Jewel images even when they were restarted /
>> killed and reinitialized. In that case they have the asok socket available
>> for now. Should I issue any command to the socket to get logs out of the
>> hanging VM? Qemu is still responding; only Ceph / disk I/O gets stalled.