Re: corrupted rbd filesystems since jewel

I can test in 2 hours, but it sounds like http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014773.html

Stefan

Excuse any typos; sent from my mobile phone.

On 17.05.2017 at 17:05, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:

OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
during this)? If that succeeds while your VM remains blocked on that
remove op, it looks like there is some problem in the OSD where ops
queued on a scrub are not properly awoken when the scrub completes.
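
Roughly something like this should do it (the revert level and the log
path below are the usual defaults, adjust to whatever you run locally):

  ceph tell osd.23 injectargs '--debug-osd 20'
  rados -p <rbd pool> rm rbd_data.21aafa6b8b4567.0000000000000aaa
  ceph tell osd.23 injectargs '--debug-osd 0/5'

and then grab /var/log/ceph/ceph-osd.23.log covering that window.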

On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
Hello Jason,

after enabling the log and generating a gcore dump, the request was
successful ;-(

So the log only contains the successful request; that is all I was able
to catch. I can send you the log on request.

Luckily I had another VM on another cluster behaving the same way.

This time osd.23:
# ceph --admin-daemon
/var/run/ceph/ceph-client.admin.22969.140085040783360.asok
objecter_requests
{
   "ops": [
       {
           "tid": 18777,
           "pg": "2.cebed0aa",
           "osd": 23,
           "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
           "object_locator": "@2",
           "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
           "target_object_locator": "@2",
           "paused": 0,
           "used_replica": 0,
           "precalc_pgid": 0,
           "last_sent": "1.83513e+06s",
           "attempts": 1,
           "snapid": "head",
           "snap_context": "28a43=[]",
           "mtime": "2017-05-17 16:51:06.0.455475s",
           "osd_ops": [
               "delete"
           ]
       }
   ],
   "linger_ops": [
       {
           "linger_id": 1,
           "pg": "2.f0709c34",
           "osd": 23,
           "object_id": "rbd_header.21aafa6b8b4567",
           "object_locator": "@2",
           "target_object_id": "rbd_header.21aafa6b8b4567",
           "target_object_locator": "@2",
           "paused": 0,
           "used_replica": 0,
           "precalc_pgid": 0,
           "snapid": "head",
           "registered": "1"
       }
   ],
   "pool_ops": [],
   "pool_stat_ops": [],
   "statfs_ops": [],
   "command_ops": []
}

OSD Logfile of OSD 23 attached.
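
If it helps I can also dump the op from the OSD side on osd.23's node
(not sure dump_blocked_ops exists on this release):

  ceph daemon osd.23 dump_ops_in_flight
  ceph daemon osd.23 dump_blocked_ops
  ceph daemon osd.23 dump_historic_ops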

Greets,
Stefan

On 17.05.2017 at 16:26, Jason Dillaman wrote:
On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
You mean I should capture the request whether it is successful or not?
Which log level should be set to 20?


I'm hoping you can re-create the hung remove op when OSD logging is
increased -- "debug osd = 20" would be nice if you can turn it up that
high while attempting to capture the blocked op.
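
On a running OSD something like this should work (osd.<id> being the OSD
that holds the blocked object; level 20 is very verbose, so turn it back
down once you have the capture):

  ceph daemon osd.<id> config set debug_osd 20
  ceph daemon osd.<id> config set debug_osd 0/5   # revert afterwards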




--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
