Recover pgs from cephfs metadata pool (sharing experience)

Hi All...

Just dropping a small email to share our experience on how to recover a pg from a cephfs metadata pool.

The reason why I am sharing this information is that the general understanding of how to recover a pg (check [1]) relies on identifying incorrect objects by comparing checksums between the different replicas. This procedure cannot be applied to inconsistent pgs in the cephfs metadata pool, because there all objects have zero size and the actual content is stored as omap key/value pairs in the OSDs' LevelDB.
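You can easily verify this yourself: the objects report a zero size, yet their omap is full of keys. For example (the pool name 'cephfs_metadata' is ours; adapt it to your setup):

# rados -p cephfs_metadata stat 602.00000000
# rados -p cephfs_metadata listomapkeys 602.00000000

The first command shows 'size 0', while the second lists the omap keys where the metadata actually lives.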

As a concrete example, some time ago we had the following error:

2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster [INF] 5.3d0 deep-scrub starts
2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster [ERR] 5.3d0 deep-scrub 1 errors

The acting osds for this pg were [78,59,49] and 78 was the primary.
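If you need to find the acting set of a pg yourself, 'ceph pg map' prints it:

# ceph pg map 5.3d0

The first osd in the acting set is the primary.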

The error is telling us that the digest of the omap information on shard / osd 78 diverges from the one on shard / osd 49. The omap_digest is a CRC32 calculated over the omap header and key/values. Also please note that the log gives you a shard and an auth shard osd id. This is important for understanding how 'pg repair' works in this case.

Another useful way to understand what is divergent is to run 'rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is available in Jewel releases (see the result of that command after this email). It tells you, in a nice and clean way, which object is problematic, what the source of the divergence is, and which osd holds the divergent copy. In our case, the tool confirms that there is an omap_digest_mismatch and that osd 78 is the one which differs from the other two. Please note that the information printed by the command comes from the last pg deep-scrub: if you live with the error for some time and your logs rotate, you may have to run a manual deep-scrub on the pg for the command to report useful information again.
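Triggering that manual deep-scrub is a one liner:

# ceph pg deep-scrub 5.3d0

and once the '5.3d0 deep-scrub starts' / 'deep-scrub ... errors' messages reappear in the cluster log, 'rados list-inconsistent-obj' will report useful output again.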

If you actually want to understand the source of our divergence, you can go through [2], where we found that osd.78 was missing ~500 keys (we are still in the process of understanding why that happened).
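For completeness, a sketch of how the per-replica key lists can be collected with ceph-objectstore-tool (the paths and service name below are illustrative, the exact invocation may vary with your release, and the osd must be stopped while the tool reads its store; on filestore you may also need --journal-path):

# systemctl stop ceph-osd@78
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78 \
      --pgid 5.3d0 602.00000000 list-omap > /tmp/omapkeys.osd78
# systemctl start ceph-osd@78

Repeating this on the hosts holding osds 49 and 59 and diffing the resulting files exposes exactly which keys differ.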

Our fear was that, as commonly mentioned in many forums, a pg repair would blindly push the primary osd's copy to its peers; since in our case the primary (osd.78) held the bad copy, that would have meant data corruption.

However, going through the code and with the help of Brad Hubbard from RH, we understood that a pg repair triggers a copy from the auth shard to the problematic shard. Please note that the auth shard may not be the primary osd. In our precise case, running a 'pg repair' resulted in an updated object on osd.78 (which is the primary osd), while the timestamps of the same object on the peers remained unchanged. We also collected the object's omap key list before and after the repair and checked that all the previously missing keys were now present. Again, if you want details, please check [2].
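For reference, the repair itself is just:

# ceph pg repair 5.3d0

and re-collecting the omap key list afterwards (e.g. with 'rados -p cephfs_metadata listomapkeys 602.00000000', pool name being ours) lets you confirm that the missing keys are back.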

Hope this is useful for others.

Cheers

Goncalo


[1] http://ceph.com/planet/ceph-manually-repair-object/

[2] http://tracker.ceph.com/issues/17177#change-78032


# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
[
    {
        "object": {
            "name": "602.00000000",
            "nspace": "",
            "locator": "",
            "snap": "head"
        },
        "missing": false,
        "stat_err": false,
        "read_err": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": true,
        "size_mismatch": false,
        "attr_mismatch": false,
        "shards": [
            {
                "osd": 49,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0xaa3fd281",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 59,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0xaa3fd281",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 78,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": true,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0x7600bd9e",
                "data_digest": "0xffffffff"
            }
        ]
    }
]

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
