Re: Recover pgs from cephfs metadata pool (sharing experience)

Hello Goncalo,
AFAIK the authoritative shard is determined from deep-scrub object
checksums, which were introduced in Hammer. Is this in line with your
experience? If yes, is there any other method of determining the auth
shard besides object timestamps for Ceph < Jewel?

Kostis

On 13 September 2016 at 06:44, Goncalo Borges
<goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi All...
>
> Just dropping a small email to share our experience on how to recover a pg
> from a cephfs metadata pool.
>
> The reason why I am sharing this information is that the general
> understanding of how to recover a pg (check [1]) relies on identifying
> incorrect objects by comparing checksums between the different replicas.
> This procedure cannot be applied to inconsistent pgs in the cephfs
> metadata pool because all the objects have zero size and the actual
> information is stored as omap key/value pairs in the OSDs' LevelDB.
>
> As a practical example, some time ago we had the following error:
>
> 2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster
> [INF] 5.3d0 deep-scrub starts
> 2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster
> [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest
> 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
> 2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster
> [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
> 2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster
> [ERR] 5.3d0 deep-scrub 1 errors
>
> The acting osds for this pg were [78,59,49] and 78 was the primary.
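>
> You can check the up / acting sets for a pg at any time with:
>
> # ceph pg map 5.3d0
>
> The first osd listed in the acting set is the primary.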
>
> The error tells us that the digest of the omap information on shard /
> osd 78 diverges from the one on shard / osd 49. The omap_digest is a
> CRC32 calculated over the omap header & key/values. Also please note
> that the log gives you a shard and an auth shard osd id. This is
> important for understanding how 'pg repair' works in this case.
>
> Another useful way to understand what is divergent is to use 'rados
> list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is
> available in Jewel releases (see the output of that command at the end
> of this email). It tells you in a nice and clean way which object is
> problematic, what the source of the divergence is, and which osd is
> affected. In our case, the tool confirms that there is an
> omap_digest_mismatch and that osd 78 is the one that differs from the
> other two. Please note that the information printed by the command
> reflects the deep-scrub that flagged the inconsistency: if you live with
> that error for some time and your logs rotate, you may have to run a
> manual deep-scrub on the pg for the command to produce useful
> information again.
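>
> In practice that would be something along these lines (pg id from our
> case):
>
> # ceph pg deep-scrub 5.3d0
> # (wait for the deep-scrub to finish, e.g. by watching 'ceph -w')
> # rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat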
>
> If you actually want to understand the source of our divergence, you can
> go through [2], where we found that osd.78 was missing ~500 keys (we are
> still in the process of understanding why that happened).
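>
> If you want to compare the omap contents of each replica yourself, one
> way is ceph-objectstore-tool, run with the osd stopped. A rough sketch
> (the data / journal paths are the filestore defaults and may differ on
> your setup; stop / start the osd with whatever your distro uses):
>
> # systemctl stop ceph-osd@78
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78 \
>       --journal-path /var/lib/ceph/osd/ceph-78/journal \
>       --pgid 5.3d0 602.00000000 list-omap > /tmp/omap_keys_osd78.txt
> # systemctl start ceph-osd@78
>
> Repeat on the hosts of the other replicas and diff the resulting files.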
>
> Our fear was that, as commonly mentioned in many forums, a pg repair would
> push the copies from the primary osd to its peers, leading, in our case, to
> data corruption.
>
> However, going through the code and with the help of Brad Hubbard from
> RH, we understood that a pg repair triggers a copy from the auth shard
> to the problematic shard. Please note that the auth shard may not be the
> primary osd. In our case, running a 'pg repair' resulted in an updated
> object on osd.78 (which is the primary osd). The timestamps of the same
> object on the peers remained unchanged. We also collected the object's
> omap key list before and after the recovery and checked that all the
> previously missing keys were now present. Again, if you want details,
> please check [2].
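>
> For reference, the repair / verification part boils down to something
> like this (pg id and osd path from our case; the find is just a crude
> way to look at the on-disk timestamps of the object on a filestore osd):
>
> # ceph pg repair 5.3d0
> # ceph pg deep-scrub 5.3d0
> # rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
> # find /var/lib/ceph/osd/ceph-78/current/5.3d0_head -name '602.00000000*' -exec ls -l {} \;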
>
> Hope this is useful for others.
>
> Cheers
>
> Goncalo
>
>
> [1] http://ceph.com/planet/ceph-manually-repair-object/
>
> [2] http://tracker.ceph.com/issues/17177#change-78032
>
>
> # rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
> [
>     {
>         "object": {
>             "name": "602.00000000",
>             "nspace": "",
>             "locator": "",
>             "snap": "head"
>         },
>         "missing": false,
>         "stat_err": false,
>         "read_err": false,
>         "data_digest_mismatch": false,
>         "omap_digest_mismatch": true,
>         "size_mismatch": false,
>         "attr_mismatch": false,
>         "shards": [
>             {
>                 "osd": 49,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": false,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0xaa3fd281",
>                 "data_digest": "0xffffffff"
>             },
>             {
>                 "osd": 59,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": false,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0xaa3fd281",
>                 "data_digest": "0xffffffff"
>             },
>             {
>                 "osd": 78,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": true,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0x7600bd9e",
>                 "data_digest": "0xffffffff"
>             }
>         ]
>     }
> ]
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


