Hello Goncalo,

AFAIK the authoritative shard is determined from the deep-scrub object checksums that were introduced in Hammer. Is this in line with your experience? If yes, is there any other method of determining the auth shard besides object timestamps for Ceph < Jewel?

Kostis

On 13 September 2016 at 06:44, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi All...
>
> Just dropping a small email to share our experience on how to recover a pg
> from a cephfs metadata pool.
>
> The reason why I am sharing this information is that the general recipe
> for recovering a pg (check [1]) relies on identifying the incorrect objects
> by comparing checksums between the different replicas. That procedure
> cannot be applied to inconsistent pgs in the cephfs metadata pool because
> all the objects there have zero size: the real content is stored as omap
> key/value pairs in the OSDs' leveldb.
>
> As a pragmatic example, some time ago we had the following error:
>
> 2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster [INF] 5.3d0 deep-scrub starts
> 2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
> 2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
> 2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster [ERR] 5.3d0 deep-scrub 1 errors
>
> The acting osds for this pg were [78,59,49], with 78 the primary.
>
> The error tells us that the digest of the omap information on shard /
> osd 78 diverges from that on shard / osd 49. The omap_digest is a CRC32
> calculated over the omap header & key/values. Also please note that the
> log gives you both a shard and an auth shard osd id; this is important
> for understanding how 'pg repair' works in this case.
>
> Another useful way to see what is divergent is 'rados
> list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is
> available in the Jewel releases (see the result of that command after
> this email). It tells you, in a nice and clean way, which object is
> problematic, what the source of the divergence is, and which osd is
> affected. In our case the tool confirms that there is an
> omap_digest_mismatch and that osd 78 is the one that differs from the
> other two. Please note that the information printed by the command comes
> from the most recent deep scrub of the pg: if you live with the error
> for some time and your logs rotate, you may have to run a manual
> deep-scrub on the pg for the command to print useful information again.
>
> If you want to understand the source of our divergence, you can go
> through [2], where we found that osd.78 was missing about ~500 keys (we
> are still in the process of understanding why that happened).
>
> Our fear was that, as commonly mentioned in many forums, a pg repair
> would push the copies from the primary osd to its peers, leading, in our
> case, to data corruption.
>
> However, going through the code and with the help of Brad Hubbard from
> RH, we understood that a pg repair triggers a copy from the auth shard
> to the problematic shard. Please note that the auth shard may not be the
> primary osd. In our precise case, running a 'pg repair' resulted in an
> updated object on osd.78 (which is the primary osd). The timestamps of
> the same object on the peers remained unchanged. We also collected the
> object's omap key list before and after the recovery and checked that
> all the previously missing keys were now present. Again, if you want
> details, please check [2].
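>
> For reference, the diagnostic side of the above boils down to something
> like the following sketch (the pool name 'metadata' is only a placeholder
> for your cephfs metadata pool; the pg and object ids are the ones from
> our case):
>
> # refresh the scrub information so the report below is current
> ceph pg deep-scrub 5.3d0
> # after the deep-scrub completes, dump the inconsistency report
> rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
> # the object has zero size, so inspect its omap rather than its data
> rados -p metadata listomapkeys 602.00000000 | wc -l
> rados -p metadata listomapvals 602.00000000 | head
>
> Keep in mind that rados reads are served by the primary osd, so
> listomapkeys above only shows you the primary's copy; comparing the
> replicas directly means going to each osd's store (for example with
> ceph-objectstore-tool's list-omap operation, with the osd stopped). See
> [2] for the details of how we compared ours.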
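>
> The repair and verification side, with the same caveats, looked roughly
> like this. It is a sketch of what we describe above rather than a
> general recipe: it relies on the fact that in our case the primary held
> the bad copy, so the rados reads below show exactly the shard that gets
> fixed:
>
> # snapshot the omap key list before repairing
> rados -p metadata listomapkeys 602.00000000 | sort > keys.before
> ceph pg repair 5.3d0
> # wait for the repair to finish ('ceph -w' or the osd logs show it), then
> rados -p metadata listomapkeys 602.00000000 | sort > keys.after
> # the previously missing keys should now appear
> diff keys.before keys.after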
>
> Hope this is useful for others.
>
> Cheers
>
> Goncalo
>
>
> [1] http://ceph.com/planet/ceph-manually-repair-object/
>
> [2] http://tracker.ceph.com/issues/17177#change-78032
>
>
> # rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
> [
>     {
>         "object": {
>             "name": "602.00000000",
>             "nspace": "",
>             "locator": "",
>             "snap": "head"
>         },
>         "missing": false,
>         "stat_err": false,
>         "read_err": false,
>         "data_digest_mismatch": false,
>         "omap_digest_mismatch": true,
>         "size_mismatch": false,
>         "attr_mismatch": false,
>         "shards": [
>             {
>                 "osd": 49,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": false,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0xaa3fd281",
>                 "data_digest": "0xffffffff"
>             },
>             {
>                 "osd": 59,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": false,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0xaa3fd281",
>                 "data_digest": "0xffffffff"
>             },
>             {
>                 "osd": 78,
>                 "missing": false,
>                 "read_error": false,
>                 "data_digest_mismatch": false,
>                 "omap_digest_mismatch": true,
>                 "size_mismatch": false,
>                 "data_digest_mismatch_oi": false,
>                 "omap_digest_mismatch_oi": false,
>                 "size_mismatch_oi": false,
>                 "size": 0,
>                 "omap_digest": "0x7600bd9e",
>                 "data_digest": "0xffffffff"
>             }
>         ]
>     }
> ]
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com