Hi All.

Just dropping a small email to share our experience on how to recover a pg from a cephfs metadata pool. The reason why I am sharing this is that the general recipe for recovering a pg (check [1]) relies on identifying incorrect objects by comparing checksums between the different replicas. That procedure can not be applied to inconsistent pgs in the cephfs metadata pool, because all the objects there have zero size and the real core of the information is stored as omap key/value pairs in the osds' leveldb.
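As a quick illustration of that point (the pool and object names below are placeholders, not the actual ones from our cluster): a metadata object reports zero size under 'rados stat', while its real content shows up as omap keys:

# rados -p cephfs_metadata stat 602.00000000
# rados -p cephfs_metadata listomapkeys 602.00000000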
As a pragmatic example, some time ago a deep scrub flagged one of our metadata pgs (5.3d0) as inconsistent, reporting an omap_digest mismatch. The acting osds for this pg were [78,59,49], and 78 was the primary. The error is telling us that there is a divergence between the digest of the omap information on shard / osd 78 and the one on shard / osd 49. The omap_digest is a CRC32 calculated over the omap header & key/values. Also please note that the log gives you both a shard and an auth shard osd id. This is important for understanding how 'pg repair' works in this case.
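If the original log line has already rotated away, the basics (which pgs are inconsistent, and their acting set / primary) can still be recovered with something like:

# ceph health detail | grep inconsistent
# ceph pg map 5.3d0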
Another useful way to understand what is divergent is to run 'rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is available in Jewel releases (see the result of that command after this email). That tells you, in a nice and clean way, which object is problematic, what the source of the divergence is, and which osd is the problematic one. In our case, the tool confirms that there is an omap_digest_mismatch and that osd 78 is the one which differs from the other two. Please note that the information spit out by the command is the result of the initial pg deep scrub: if you live with the error for some time and your logs rotate, you may have to run a manual deep-scrub on the pg for that command to spit out useful information again.
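For reference, the manual deep-scrub is simply the following (using our pg id); you then need to wait for it to complete before re-running the list-inconsistent-obj command:

# ceph pg deep-scrub 5.3d0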
If you actually want to understand the source of our divergence, you can go through [2], where we found that osd.78 was missing about ~500 keys (we are still in the process of understanding why that happened).

Our fear was that, as commonly mentioned in many forums, a 'pg repair' would push the copies from the primary osd to its peers, leading, in our case, to data corruption. However, going through the code, and with the help of Brad Hubbard from RH, we understood that a 'pg repair' triggers a copy from the auth shard to the problematic shard. Please note that the auth shard may not be the primary osd. In our precise case, running a 'pg repair' resulted in an updated object on osd.78 (which is the primary osd), while the timestamps of the same object on the peers remained unchanged. We also collected the object's omap key list before and after the repair and checked that all the previously missing keys were present afterwards. Again, if you want details, please check [2].
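For illustration, a rough sketch of that kind of before / after check (pool and object names are placeholders again, and you have to wait for the repair to actually finish before collecting the 'after' list). Keep in mind that 'rados listomapkeys' reads via the primary osd, so this mainly checks that the previously missing keys are back on osd.78:

# rados -p cephfs_metadata listomapkeys 602.00000000 > omapkeys.before
# ceph pg repair 5.3d0
# rados -p cephfs_metadata listomapkeys 602.00000000 > omapkeys.after
# diff omapkeys.before omapkeys.after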
Hope this is useful for others.

Cheers
Goncalo

[1] http://ceph.com/planet/ceph-manually-repair-object/
[2] http://tracker.ceph.com/issues/17177#change-78032
# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937 |