Recover pgs from cephfs metadata pool (sharing experience)

Hi All...

Just dropping a small email to share our experience on how to recover a pg from a cephfs metadata pool.

The reason why I am sharing this information is that the general understanding of how to recover a pg (check [1]) relies on identifying incorrect objects by comparing checksums between the different replicas. This procedure cannot be applied to inconsistent pgs in the cephfs metadata pool, because there all objects have zero size and the actual content is stored as omap key/value pairs in the OSDs' LevelDB.
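You can easily verify this yourself: the objects report a zero size, yet their omap is full of keys. For example (the pool name 'cephfs_metadata' is ours; adapt it to your setup):

# rados -p cephfs_metadata stat 602.00000000
# rados -p cephfs_metadata listomapkeys 602.00000000

The first command shows 'size 0', while the second lists the omap keys where the metadata actually lives.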

As a concrete example, some time ago we had the following error:

2016-08-30 00:30:53.492626 osd.78 192.231.127.171:6828/6072 331 : cluster [INF] 5.3d0 deep-scrub starts
2016-08-30 00:30:54.276134 osd.78 192.231.127.171:6828/6072 332 : cluster [ERR] 5.3d0 shard 78: soid 5:0bd6d154:::602.00000000:head omap_digest 0xf3fdfd0c != best guess omap_digest 0x23b2eae0 from auth shard 49
2016-08-30 00:30:54.747795 osd.78 192.231.127.171:6828/6072 333 : cluster [ERR] 5.3d0 deep-scrub 0 missing, 1 inconsistent objects
2016-08-30 00:30:54.747801 osd.78 192.231.127.171:6828/6072 334 : cluster [ERR] 5.3d0 deep-scrub 1 errors

The acting osds for this pg were [78,59,49] and 78 was the primary.
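If you need to find the acting set of a pg yourself, 'ceph pg map' prints it:

# ceph pg map 5.3d0

The first osd in the acting set is the primary.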

The error is telling us that the digest of the omap information on shard / osd 78 diverges from the one on shard / osd 49. The omap_digest is a CRC32 calculated over the omap header and key/values. Also please note that the log gives you a shard and an auth shard osd id. This is important for understanding how 'pg repair' works in this case.

Another useful way to understand what is divergent is to run 'rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat', which I think is available in Jewel releases (see the result of that command after this email). It tells you, in a nice and clean way, which object is problematic, what the source of the divergence is, and which osd holds the divergent copy. In our case, the tool confirms that there is an omap_digest_mismatch and that osd 78 is the one which differs from the other two. Please note that the information printed by the command comes from the last pg deep-scrub: if you live with the error for some time and your logs rotate, you may have to run a manual deep-scrub on the pg for the command to report useful information again.
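Triggering that manual deep-scrub is a one liner:

# ceph pg deep-scrub 5.3d0

and once the '5.3d0 deep-scrub starts' / 'deep-scrub ... errors' messages reappear in the cluster log, 'rados list-inconsistent-obj' will report useful output again.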

If you actually want to understand the source of our divergence, you can go through [2], where we found that osd.78 was missing ~500 keys (we are still in the process of understanding why that happened).
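For completeness, a sketch of how the per-replica key lists can be collected with ceph-objectstore-tool (the paths and service name below are illustrative, the exact invocation may vary with your release, and the osd must be stopped while the tool reads its store; on filestore you may also need --journal-path):

# systemctl stop ceph-osd@78
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78 \
      --pgid 5.3d0 602.00000000 list-omap > /tmp/omapkeys.osd78
# systemctl start ceph-osd@78

Repeating this on the hosts holding osds 49 and 59 and diffing the resulting files exposes exactly which keys differ.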

Our fear was that, as commonly mentioned in many forums, a pg repair would blindly push the primary osd's copy to its peers; since in our case the primary (osd.78) held the bad copy, that would have meant data corruption.

However, going through the code and with the help of Brad Hubbard from RH, we understood that a pg repair triggers a copy from the auth shard to the problematic shard. Please note that the auth shard may not be the primary osd. In our precise case, running a 'pg repair' resulted in an updated object on osd.78 (which is the primary osd), while the timestamps of the same object on the peers remained unchanged. We also collected the object's omap key list before and after the repair and checked that all the previously missing keys were now present. Again, if you want details, please check [2].
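For reference, the repair itself is just:

# ceph pg repair 5.3d0

and re-collecting the omap key list afterwards (e.g. with 'rados -p cephfs_metadata listomapkeys 602.00000000', pool name being ours) lets you confirm that the missing keys are back.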

Hope this is useful for others.

Cheers

Goncalo


[1] http://ceph.com/planet/ceph-manually-repair-object/

[2] http://tracker.ceph.com/issues/17177#change-78032


# rados list-inconsistent-obj 5.3d0 | /usr/bin/json_reformat
[
    {
        "object": {
            "name": "602.00000000",
            "nspace": "",
            "locator": "",
            "snap": "head"
        },
        "missing": false,
        "stat_err": false,
        "read_err": false,
        "data_digest_mismatch": false,
        "omap_digest_mismatch": true,
        "size_mismatch": false,
        "attr_mismatch": false,
        "shards": [
            {
                "osd": 49,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0xaa3fd281",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 59,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": false,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0xaa3fd281",
                "data_digest": "0xffffffff"
            },
            {
                "osd": 78,
                "missing": false,
                "read_error": false,
                "data_digest_mismatch": false,
                "omap_digest_mismatch": true,
                "size_mismatch": false,
                "data_digest_mismatch_oi": false,
                "omap_digest_mismatch_oi": false,
                "size_mismatch_oi": false,
                "size": 0,
                "omap_digest": "0x7600bd9e",
                "data_digest": "0xffffffff"
            }
        ]
    }
]

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
