Re: Major ceph disaster

We tried to export the shards from the OSDs, but there are only two shards left for each of the PGs, so we decided to give up on these PGs. Will the files of these PGs be deleted from the MDS, or do we have to delete them manually? Is this the correct command to mark the PGs as lost:

ceph pg {pg-id} mark_unfound_lost revert|delete
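
For reference, a minimal sketch of how this might be invoked (the pg id 2.1f5 is only a placeholder; list_missing shows the affected objects first, delete forgets the unfound objects, revert rolls back to a previous version):

ceph pg 2.1f5 list_missing
ceph pg 2.1f5 mark_unfound_lost delete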

Cheers,
Kevin

On 15.05.19 8:55 AM, Kevin Flöh wrote:
The HDDs of OSDs 4 and 23 are completely lost; we cannot access them in any way. Is it possible to use the shards that may still be stored on working OSDs, as shown in the all_participants list?

On 14.05.19 5:24 PM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 5:13 PM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
OK, so now we at least see a difference in the recovery state:

      "recovery_state": [
          {
              "name": "Started/Primary/Peering/Incomplete",
              "enter_time": "2019-05-14 14:15:15.650517",
              "comment": "not enough complete instances of this PG"
          },
          {
              "name": "Started/Primary/Peering",
              "enter_time": "2019-05-14 14:15:15.243756",
              "past_intervals": [
                  {
                      "first": "49767",
                      "last": "59580",
                      "all_participants": [
                          {
                              "osd": 2,
                              "shard": 0
                          },
                          {
                              "osd": 4,
                              "shard": 1
                          },
                          {
                              "osd": 23,
                              "shard": 2
                          },
                          {
                              "osd": 24,
                              "shard": 0
                          },
                          {
                              "osd": 72,
                              "shard": 1
                          },
                          {
                              "osd": 79,
                              "shard": 3
                          }
                      ],
                      "intervals": [
                          {
                              "first": "59562",
                              "last": "59563",
                              "acting": "4(1),24(0),79(3)"
                          },
                          {
                              "first": "59564",
                              "last": "59567",
                              "acting": "23(2),24(0),79(3)"
                          },
                          {
                              "first": "59570",
                              "last": "59574",
                              "acting": "4(1),23(2),79(3)"
                          },
                          {
                              "first": "59577",
                              "last": "59580",
                              "acting": "4(1),23(2),24(0)"
                          }
                      ]
                  }
              ],
              "probing_osds": [
                  "2(0)",
                  "4(1)",
                  "23(2)",
                  "24(0)",
                  "72(1)",
                  "79(3)"
              ],
              "down_osds_we_would_probe": [],
              "peering_blocked_by": []
          },
          {
              "name": "Started",
              "enter_time": "2019-05-14 14:15:15.243663"
          }
      ],

The peering does not seem to be blocked anymore, but there is still no
recovery going on. Is there anything else we can try?
What is the state of the HDDs which held OSDs 4 and 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import them into another operable OSD.
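
A minimal sketch of what that could look like (OSD ids, data paths, and the pg/shard id below are placeholders; for an EC pool the pgid includes the shard suffix, and the source OSD daemon must be stopped while exporting):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src-id> \
    --pgid <pgid>s<shard> --op export --file /tmp/<pgid>.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dst-id> \
    --op import --file /tmp/<pgid>.export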

-- dan




On 14.05.19 11:02 AM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
On 14.05.19 10:08 AM, Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:

On 13.05.19 10:51 PM, Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 OSDs each and use 3+1 erasure coding. [...]
Here is what happened: one OSD daemon could not be started, so we
decided to mark the OSD as lost and set it up from scratch. Ceph started
recovering, and then we lost another OSD with the same behavior. We did
the same as for the first OSD.

With 3+1 you can only tolerate a single OSD failure per PG at any given
time (each PG has 4 shards, and any 3 of them are needed to reconstruct
the data). With 4096 PGs and 96 OSDs, having 2 OSDs fail at the same time
on 2 separate servers (assuming standard CRUSH rules) is a death sentence
for the data on any PG that uses both of those OSDs and was not fully
recovered before the second failure.

OK, so the 2 OSDs (4, 23) failed shortly one after the other, but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic PGs were on both OSDs. We think that
we still have enough shards left. For one of the PGs, the recovery state
looks like this:

       "recovery_state": [
           {
               "name": "Started/Primary/Peering/Incomplete",
               "enter_time": "2019-05-09 16:11:48.625966",
               "comment": "not enough complete instances of this PG"
           },
           {
               "name": "Started/Primary/Peering",
               "enter_time": "2019-05-09 16:11:48.611171",
               "past_intervals": [
                   {
                       "first": "49767",
                       "last": "59313",
                       "all_participants": [
                           {
                               "osd": 2,
                               "shard": 0
                           },
                           {
                               "osd": 4,
                               "shard": 1
                           },
                           {
                               "osd": 23,
                               "shard": 2
                           },
                           {
                               "osd": 24,
                               "shard": 0
                           },
                           {
                               "osd": 72,
                               "shard": 1
                           },
                           {
                               "osd": 79,
                               "shard": 3
                           }
                       ],
                       "intervals": [
                           {
                               "first": "58860",
                               "last": "58861",
                               "acting": "4(1),24(0),79(3)"
                           },
                           {
                               "first": "58875",
                               "last": "58877",
                               "acting": "4(1),23(2),24(0)"
                           },
                           {
                               "first": "59002",
                               "last": "59009",
                               "acting": "4(1),23(2),79(3)"
                           },
                           {
                               "first": "59010",
                               "last": "59012",
                               "acting": "2(0),4(1),23(2),79(3)"
                           },
                           {
                               "first": "59197",
                               "last": "59233",
                               "acting": "23(2),24(0),79(3)"
                           },
                           {
                               "first": "59234",
                               "last": "59313",
                               "acting": "23(2),24(0),72(1),79(3)"
                           }
                       ]
                   }
               ],
               "probing_osds": [
                   "2(0)",
                   "4(1)",
                   "23(2)",
                   "24(0)",
                   "72(1)",
                   "79(3)"
               ],
               "down_osds_we_would_probe": [],
               "peering_blocked_by": [],
               "peering_blocked_by_detail": [
                   {
                       "detail": "peering_blocked_by_history_les_bound"
                   }
               ]
           },
           {
               "name": "Started",
               "enter_time": "2019-05-09 16:11:48.611121"
           }
       ],
Is there a chance to recover this PG from the shards on OSDs 2, 72, and 79?
ceph pg repair/deep-scrub/scrub did not work.

repair/scrub are not related to this problem so they won't help.

How exactly did you use the osd_find_best_info_ignore_history_les option?

One correct procedure would be to set it to true in ceph.conf, then
restart each of the probing_osds above.
(Once the PG has peered, you need to unset the option and restart
those OSDs again.)

We executed:

ceph --admin-daemon /var/run/ceph/ceph-osd.X.asok config set osd_find_best_info_ignore_history_les true

and then we restarted the affected OSDs. I guess this is doing the same, right?
No, that doesn't work. That only sets it in memory, but the option
is reset to the default when you restart the OSD.
You need to set it in ceph.conf on the OSD host.
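
A minimal sketch of that approach (the [osd] section placement and the systemd unit name are assumptions for a typical systemd-based install):

# /etc/ceph/ceph.conf on each host holding one of the probing OSDs
[osd]
osd_find_best_info_ignore_history_les = true

# then restart those OSDs, e.g.
systemctl restart ceph-osd@<id>

Once the PG has peered, remove the option again and restart the same OSDs.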

-- dan

We are also worried about the MDS being behind on trimming, or is this
not too problematic?

Trimming requires IO on the PGs, and the MDS is almost certainly stuck on
those incomplete PGs.
Solve the incomplete PGs first, and then address the MDS later if it doesn't
resolve itself.
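
For reference, the incomplete PGs can be listed with the standard status commands, e.g.:

ceph health detail
ceph pg dump_stuck inactive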


-- dan

OK, then we don't have to worry about this for now.


Best regards,

Kevin




MDS_TRIM 1 MDSs behind on trimming
       mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128)
max_segments: 128, num_segments: 46178


Depending on the data stored (CephFS?) you can probably recover most
of it, but some of it is irretrievably lost.

If you can recover the data from the failed OSDs as it was at the time they
failed, you might be able to recover some of your lost data (with the
help of the Ceph devs); if not, there is nothing to be done.

In the latter case I'd add a new server in order to use at least 3+2 for a
fresh pool instead of 3+1, and begin moving the data to it.
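
A minimal sketch of what creating such a pool could look like (the profile name, pool name, and pg counts are placeholders):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd pool create <new-pool> 1024 1024 erasure ec-3-2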

The 12.2 + 13.2 mix is a potential problem in addition to the one
above, but it's a different one.

Best regards,

Lionel

The idea for the future is to set up a new Ceph cluster with 3+2 erasure
coding on 8 servers in total, and of course with consistent versions on all nodes.


Best regards,

Kevin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
