This finally finished:
2017-10-24 22:50:11.766519 7f775e539bc0 1 scavenge_dentries: frag 607.00000000 is corrupt, overwriting
Events by type:
OPEN: 5640344
SESSION: 10
SUBTREEMAP: 8070
UPDATE: 1384964
Errors: 0
I truncated the journal:
# cephfs-journal-tool journal reset
old journal was 6255163020467~8616264519
new journal start will be 6263781982208 (2697222 bytes past old end)
writing journal head
writing EResetJournal entry
done
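(To double-check that the truncated journal is readable, the same tool has an inspect mode, something like:
# cephfs-journal-tool journal inspect
which I believe should report the overall journal integrity as OK.)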
I reset sessions:
# cephfs-table-tool all reset session
{
    "0": {
        "data": {},
        "result": 0
    }
}
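(The guide also describes resets for the other MDS tables, e.g.:
# cephfs-table-tool all reset snap
# cephfs-table-tool all reset inode
but so far I have only reset the session table.)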
I marked it repaired:
# ceph mds repaired 0
And I still got errors, as shown by ceph -w:
2017-10-25 00:02:08.929404 mds.0 [ERR] dir 607 object missing on disk; some files may be lost (~mds0/stray7)
2017-10-25 00:02:09.099472 mon.0 [INF] mds.0 172.16.31.1:6800/3462673422 down:damaged
2017-10-25 00:02:09.105643 mon.0 [INF] fsmap e121619: 0/1/1 up, 1 damaged
2017-10-25 00:02:10.182101 mon.0 [INF] mds.? 172.16.31.1:6809/2991612296 up:boot
2017-10-25 00:02:10.182189 mon.0 [INF] fsmap e121620: 0/1/1 up, 1 up:standby, 1 damaged
What should I do next? ceph fs reset igbhome scares me.
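(If I am reading the docs right, the full command would be something like:
# ceph fs reset igbhome --yes-i-really-mean-it
which is exactly why I am hesitant.)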
Dan
On 10/24/2017 09:25 PM, Daniel Davidson wrote:
Out of desperation, I started with the disaster recovery guide:
http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
After exporting the journal, I started doing:
# cephfs-journal-tool event recover_dentries summary
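(The export step, as in the guide, was something like:
# cephfs-journal-tool journal export backup.bin
where backup.bin is just the guide's example filename.)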
That was about 7 hours ago, and it is still running. I am getting a lot of messages like:
2017-10-24 21:24:10.910489 7f775e539bc0 1 scavenge_dentries: frag 607.00000000 is corrupt, overwriting
The frag number is the same in every line, and there have been thousands of them.
I really could use some assistance,
Dan
On 10/24/2017 12:14 PM, Daniel Davidson wrote:
Our Ceph system is having a problem.
A few days ago we had a pg that was marked as inconsistent, and today I fixed it with a:
# ceph pg repair 1.37c
Then an object was stuck as unfound, so I did a:
# ceph pg 1.37c mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking
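(If it matters, the unfound object could presumably have been listed first with something like:
# ceph pg 1.37c list_missing
per the troubleshooting docs.)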
That fixed the unfound object problem, and all the pgs went active+clean. A few minutes later, though, the FS seemed to pause and the MDS started giving errors.
# ceph -w
    cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
     health HEALTH_ERR
            mds rank 0 is damaged
            mds cluster is degraded
            noscrub,nodeep-scrub flag(s) set
     monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
            election epoch 652, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
      fsmap e121409: 0/1/1 up, 4 up:standby, 1 damaged
     osdmap e35220: 32 osds: 32 up, 32 in
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v28398840: 1536 pgs, 2 pools, 795 TB data, 329 Mobjects
            1595 TB used, 1024 TB / 2619 TB avail
                1536 active+clean
Looking at the logs when I try a:
# ceph mds repaired 0
I see:
2017-10-24 12:01:27.354271 mds.0 172.16.31.3:6801/1949050374 75 : cluster [ERR] dir 607 object missing on disk; some files may be lost (~mds0/stray7)
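(Assuming the metadata pool is simply named 'metadata', something like:
# rados -p metadata stat 607.00000000
should confirm whether that stray directory object is really gone.)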
Any ideas as to what to do next? I am stumped.
Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com