Re: MDS damaged

Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> · Wed, 11 Jul 2018 18:27:37 +0300



    Hi Gregory,
    thanks for the reply. I have the dump of the metadata pool, but
      I'm not sure what to check there. Is that what you meant?
    The cluster was operational until noon today, when a full
      restart of the daemons was issued, as has been done many times in
      the past. I issued the "repaired" command in the hope of getting a
      more meaningful error in the logs, but apparently that did not happen.
    Thanks,
    

        Alessandro

    
    On 11/07/18 18:22, Gregory Farnum wrote:

    
      Have you checked the actual journal objects as the
        "journal export" suggested? Did you identify any actual source
        of the damage before issuing the "repaired" command?
        What is the history of the filesystems on this cluster?
      
      
        On Wed, Jul 11, 2018 at 8:10 AM Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:

        
        Hi,

          
          after the upgrade to Luminous 12.2.6 today, all our MDSes have
          been marked as damaged. Trying to restart the instances only results
          in standby MDSes. We currently have 2 active filesystems with 2
          MDSes each.

          
          I found the following error messages in the mon:

          
          mds.0 <node1_IP>:6800/2412911269 down:damaged
          mds.1 <node2_IP>:6800/830539001 down:damaged
          mds.0 <node3_IP>:6800/4080298733 down:damaged

          
          Whenever I try to force the repaired state with
          ceph mds repaired <fs_name>:<rank>, I get something like this in the
          MDS logs:

          
          2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro) error getting journal off disk
          2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
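
For reference, the 0x201 in the error above appears to be the journal inode of rank 1. As far as I know, CephFS numbers each MDS journal inode as 0x200 plus the rank and names the journal's RADOS objects <inode in hex>.<8-digit hex object index>, with index 0 holding the header. That would also match the 200.00000000 header reported for rank 0 in the journal export output further down. A minimal sketch of that mapping, assuming the convention holds on this release (both function names are made up for illustration):

```shell
# Hypothetical helpers, not part of any Ceph tool.
# Assumption: MDS journal inode = 0x200 + rank, and journal objects are
# named <inode hex>.<8-digit hex object index>, index 0 being the header.

# Print the journal header object name for a given MDS rank.
journal_header_object() {
    printf '%x.%08x\n' $((512 + $1)) 0   # 512 == 0x200
}

# Print the rank that a journal inode such as 0x201 belongs to.
journal_rank_from_inode() {
    echo $(($1 - 512))
}
```

With these, something like rados -p cephfs_metadata stat "$(journal_header_object 1)" would be one way to check whether the header object for rank 1 exists at all. The pool name is taken from the rados export command later in this message and may differ for the second filesystem.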

          
          Any attempt to run the journal export results in errors like this one:

          
          cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
          Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
          2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
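
The object-by-object dump that the tool suggests could look roughly like the sketch below. It assumes that the journal objects of rank 0 all carry the 200. prefix seen in the error and that the metadata pool is called cephfs_metadata, as in the rados export command further down. It uses only rados ls and rados get; dump_journal_objects is a made-up helper name. Objects that fail to read are reported instead of aborting the loop, since the header is already known to be unreadable.

```shell
# Hypothetical helper: copy every journal object of one rank out of the
# metadata pool, one object at a time.
# Usage: dump_journal_objects <pool> <object-prefix> <output-dir>
dump_journal_objects() {
    pool="$1"; prefix="$2"; outdir="$3"
    mkdir -p "$outdir" || return 1
    failed=0
    # "rados ls" prints one object name per line; keep those of this journal.
    for obj in $(rados -p "$pool" ls | grep "^${prefix}\."); do
        if rados -p "$pool" get "$obj" "$outdir/$obj"; then
            echo "ok      $obj"
        else
            echo "FAILED  $obj"
            failed=$((failed + 1))
        fi
    done
    echo "objects that could not be read: $failed"
}

# Example (rank 0, as in the error above; needs a reachable cluster):
#   dump_journal_objects cephfs_metadata 200 ./journal-rank0
```

The list of FAILED objects would show whether only the header is affected or the damage extends to other journal objects.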

          
          The same happens with recover_dentries:

          
          cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
          Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
          Errors:
          0

          
          Is there anything I could try to get the cluster back?

          
          I was able to dump the contents of the metadata pool with
          rados export -p cephfs_metadata <filename> and I'm currently trying
          the procedure described in
          http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
          but I'm not sure if it will work as it's apparently doing nothing
          at the moment (maybe it's just very slow).

          
          Any help is appreciated, thanks!

          
               Alessandro

          
          _______________________________________________

          ceph-users mailing list

          ceph-users@xxxxxxxxxxxxxx

          http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

        