Re: After power outage, nearly all vm volumes corrupted and unmountable

Gary Molenkamp <molenkam@xxxxxx> · Fri, 6 Jul 2018 09:17:20 -0400



    Thank you, Jason. Not sure how I missed that step.

    
    On 2018-07-06 08:34 AM, Jason Dillaman wrote:

    
      There have been several similar reports on the mailing list
      [1][2][3][4], and they have always been the result of skipping
      step 6 of the Luminous upgrade guide [5]. The 'profile rbd'-style
      caps, new in Luminous, are designed to simplify caps going
      forward [6].
        

        TL;DR: your OpenStack CephX users need permission to blacklist
        dead clients that failed to properly release the exclusive lock.
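
As a rough illustration of that permission check (a sketch, not something taken from the thread): the helper below inspects a monitor caps string, such as the "caps mon" value printed by "ceph auth get client.<ID>", and reports whether it appears to allow blacklisting. The function name is hypothetical, and it only recognizes the two forms mentioned in the referenced docs: the Luminous 'profile rbd' cap and an explicit 'allow command "osd blacklist"' grant.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): succeed if a mon caps string
# appears to grant permission to blacklist other clients.
mon_caps_allow_blacklist() {
    case "$1" in
        # Assumption in this sketch: the read-only profile is not treated
        # as sufficient, so rule it out before the broader match below.
        *"profile rbd-read-only"*) return 1 ;;
        # Luminous-style profile; includes the blacklist permission.
        *"profile rbd"*) return 0 ;;
        # Explicit grant from step 6 of the Luminous upgrade guide.
        *'allow command "osd blacklist"'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with a literal string; "allow r" alone is the problem case.
if mon_caps_allow_blacklist 'allow r'; then
    echo "blacklist permitted"
else
    echo "blacklist permission missing"
fi
```

The check is purely textual, so it is only a first pass; the caps actually stored in the cluster remain the source of truth.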
          
            
            [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022278.html
            [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022694.html
            [3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026496.html
            [4] https://www.spinics.net/lists/ceph-users/msg45665.html
            [5] http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
            [6] http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
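
For reference, the cap update from step 6 of the upgrade guide [5] can be sketched as a dry run. The helper below only prints the command so it can be reviewed before anyone runs it; the function name, the client ID "cinder", and the OSD caps in the example call are assumptions, not values from this thread. The user's existing OSD caps should be passed through unchanged.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the "ceph auth caps" command that
# adds blacklist permission to a client's monitor caps.
blacklist_caps_cmd() {
    client_id=$1   # CephX user name without the "client." prefix
    osd_caps=$2    # the user's existing OSD caps, kept as they are
    printf "ceph auth caps client.%s mon 'allow r, allow command \"osd blacklist\"' osd '%s'\n" \
        "$client_id" "$osd_caps"
}

# Example call with assumed values; review the printed command before use.
blacklist_caps_cmd cinder 'allow rwx pool=volumes'
```

On Luminous, the documentation in [6] instead uses mon 'profile rbd' together with per-pool 'profile rbd pool=...' OSD caps, which covers the same blacklist permission.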
            

        On Fri, Jul 6, 2018 at 7:55 AM Gary Molenkamp <molenkam@xxxxxx> wrote:

        
        Good morning all,

          
          After losing all power to our DC last night due to a storm,
          nearly all of the volumes in our Pike cluster are unmountable.
          Of the 30 VMs in use at the time, only one has been able to
          successfully mount and boot from its rootfs. We are using Ceph
          as the backend storage to cinder and glance. Any help or
          pointers to bring this back online would be appreciated.

          
          Most of the volumes are seeing:

          
          [    2.622252] SGI XFS with ACLs, security attributes, no debug enabled
          [    2.629285] XFS (sda1): Mounting V5 Filesystem
          [    2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
          [    2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
          [    2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
          [    2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
          [    2.850146] blk_update_request: I/O error, dev sda, sector 8399897

          
          or

          
          [    2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
          [    2.594319] EXT4-fs (vda1): write access will be enabled during recovery
          [    2.957742] print_req_error: I/O error, dev vda, sector 227328
          [    2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
          [    2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
          [    2.973076] print_req_error: I/O error, dev vda, sector 229384

          
          As a test for one of the less critical VMs, I deleted the VM
          and mounted the volume on the one VM I managed to start. The
          results were not promising:

          
          # dmesg | tail
          [    5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
          [    7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
          [29374.967315] scsi 2:0:0:1: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
          [29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
          [29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
          [29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
          [29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
          [29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
          [29375.005968]  sdb: sdb1
          [29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk

          
          # parted /dev/sdb
          GNU Parted 3.1
          Using /dev/sdb
          Welcome to GNU Parted! Type 'help' to view a list of commands.
          (parted) p
          Model: QEMU QEMU HARDDISK (scsi)
          Disk /dev/sdb: 42.9GB
          Sector size (logical/physical): 512B/512B
          Partition Table: msdos
          Disk Flags:

          Number  Start   End     Size    Type     File system  Flags
           1      1049kB  42.9GB  42.9GB  primary  xfs          boot

          
          # mount -t xfs /dev/sdb temp
          mount: wrong fs type, bad option, bad superblock on /dev/sdb,
                 missing codepage or helper program, or other error

                 In some cases useful info is found in syslog - try
                 dmesg | tail or so.

          
          # xfs_repair /dev/sdb
          Phase 1 - find and verify superblock...
          bad primary superblock - bad magic number !!!

          attempting to find secondary superblock...
          
          Which eventually fails. The Ceph cluster looks healthy, and I
          can export the volumes from rbd. I can find no other errors in
          Ceph or OpenStack indicating a fault in either system.

          
               - Is this recoverable?

               - What happened to all of these volumes, and can this be
                 prevented from occurring again? Note that any VM that
                 was shut down at the time of the outage appears to be
                 fine.

          
          Relevant versions:

               Base OS:  all CentOS 7.5
               Ceph:  Luminous 12.2.5-0
               OpenStack:  Latest Pike releases in centos-release-openstack-pike-1-1
                   nova 16.1.4-1
                   cinder  11.1.1-1

          
          -- 
          Gary Molenkamp                  Computer Science/Science Technology Services
          Systems Administrator           University of Western Ontario
          molenkam@xxxxxx                 http://www.csd.uwo.ca
          (519) 661-2111 x86882           (519) 661-3566

          _______________________________________________
          ceph-users mailing list
          ceph-users@xxxxxxxxxxxxxx
          http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

        
      -- 

      
                Jason
              
            

  