Re: I trashed my superblocks after reshape from raid5 to raid6 stalled - need help recovering

On 1/29/21 1:37 AM, Patrik Dahlström wrote:
> Hello,
> 
> Logs and disk information are located at the end of this email. Please
> note that I also have a USB stick plugged into this computer that
> sometimes comes up as sda and sometimes sdi, which means that some of
> the collected data might be off-by-one (sda -> sdb, etc.).
> 
> I will try to be as thorough as possible in explaining what has happened,
> so as not to waste your time. The short version first:
> 
> * Start reshape of raid5 with 7 disks to raid6 with 8 disks
> * Reshape stalls
> * Panic
> * Fail to create overlays
> * Become overconfident
> * Overwrite superblock (wrongly) without overlays
> * Realize mistake
> * Stop
> * Get overlays working
> * Much hard thinking and experimenting with device mapper
> * Successfully mount raid volume by combining 2 overlay sets
> * Need help restoring array
> 
> If this is enough information, please skip to "Where I am now" below.
> For details on what I've written to my superblock, see "Frakking up".
> 
> Long version
> ============
> This story begins with a perfectly healthy raid5 array with 7 x 10 TB
> drives. Well, mostly healthy. I had started to see these lines pop up
> in my syslog:
> 
> Jan 21 18:01:06 rack-server-1 smartd[1586]: Device: /dev/sdb [SAT], 16 Currently unreadable (pending) sectors
> 
> Because of this, I started to become paranoid that I would lose data
> when replacing the bad drive. I decided I should add another 10 TB drive
> to the array and convert to raid6. These are the commands I used to kick
> off that conversion:
> 
> (mdadm 4.1 and Linux 4.15.0-132-generic)
> 
> $ sudo mdadm --add /dev/md0 /dev/sdg
> $ sudo mdadm --grow /dev/md0 --level=6 --raid-disk=8
> 
> This kicked off the reshape process successfully. A few days later, I
> started to notice I/O issues. More precisely: timeouts. It looked like
> the reshape process had stalled, and any kind of I/O to the raid mount
> point would also stall until some timeout error occurred. This was most
> likely caused by the bad block list (BBL), but I didn't know that at the
> time. At this point these messages started to show up in my kernel log:
> 
> Jan 20 21:55:06 rack-server-1 kernel: INFO: task md0_reshape:29278 blocked for more than 120 seconds.
> Jan 20 21:55:06 rack-server-1 kernel:       Tainted: G           OE    4.15.0-132-generic #136-Ubuntu
> Jan 20 21:55:06 rack-server-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 20 21:55:06 rack-server-1 kernel: md0_reshape     D    0 29278      2 0x80000000
> Jan 20 21:55:06 rack-server-1 kernel: Call Trace:
> Jan 20 21:55:06 rack-server-1 kernel:  __schedule+0x24e/0x880
> Jan 20 21:55:06 rack-server-1 kernel:  schedule+0x2c/0x80
> Jan 20 21:55:06 rack-server-1 kernel:  md_do_sync+0xdf1/0xfa0
> Jan 20 21:55:06 rack-server-1 kernel:  ? wait_woken+0x80/0x80
> Jan 20 21:55:06 rack-server-1 kernel:  ? __switch_to_asm+0x35/0x70
> Jan 20 21:55:06 rack-server-1 kernel:  md_thread+0x129/0x170
> Jan 20 21:55:06 rack-server-1 kernel:  ? md_seq_next+0x90/0x90
> Jan 20 21:55:06 rack-server-1 kernel:  ? md_thread+0x129/0x170
> Jan 20 21:55:06 rack-server-1 kernel:  kthread+0x121/0x140
> Jan 20 21:55:06 rack-server-1 kernel:  ? find_pers+0x70/0x70
> Jan 20 21:55:06 rack-server-1 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
> Jan 20 21:55:06 rack-server-1 kernel:  ret_from_fork+0x35/0x40
> 
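> (In hindsight: the per-device bad block lists can be inspected with
> "mdadm --examine-badblocks /dev/sdX"; I only made the BBL connection
> much later.)
> 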
> Other tasks and user processes started to block as well; almost anything
> I did would touch this mount point and stall. If I rebooted the server,
> it would stall during boot, when assembling the raid.
> 
> By removing all the drives, I was able to at least boot the server. I
> decided to update to Ubuntu 20.04 and try again - no dice. I still got
> blocked. I did notice that the reshape progressed a little bit every
> time I booted.
> 
> I figured I would revert the reshape and start from scratch, and I found
> out that there is something called "--assemble --update=revert-reshape":
> 
> (mdadm v4.1 and Linux-5.4.0-64-generic, USB stick is sda)
> 
> $ sudo mdadm --detail /dev/md0
> /dev/md0:         
>            Version : 1.2
>      Creation Time : Sat Apr 29 16:21:11 2017
>         Raid Level : raid6
>         Array Size : 58597880832 (55883.29 GiB 60004.23 GB)
>      Used Dev Size : 9766313472 (9313.88 GiB 10000.70 GB)
>       Raid Devices : 8
>      Total Devices : 8
>        Persistence : Superblock is persistent
> 
>      Intent Bitmap : Internal
> 
>        Update Time : Thu Jan 21 20:32:24 2021
>              State : clean, degraded, reshaping 
>     Active Devices : 7
>    Working Devices : 8
>     Failed Devices : 0
>      Spare Devices : 1
> 
>             Layout : left-symmetric-6
>         Chunk Size : 512K
> 
> Consistency Policy : bitmap
> 
>     Reshape Status : 59% complete
>         New Layout : left-symmetric
> 
>               Name : rack-server-1:1  (local to host rack-server-1)
>               UUID : 7f289c7a:570e2f7e:2ac6f909:03b3970f
>             Events : 728221
> 
>     Number   Major   Minor   RaidDevice State
>        8       8       48        0      active sync   /dev/sdd
>       13       8       64        1      active sync   /dev/sde
>       12       8       96        2      active sync   /dev/sdg
>        7       8      128        3      active sync   /dev/sdi
>       10       8       80        4      active sync   /dev/sdf
>        9       8       16        5      active sync   /dev/sdb
>       11       8       32        6      active sync   /dev/sdc
>       14       8      112        7      spare rebuilding   /dev/sdh
> $ sudo mdadm --stop /dev/md0
> $ sudo mdadm --assemble --update=revert-reshape /dev/md0
> 
> This did not do what I expected. Unfortunately, I forgot to save the
> output of "mdadm --detail /dev/md0" after the last command, but if I
> remember correctly it marked all my drives, except sdh, as faulty. I
> expected it to start going backwards in the reshape progress.
> 
> At this point, I saved away the output of these commands:
> 
> (mdadm v4.1 and Linux-5.4.0-64-generic, USB stick is sda)
> 
> $ sudo mdadm --examine /dev/sdb
> $ sudo mdadm --examine /dev/sdc
> $ sudo mdadm --examine /dev/sdd
> $ sudo mdadm --examine /dev/sde
> $ sudo mdadm --examine /dev/sdf
> $ sudo mdadm --examine /dev/sdg
> $ sudo mdadm --examine /dev/sdh
> $ sudo mdadm --examine /dev/sdi
> 
> (output located at the end)
> 
> Fail to create overlays
> =======================
> I realized that I needed to start using overlays before I messed things
> up even more. However, that was easier said than done. No matter what I
> did, I always got this error as a result of "dmsetup create":
> 
> Jan 21 21:19:10 rack-server-1 kernel: device-mapper: table: 253:1: snapshot: Cannot get origin device
> Jan 21 21:19:10 rack-server-1 kernel: device-mapper: ioctl: error adding target to table
> 
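> For reference, the recipe I was trying to follow (essentially the
> overlay_create helper from the linux-raid wiki) boils down to roughly
> this per member disk; the names and the sparse-file size here are just
> placeholders:
> 
> $ truncate -s10G overlay-sdX.img
> $ loop=$(sudo losetup -f --show overlay-sdX.img)
> $ size=$(sudo blockdev --getsz /dev/sdX)
> $ echo "0 $size snapshot /dev/sdX $loop P 8" | sudo dmsetup create overlay-sdX
> 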
> Frakking up
> ===========
> Now, what would be the sane thing to do when you can't create overlays?
> 
> Stop. Ask for help.
> 
> If this was a test of how I perform under pressure, I failed. After all,
> this wasn't my first time recovering from a failed reshape. Just search
> the mailing list for my name. I was confident in my abilities, and flew
> straight into the sun:
> 
> (mdadm v4.1 and Linux-5.4.0-64-generic, USB stick is sdi)
> 
> $ sudo mdadm --create --level=6 --raid-devices=8 --size=4883156736 /dev/md0 /dev/sdc /dev/sdd /dev/sdf /dev/sdh /dev/sde /dev/sda /dev/sdb missing
> 
> Notice the lack of "--assume-clean" and the wrong "--size" parameter,
> not to mention the missing "--data-offset", which my array needed since
> its offset was non-default.
> 
> This kicked off a rebuild of disk sdb (the last non-missing device).
> Fortunately, I realized my mistake within a few seconds - 39 seconds in
> fact, if my command history can be trusted - and stopped the array.
> 
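> (Back-of-the-envelope: 39 seconds of rebuild at a typical 150-250 MB/s
> sequential write rate means very roughly 6-10 GB near the start of that
> device were rewritten; the rate is an assumption, I have no exact figure.)
> 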
> What followed was a series of further attempts at re-creating the
> superblock with different parameters to "mdadm --create --assume-clean".
> The last one (I think) being:
> 
> (mdadm v4.1 and Linux-5.4.0-64-generic, USB stick is sdi)
> 
> $ sudo mdadm --create --assume-clean --level=6 --raid-devices=8 --size=9766313472 --data-offset=61440 /dev/md0 /dev/sdc /dev/sdd /dev/sdf /dev/sdh /dev/sde /dev/sda /dev/sdb missing
> 
> Running "fsck.ext4 -n /dev/md0" on this array would at least start.
> However, it would eventually reach a point where it started spewing a
> ton of errors. My guess is that this is where the reshape had stopped:
> beyond that point the data on disk is still in the old raid5 layout at
> the old data offset, so an array created with only the new geometry
> reads garbage from there on.
> 
> Getting overlays to work again
> ==============================
> 
> Although my command history has no memory of it, "journalctl" tells me
> that I rebooted my server one more time after I failed to create
> overlays. After that, the "overlay_create" and "overlay_remove"
> functions just worked. Every. <censored>. Time.
> 
> Once overlays were working, I got to work thinking hard and
> experimenting. Some experiments quickly grew the overlay files,
> and my storage space for them was only ~80 GB. I decided to
> scrap the newly added disk and re-use it as storage space for
> overlay files. In hindsight, I realize that I could have used
> the other 10 TB drive I had lying on the shelf below...
> 
> Where I am now
> ==============
> 
> I am able to mount my raid volume by creating 2 separate sets of overlay
> files, creating an array on each set, and then using device mapper in
> linear mode to "stitch together" the 2 arrays at the exact reshape position:
> 
> (mdadm v4.1 and Linux-5.4.0-64-generic)
> 
> $ sudo mdadm --create --assume-clean --level=6 --raid-devices=8 --size=9766313472 --layout=left-symmetric --data-offset=61440 /dev/md0 /dev/dm-3 /dev/dm-4 /dev/dm-6 /dev/dm-7 /dev/dm-5 /dev/dm-1 /dev/dm-2 missing
> $ sudo mdadm --create --assume-clean --level=6 --raid-devices=8 --size=9766313472 --layout=left-symmetric-6 --data-offset=123392 /dev/md1 /dev/dm-10 /dev/dm-11 /dev/dm-13 /dev/dm-14 /dev/dm-12 /dev/dm-8 /dev/dm-9 missing
> $ echo "0 69529589760 linear /dev/md0 0
> 69529589760 47666171904 linear /dev/md1 69529589760" | sudo dmsetup create joined
> $ sudo mount -o ro /dev/dm-15 /storage
> 
> The numbers are taken from the "mdadm -E <dev>" commands I ran earlier,
> only recalculated to fit the expected unit. The last drive in the array
> has been re-purposed as overlay storage.
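> 
> For the curious: each dmsetup line is "<start> <length> linear <device>
> <offset>", all in 512-byte sectors, whereas mdadm reports the reshape
> position and array size in KiB (as far as I can tell). So the reshape
> position of 34764794880 KiB becomes 34764794880 * 2 = 69529589760
> sectors, which is where the two arrays are joined, and 69529589760 +
> 47666171904 = 117195761664 sectors matches the 58597880832 KiB array
> size from "mdadm --detail".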
> 
> What now?
> =========
> 
> This is where I need some more help:
> * How can I resume the reshape or otherwise fix my array?
> * Is resuming a reshape something that would be a useful feature?
>   If so, I could look into adding support for it. Maybe used like this?
> 
>   # mdadm --create --assume-clean /dev/md0 <array definition>
>   # mdadm --manage /dev/md0 --grow --reshape-pos=<number> <grow params>
> 
> * Does wiping or overwriting the superblock also clear the BBL?
> * Is there any information missing?

Update! I've fully recovered from my mishaps and successfully reshaped
my array from raid5 to raid6!

I modified mdadm so that I could set the proper bits and values in the
superblocks when creating my array. These were my final commands to get
my array running again:

$ sudo ./mdadm --create --assume-clean --level=6 --raid-devices=8 --data-offset=61440 --layout=left-symmetric --size=9766313472 --reshape-position=34764794880 --new-data-offset=246784 --new-layout=left-symmetric-6 /dev/md0 /dev/sdc /dev/sdd /dev/sdf /dev/sdh /dev/sde /dev/sda /dev/sdb missing
$ sudo mdadm --stop /dev/md0
$ sudo ./mdadm --assemble --update="revert-reshape" /dev/md0 /dev/sdc /dev/sdd /dev/sdf /dev/sdh /dev/sde /dev/sda /dev/sdb
$ sudo mount -o ro /dev/md0 /storage

This resumed my array reshape from raid5 to raid6. Once that had
completed, I added the final 8th disk and let it rebuild. The added
flags are "--reshape-position", "--new-data-offset", and "--new-layout".
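
As far as I can tell, these let --create pre-populate the fields that
"mdadm --examine" shows as "Reshape pos'n", "New Layout" and "New Offset"
on a superblock that is mid-reshape, so the kernel can pick up where the
old array left off.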

Are these flags something that would be considered useful for mdadm?
If so, I could clean up the patches a bit and post them.

// Patrik


