Re: Unable to mount an ext4 RAID6 array

Nathan Peterson <nathan@xxxxxxxxxxx> · Wed, 16 Jan 2019 10:48:46 -0800

Hello,

Long overdue update.  I confirmed (thanks to Ted) that it was indeed a
hardware issue.  Long story short, that issue is resolved and I am now
able to run e2fsck.

The next issue I ran into was a lack of swap space, which caused
e2fsck to fail partway through the check (as expected).

I have resolved this (so far) by increasing the swapfile size to 50GB.
The command I ran is

    sudo e2fsck -y -C 0 /dev/mapper/enc6

and it has been running for 38 days straight.  Swap usage is currently
at 13.2GB and growing.
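
For keeping an eye on the swap figure over time, here is a minimal
sketch, assuming a Linux /proc/meminfo with the usual SwapTotal and
SwapFree lines (both reported in kB).  The function name swap_used_kib
is only illustrative.

```shell
# Print swap currently in use, in KiB, computed as SwapTotal - SwapFree.
# Reads a meminfo-format file; defaults to /proc/meminfo.
swap_used_kib() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {free = $2}
         END           {print total - free}' "${1:-/proc/meminfo}"
}

# Print the current value if this system exposes /proc/meminfo.
if [ -r /proc/meminfo ]; then
    swap_used_kib
fi
```

Running it every few hours and logging the result would show how fast
swap usage is really growing.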

           Version : 1.2
     Creation Time : Sun Nov 26 23:03:26 2017
        Raid Level : raid6
        Array Size : 42975741952 (40984.86 GiB 44007.16 GB)
     Used Dev Size : 3906885632 (3725.90 GiB 4000.65 GB)
      Raid Devices : 13
     Total Devices : 13
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Jan  6 09:21:27 2019
             State : clean
    Active Devices : 13
   Working Devices : 13
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

ps -eo comm,tty | grep fsck
e2fsck          ?

ps -ef | grep fsck
root      1890     1  0  2018 ?        00:00:00 sudo e2fsck -y -C 0 /dev/mapper/enc6
root      1891  1890  0  2018 ?        02:01:24 e2fsck -y -C 0 /dev/mapper/enc6

These show up in the dmesg log, but only rarely:
[Jan16 00:14] INFO: task mandb:25013 blocked for more than 120 seconds.
[  +0.000001]       Tainted: G           OE    4.15.0-42-generic #45-Ubuntu
[  +0.000001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000000] mandb           D    0 25013  25009 0x00000000
[  +0.000002] Call Trace:
[  +0.000005]  __schedule+0x291/0x8a0
[  +0.000002]  ? blk_queue_bio+0x32a/0x450
[  +0.000002]  ? bit_wait+0x60/0x60
[  +0.000001]  schedule+0x2c/0x80
[  +0.000002]  io_schedule+0x16/0x40
[  +0.000001]  bit_wait_io+0x11/0x60
[  +0.000001]  __wait_on_bit+0x4c/0x90
[  +0.000001]  ? submit_bio+0x73/0x140
[  +0.000001]  out_of_line_wait_on_bit+0x90/0xb0
[  +0.000003]  ? bit_waitqueue+0x40/0x40
[  +0.000001]  __wait_on_buffer+0x32/0x40
[  +0.000003]  __ext4_get_inode_loc+0x1b5/0x410
[  +0.000001]  ext4_iget+0x92/0xb90
[  +0.000002]  ? legitimize_path.isra.28+0x2e/0x60
[  +0.000001]  ext4_iget_normal+0x30/0x40
[  +0.000002]  ext4_lookup+0xf0/0x210
[  +0.000001]  path_openat+0xd30/0x1770
[  +0.000001]  ? pipe_wait+0xc0/0xc0
[  +0.000002]  do_filp_open+0x9b/0x110
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000002]  ? __check_object_size+0xaf/0x1b0
[  +0.000002]  ? __alloc_fd+0x46/0x170
[  +0.000002]  do_sys_open+0x1bb/0x2c0
[  +0.000001]  ? do_sys_open+0x1bb/0x2c0
[  +0.000002]  ? __put_cred+0x3d/0x50
[  +0.000001]  ? SyS_access+0x13d/0x230
[  +0.000002]  SyS_openat+0x14/0x20
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000002] RIP: 0033:0x7f28799c9cdd
[  +0.000000] RSP: 002b:00007ffcf9ce33c8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[  +0.000001] RAX: ffffffffffffffda RBX: 00007ffcf9ce3670 RCX: 00007f28799c9cdd
[  +0.000001] RDX: 0000000000080000 RSI: 00007ffcf9ce3450 RDI: 00000000ffffff9c
[  +0.000001] RBP: 00007ffcf9ce3430 R08: 0000000000000000 R09: 00007ffcf9ce365f
[  +0.000000] R10: 0000000000000000 R11: 0000000000000287 R12: 0000000000000007
[  +0.000001] R13: 0000000000000000 R14: 00007ffcf9ce3450 R15: 0000000000000000

My question: is it possible to see the progress of the check, or at
least to know that it is getting somewhere?
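
As a rough way to tell whether the process is at least still doing
disk I/O, here is a sketch that samples the read_bytes counter in
/proc/<pid>/io twice.  It assumes a kernel with per-task I/O
accounting enabled and has to be run as root to read another user's
process.  The function names io_read_bytes and io_delta are
illustrative only.

```shell
# Print the read_bytes counter from a /proc/<pid>/io-format file.
io_read_bytes() {
    awk '/^read_bytes:/ {print $2}' "$1"
}

# Print how many bytes a process read from storage over an interval.
# $1 = PID, $2 = seconds to wait between the two samples (default 60).
io_delta() {
    before=$(io_read_bytes "/proc/$1/io") || return 1
    sleep "${2:-60}"
    after=$(io_read_bytes "/proc/$1/io") || return 1
    echo "$((after - before))"
}
```

For the e2fsck process in the ps output above that would be

    io_delta 1891 60

A non-zero result only shows that e2fsck is still reading from disk;
it says nothing about how far along the check is.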

Thanks
-Nathan

On Thu, Oct 18, 2018 at 5:18 PM Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
>
> Hi,
>
> Sorry I didn't get back to you sooner.  This e-mail thread got lost in
> my inbox, so thanks for pinging me about it.
>
> These lines in the logs clearly show that it is a hardware problem.
> It could be an issue with the SATA controller, or cables, or even
> something in the motherboard.
>
> [  +0.000006] ata1: irq_stat 0x00400040, connection status changed
> [  +0.000004] ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [  +0.000005] ata1: hard resetting link
> [  +5.634542] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  +0.001809] ata1.00: configured for UDMA/133
> [  +0.000003] ata1: EH complete
> [Sep13 19:47] ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800
>
> The following article (found via Google) on Serverfault might be
> helpful:
>
> https://serverfault.com/questions/749433/hard-resetting-link-exception-emask-0x50-sact-0x0-serr-0x4090800-action-0xe-froz
>
> Good luck,
>
>                                         - Ted