problem with recovered array

This is on Fedora 38 (F38).

I know this is a bit long, but I wanted to provide as much detail as seemed necessary.

I have a 7-member raid6. The other day I needed to send a disk away for replacement.
I have done this before and everything looked fine. The array is now running degraded until the replacement disk arrives.

At one point my system got into trouble, and I am not sure why: it started responding
very slowly to opening/closing files, and even to keystrokes. In the end I decided to reboot.
It refused to complete the shutdown, and after a while I used the sysrq facility to force a reboot.

On restart it dropped into the emergency shell, with all the array members listed as spares.
I tried to '--run' the array but mdadm refused with 'cannot start dirty degraded array',
though the array was now listed in mdstat and looked as expected.

Since mdadm suggested I use '--force', I did so:
	mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1

Q0) Was I correct to use this command?
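
(For next time: I gather one should compare the event counters of the members before
forcing, with something like

	mdadm --examine /dev/sd{b,c,d,e,f,g}1 | grep -E 'Events|/dev/sd'

so that only a marginally-stale member gets forced back in.)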

2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127

Q1) What does this last line mean?

Even with the array assembled I still could not mount the fs (still in the emergency shell).
I rebooted and everything came up: the array was there, the fs was mounted, and so far
I have not noticed any issues with the fs.

However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
and it went very slowly (as in 10KB/s; the USB disk can do 120MB/s). The copy (rsync) was running
at 100% CPU, which is unexpected, so I stopped it. As a test, I rsync'ed the USB disk to another
SATA disk on the server and that went fast, so the USB disk is OK.

I then noticed (in 'top') that there is a kworker running at 100% CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 944760 root      20   0       0      0      0 R 100.0   0.0 164:00.85 kworker/u16:3+flush-9:127

It kept doing this for many hours and I do not know what it is doing.
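
I guess I could sample its kernel stack (as root) to see where it is spinning, something like:

	cat /proc/944760/stack

but I am not sure how to interpret what it shows.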

Q2) What does this worker do?

I also noticed that mdstat shows a high bitmap usage:

Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
      58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
      bitmap: 87/88 pages [348KB], 65536KB chunk

Q3) Is this OK? Should the usage go down? It does not change at all.
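
If it helps, I believe the bitmap can be dumped from a member with something like:

	mdadm --examine-bitmap /dev/sdb1

which should show how many bits are dirty.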

While looking into all this, I started iostat on md127 and I see that there is a constant
trickle of writes, about 5KB/s, even though there is no activity on this fs.
I also see similar activity on all the members, at the same rate, so md127 does not show
six times the members' activity. I guess this is just how md works?
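
For reference, I was watching with roughly:

	iostat -x 5 /dev/md127 /dev/sd[b-g]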

Q4) What is this write activity? Is it related to the above-mentioned 'requested-resync'?
	If this is a background thing, how can I monitor it?
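
I assume the places to look are the md sysfs files, something like:

	cat /sys/block/md127/md/sync_action     # idle/resync/recover/check/repair
	cat /sys/block/md127/md/sync_completed  # sectors done / total, or 'none'
	watch -n 5 cat /proc/mdstat

but please correct me if there is a better way.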

Q5) Finally, will the array come up (degraded) if I reboot, or will I need to coerce it to start?
	What is the correct way to bring up a degraded array? And what about the 'dirty' part?
'mdadm -D /dev/md127' mentions 'sync':
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       8       8       17        1      active sync   /dev/sdb1
       9       8       33        2      active sync   /dev/sdc1
       7       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1
       6       8       97        6      active sync   /dev/sdg1
Is this related?
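
From the kernel md documentation I gather there is a boot parameter for exactly this case:

	md-mod.start_dirty_degraded=1

though I would rather understand why the array was marked dirty in the first place.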

BTW I plan to run a 'check' at some point.
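
As I understand it, that is done via sysfs, something like:

	echo check > /sys/block/md127/md/sync_action

and afterwards the result can be read from

	cat /sys/block/md127/md/mismatch_cnt

Please correct me if that is not the recommended way.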

TIA

--
Eyal at Home (eyal@xxxxxxxxxxxxxx)


