On 5/12/20 6:07 PM, Piergiorgio Sartor wrote:
On Mon, May 11, 2020 at 11:07:31PM +0200, Guoqing Jiang wrote:
On 5/11/20 6:14 PM, Piergiorgio Sartor wrote:
On Mon, May 11, 2020 at 10:58:07AM +0200, Guoqing Jiang wrote:
Hi Wolfgang,
On 5/11/20 8:40 AM, Wolfgang Denk wrote:
Dear Guoqing Jiang,
In message<2cf55e5f-bdfb-9fef-6255-151e049ac0a1@xxxxxxxxxxxxxxx> you wrote:
Seems raid6check is in 'D' state. What is the output of 'cat
/proc/19719/stack' and /proc/mdstat?
# for i in 1 2 3 4 ; do cat /proc/19719/stack; sleep 2; echo ; done
[<0>] __wait_rcu_gp+0x10d/0x110
[<0>] synchronize_rcu+0x47/0x50
[<0>] mddev_suspend+0x4a/0x140
[<0>] suspend_lo_store+0x50/0xa0
[<0>] md_attr_store+0x86/0xe0
[<0>] kernfs_fop_write+0xce/0x1b0
[<0>] vfs_write+0xb6/0x1a0
[<0>] ksys_write+0x4f/0xc0
[<0>] do_syscall_64+0x5b/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] __wait_rcu_gp+0x10d/0x110
[<0>] synchronize_rcu+0x47/0x50
[<0>] mddev_suspend+0x4a/0x140
[<0>] suspend_lo_store+0x50/0xa0
[<0>] md_attr_store+0x86/0xe0
[<0>] kernfs_fop_write+0xce/0x1b0
[<0>] vfs_write+0xb6/0x1a0
[<0>] ksys_write+0x4f/0xc0
[<0>] do_syscall_64+0x5b/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] __wait_rcu_gp+0x10d/0x110
[<0>] synchronize_rcu+0x47/0x50
[<0>] mddev_suspend+0x4a/0x140
[<0>] suspend_hi_store+0x44/0x90
[<0>] md_attr_store+0x86/0xe0
[<0>] kernfs_fop_write+0xce/0x1b0
[<0>] vfs_write+0xb6/0x1a0
[<0>] ksys_write+0x4f/0xc0
[<0>] do_syscall_64+0x5b/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] __wait_rcu_gp+0x10d/0x110
[<0>] synchronize_rcu+0x47/0x50
[<0>] mddev_suspend+0x4a/0x140
[<0>] suspend_hi_store+0x44/0x90
[<0>] md_attr_store+0x86/0xe0
[<0>] kernfs_fop_write+0xce/0x1b0
[<0>] vfs_write+0xb6/0x1a0
[<0>] ksys_write+0x4f/0xc0
[<0>] do_syscall_64+0x5b/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
It looks like raid6check keeps writing to the suspend_lo/hi nodes, which causes
mddev_suspend to be called; that means synchronize_rcu and other synchronization
mechanisms are triggered in that path ...
Interesting, why is it in ksys_write / vfs_write / kernfs_fop_write
all the time? I thought it was _reading_ the disks only?
I hadn't read raid6check before, but check_stripes has:

while (length > 0) {
        lock_stripe()        -> writes the suspend_lo/hi nodes
        ...
        unlock_all_stripes() -> writes the suspend_lo/hi nodes
}

I think that explains the stack of raid6check, and maybe that is just the way
raid6check works: lock a stripe, check the stripe, then unlock the stripe.
Just my guess ...
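For illustration, here is a minimal user-space sketch of what that lock/unlock
boils down to, as far as I can tell: each lock is a write to the
suspend_lo/suspend_hi sysfs nodes of the array, and every such write makes the
kernel go through mddev_suspend/mddev_resume. The device name, chunk size,
data-disk count and stripe number below are made-up example values, not taken
from the real raid6check source:

/*
 * Sketch only, not the actual raid6check code.  Assumes /dev/md0 with
 * 512KiB chunks and 4 data disks; suspend_lo/suspend_hi take offsets
 * in sectors, if I read the kernel side correctly.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_sysfs_num(const char *path, unsigned long long val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%llu\n", val);
        return fclose(f);
}

int main(void)
{
        const char *lo = "/sys/block/md0/md/suspend_lo";
        const char *hi = "/sys/block/md0/md/suspend_hi";
        unsigned long long chunk_sectors = 1024;   /* 512KiB chunk */
        unsigned long long data_disks = 4;
        unsigned long long stripe = 1234;          /* stripe to "lock" */
        unsigned long long start = stripe * chunk_sectors * data_disks;
        unsigned long long end = start + chunk_sectors * data_disks;

        /* "lock": suspend array I/O in [start, end); each of these writes
         * makes the kernel call mddev_suspend()/mddev_resume() */
        if (write_sysfs_num(lo, start) || write_sysfs_num(hi, end))
                return EXIT_FAILURE;

        /* ... read the stripe from the component devices, check P/Q ... */

        /* "unlock": shrink the suspended range back to nothing */
        if (write_sysfs_num(lo, 0) || write_sysfs_num(hi, 0))
                return EXIT_FAILURE;

        return EXIT_SUCCESS;
}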
Hi again!
I made a quick test.
I disabled the lock / unlock in raid6check.
With lock / unlock, I get around 1.2MB/sec
per device component, with ~13% CPU load.
Without lock / unlock, I get around 15.5MB/sec
per device component, with ~30% CPU load.
So, it seems the lock / unlock mechanism is
quite expensive.
Yes, since mddev_suspend/resume are triggered by the stripe lock/unlock.
I'm not sure what the best solution is, since
we still need to avoid race conditions.
I guess there are two possible ways:
1. Per your previous reply, only run raid6check when the array is read-only;
then we don't need the lock at all (a small sketch of this follows below).
2. Investigate whether it is possible to acquire a stripe lock in
suspend_lo/hi_store, to avoid the race between raid6check and writes to the
same stripe. IOW, try fine-grained protection instead of calling the expensive
suspend/resume from suspend_lo/hi_store. But I am not sure whether that is
doable right now.
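Just to make option 1 concrete: the array can be switched read-only before the
check (for example with "mdadm --readonly /dev/md0") and back afterwards. A
minimal sketch using the array_state sysfs node directly (device name assumed)
could look like this:

/*
 * Sketch only: freeze writes by setting the array read-only, run the
 * check without any stripe locking, then switch it back to active.
 * /dev/md0 is an assumption, and the first write can fail (e.g. EBUSY)
 * if the array cannot be made read-only at that moment.
 */
#include <stdio.h>

static int write_sysfs_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        const char *state = "/sys/block/md0/md/array_state";

        if (write_sysfs_str(state, "readonly"))
                return 1;   /* could not freeze writes, bail out */

        /* ... run the raid6 consistency check, no lock/unlock needed ... */

        return write_sysfs_str(state, "active") ? 1 : 0;
}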
Could you please elaborate on the
"fine grained protection" thing?
Even though raid6check checks and locks stripes one by one, things are
different in kernel space: locking one stripe triggers mddev_suspend and
mddev_resume, which affect all stripes ...
If the kernel could expose an interface to actually lock a single stripe, then
raid6check could use it to lock only that one stripe (this is what I call fine
grained) instead of triggering suspend/resume, which are time consuming.
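Purely hypothetical, just to illustrate the idea: the "stripe_lock" node below
is invented and does not exist in the md driver today. With such an interface
the user-space loop could stay the same, but each lock/unlock would only
quiesce one stripe instead of suspending and resuming the whole array:

/*
 * Hypothetical sketch: "stripe_lock" is an invented sysfs node, it does
 * NOT exist in md.  The idea: writing a stripe number locks just that
 * stripe, writing ~0 unlocks it.
 */
#include <stdio.h>

static int write_sysfs_num(const char *path, unsigned long long val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%llu\n", val);
        return fclose(f);
}

int main(void)
{
        const char *lock = "/sys/block/md0/md/stripe_lock"; /* hypothetical */
        unsigned long long stripe;

        for (stripe = 0; stripe < 1000; stripe++) {
                if (write_sysfs_num(lock, stripe))
                        return 1;
                /* ... check P/Q of this one stripe ... */
                if (write_sysfs_num(lock, ~0ULL))    /* unlock */
                        return 1;
        }
        return 0;
}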
BTW, it seems there are build problems for raid6check ...
mdadm$ make raid6check
gcc -Wall -Werror -Wstrict-prototypes -Wextra -Wno-unused-parameter
-Wimplicit-fallthrough=0 -O2 -DSendmail=\""/usr/sbin/sendmail -t"\"
-DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\"
-DMAP_DIR=\"/run/mdadm\" -DMAP_FILE=\"map\" -DMDMON_DIR=\"/run/mdadm\"
-DFAILED_SLOTS_DIR=\"/run/mdadm/failed-slots\" -DNO_COROSYNC -DNO_DLM
-DVERSION=\"4.1-74-g5cfb79d\" -DVERS_DATE="\"2020-04-27\"" -DUSE_PTHREADS
-DBINDIR=\"/sbin\" -o sysfs.o -c sysfs.c
gcc -O2 -o raid6check raid6check.o restripe.o sysfs.o maps.o lib.o
xmalloc.o dlink.o
sysfs.o: In function `sysfsline':
sysfs.c:(.text+0x2adb): undefined reference to `parse_uuid'
sysfs.c:(.text+0x2aee): undefined reference to `uuid_zero'
sysfs.c:(.text+0x2af5): undefined reference to `uuid_zero'
collect2: error: ld returned 1 exit status
Makefile:220: recipe for target 'raid6check' failed
make: *** [raid6check] Error 1
I cannot see this problem.
I could compile without issue.
Maybe some library is missing somewhere,
but I'm not sure where.
Did you try with the latest mdadm tree? But it could be an environment issue ...
Thanks,
Guoqing