Hi Xiao,

I don't have time to dig into this myself, but my guess would be that
the signal for one of the children comes too quickly, before the
sigprocmask() call in wait_for_zero_forks(). Seems like SIGCHLD should be
blocked before the first call to write_zeroes_fork().

I'm really not sure why I put in a block to SIGINT and then a block to
SIGCHLD after the processes started.

I suspect adding SIGCHLD to the sigprocmask in add_disks() and just
removing the sigprocmask in write_zeroes_fork() might fix the issue.

Thanks,

Logan

On 2024-05-21 01:05, Xiao Ni wrote:
> Hi Logan
>
> I'm trying to fix errors of mdadm regression failures. There is a
> failure in 00raid5-zero sometimes. I added some logs:
>
> In function write_zeroes_fork:
>                 if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
>                               offset_bytes, sz)) {
>                         pr_err("zeroing %s failed: %s\n", dv->devname,
>                                strerror(errno));
>                         ret = 1;
>                         break;
>                 } else
>                         printf("zeroing good\n");
>
> In function wait_for_zero_forks:
>                 if (fdsi.ssi_signo == SIGINT) {
>                         printf("\n");
>                         pr_info("Interrupting zeroing processes, please wait...\n");
>                         interrupted = true;
>                         break;
>                 } else if (fdsi.ssi_signo == SIGCHLD) {
>                         printf("one child finishes, wait count %d\n",
>                                wait_count);
>                         if (!--wait_count)
>                                 break;
>                 }
>
> while [ 1 ]; do
>     /usr/sbin/mdadm -CfR /dev/md0 -l 5 -n3 /dev/loop0 /dev/loop1 /dev/loop2 --write-zeroes --auto=yes -v
>     mdadm --wait /dev/md0
>     mdadm -Ss
>     sleep 1
> done
>
> zeroing good
> zeroing good
> zeroing good
> one child finishes, wait count 3
> one child finishes, wait count 2
>
> It looks like the parent process misses one child signal.
>
> root  174247  0.0  0.0  3628  2552 pts/0  S+  02:52  0:00  |  \_ /usr/sbin/mdadm -CfR /dev/md0 -l 5 -n3 /dev/loop0 /dev/loop1 /dev/loop2 --write-zeroes --auto=yes -v
> root  174248  0.0  0.0     0     0 pts/0  Z+  02:52  0:00  |      \_ [mdadm] <defunct>
> root  174249  0.0  0.0     0     0 pts/0  Z+  02:52  0:00  |      \_ [mdadm] <defunct>
> root  174250  0.0  0.0     0     0 pts/0  Z+  02:52  0:00  |      \_ [mdadm] <defunct>
>
> ]# cat /proc/174247/stack
> [<0>] signalfd_dequeue+0x14d/0x170
> [<0>] signalfd_read_iter+0x7b/0xd0
> [<0>] vfs_read+0x201/0x330
> [<0>] ksys_read+0x5f/0xe0
> [<0>] do_syscall_64+0x7b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Any ideas for this?
>
> Best Regards
> Xiao
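
For illustration, here is a minimal, untested-against-mdadm sketch of the
ordering suggested at the top of this message: block SIGCHLD (together with
SIGINT) before the first fork, then read the signals through a signalfd.
This is a standalone demo, not the mdadm code; spawn_and_wait() is a
hypothetical name, and the children simply exit where the real ones would
zero a disk. One design choice goes beyond the suggestion and is an
assumption of this sketch: because pending standard signals are not queued,
a single SIGCHLD may stand for several exited children, so the loop counts
children reaped by waitpid() instead of counting signals.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of mdadm): fork up to nchildren children
 * and wait for them. Returns the number of children reaped, or -1 if the
 * signal mask or the signalfd could not be set up.
 */
static int spawn_and_wait(int nchildren)
{
	struct signalfd_siginfo fdsi;
	sigset_t sigset, orig_sigset;
	int sfd, i, reaped = 0;

	/* Block SIGINT and SIGCHLD before any child exists. */
	sigemptyset(&sigset);
	sigaddset(&sigset, SIGINT);
	sigaddset(&sigset, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &sigset, &orig_sigset) < 0)
		return -1;

	for (i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			/* Only wait for the children that were started. */
			nchildren = i;
			break;
		}
		if (pid == 0)
			_exit(0);	/* child: stands in for the zeroing work */
	}

	sfd = signalfd(-1, &sigset, 0);
	if (sfd < 0) {
		sigprocmask(SIG_SETMASK, &orig_sigset, NULL);
		return -1;
	}

	while (reaped < nchildren) {
		ssize_t s = read(sfd, &fdsi, sizeof(fdsi));

		if (s != (ssize_t)sizeof(fdsi))
			break;
		if (fdsi.ssi_signo == SIGINT)
			break;	/* a real caller would stop the children here */

		/* One SIGCHLD may cover several exits: reap all that are ready. */
		while (reaped < nchildren && waitpid(-1, NULL, WNOHANG) > 0)
			reaped++;
	}

	close(sfd);
	sigprocmask(SIG_SETMASK, &orig_sigset, NULL);
	return reaped;
}
```

Calling spawn_and_wait(3) is expected to return 3 when all three forks
succeed, even if the children exit before the parent reaches read().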