On Thu, May 30, 2024 at 4:54 AM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>
> Hi Xiao,
>
> Sorry it took so long but I had a chance to dig into the bug today. It's
> not what I had originally thought, but I do have a solution.
>
> Turns out the problem is that multiple SIGCHLD signals can be coalesced
> into one signal if they happen at the same time between reads to the
> signalfd. This is just the way Linux works and I didn't account for it
> in the code.
>
> To fix this we need to wait for multiple potential children being
> completed after every SIGCHLD is received.
>
> I've made two patches which you can get from:
>
> https://github.com/lsgunth/mdadm/commits/write_zeros_sigbug/
>
> I tested it with several hundred runs of your test script and it seems
> to fix the problem. Please review and test for yourself.

Hi Logan,

Thanks very much. I've run the test more than 1000 times and it no
longer gets stuck.

>
> On 2024-05-22 20:05, Xiao Ni wrote:
> > I did a test in a simple c program.
>
> I made a similar test program to try it out and I think the reason it
> wasn't working for you was due to the coalescing and simply blocking
> solves the (now only theoretical) race at startup. Once the coalescing
> problem is fixed we still need to move the block earlier to fix the
> race. I've attached the code for that program if you want to try it out.

It's the same resolution as in the patches :) I tried it in my C program
and it worked well too. It is the coalescing problem. And yes, we need to
block the signal earlier (patch 2). But for patch 1, I still prefer the
name wstatus to wst.

>
> Thanks for finding and triaging the bug!
>
> Logan

Best Regards
Xiao
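
For anyone following the thread, here is a minimal sketch of the two points discussed above: block SIGCHLD before any child is forked, and after each read from the signalfd keep calling waitpid() with WNOHANG until no exited child is left, since one SIGCHLD may stand for several exits. This is not the code from the patches. The helper names (open_sigchld_fd, spawn_children, reap_children) are made up for illustration, and the children simply exit immediately.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block SIGCHLD and open a signalfd for it.
 * Call this before the first fork() so that no SIGCHLD can be
 * delivered the normal way before we start reading the signalfd. */
int open_sigchld_fd(void)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;
    return signalfd(-1, &mask, 0);
}

/* Hypothetical helper: fork nchildren children that exit at once.
 * Returns how many were actually started. */
int spawn_children(int nchildren)
{
    int started = 0;

    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);       /* child: nothing to do in this sketch */
        if (pid > 0)
            started++;
    }
    return started;
}

/* Wait until `remaining` children have been reaped.
 * Returns the number of signalfd reads that were needed, or -1 on error.
 * Because pending SIGCHLDs are coalesced, this can be smaller than the
 * number of children. */
int reap_children(int sfd, int remaining)
{
    int reads = 0;

    assert(sfd >= 0);
    while (remaining > 0) {
        struct signalfd_siginfo info;
        ssize_t n = read(sfd, &info, sizeof(info));

        if (n < 0 && errno == EINTR)
            continue;
        if (n != (ssize_t)sizeof(info))
            return -1;
        reads++;

        /* One SIGCHLD may cover several exited children: drain them all. */
        for (;;) {
            int wstatus;
            pid_t pid = waitpid(-1, &wstatus, WNOHANG);

            if (pid == 0)
                break;      /* children exist, but none has exited yet */
            if (pid < 0)    /* ECHILD: no children left at all */
                return errno == ECHILD ? reads : -1;
            remaining--;
        }
    }
    return reads;
}
```

Calling open_sigchld_fd() first, then spawn_children(n), then reap_children(sfd, n) should reap all n children even when fewer than n reads from the signalfd succeed. A version that calls waitpid() only once per read can block forever in read() when two exits were merged into one signal, which matches the hang described above.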