On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@xxxxxxxxxx> wrote: > > Hi, > > I'm able to reproduce the panic below by running two sets of ndctl > commands that actually serve legitimate purpose in parallel (unlike > the brute force experiment earlier), each set in a indefinite loop. > This time it takes about an hour to panic. But I gather the cause > is probably the same: I've overlapped ndctl commands on the same > region. > > Could we add a check in nd_ioctl(), such that if there is > an ongoing ndctl command on a region, subsequent ndctl request > will fail immediately with something to the effect of EAGAIN? > The rationale being that kernel should protect itself against > user mistakes. We do already have locking in the driver to prevent configuration collisions. The problem looks to be broken assumptions about running the device unregistration path in a separate thread outside the lock. I suspect it may be incorrect assumptions about the userspace visibility of the device relative to teardown actions. To be clear this isn't the nd_ioctl() path this is the sysfs path. > Also, sensing the subject fix is for a different problem, and has been > verified, I'm happy to see it in upstream, so we have a better > code base to digger deeper in terms of how the destructive ndctl > commands interacts to typical mission critical applications, include > but not limited to rdma. Right, the crash signature you are seeing looks unrelated to the issue being address in these patches which is device-teardown racing active page pins. I'll start the investigation on the crash signature, but again I don't think it reads on this fix series.