On Mon, Jun 29, 2020 at 9:22 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > On Sat, Jun 27, 2020 at 09:02:05PM +0800, Hillf Danton wrote:
> > > > > > So, to hit this syzkaller one of these must have happened:
> > > > > >  1) rdma_addr_cancel() didn't work and the process_one_work() is still
> > > > > >     runnable/running
> > > > >
> > > > > What syzbot reported indicates that the kworker did survive not only
> > > > > canceling work but the handler_mutex, despite it's a sync cancel that
> > > > > waits for the work to complete.
> > > >
> > > > The syzbot report doesn't confirm that the cancel work was actually
> > > > called.
> > > >
> > > > The most likely situation is that it was skipped because of the state
> > > > mangling the patch fixes.
> > > >
> > > > > >  2) The state changed away from RDMA_CM_ADDR_QUERY without doing
> > > > > >     rdma_addr_cancel()
> > > > >
> > > > > The cancel does cover the query state in the reported case, and have
> > > > > difficult time working out what's in the patch below preventing the
> > > > > work from going across the line the sync cancel draws. That's the
> > > > > question we can revisit once there is a reproducer available.
> > > >
> > > > rdma-cm never seems to get reproducers from syzkaller
> > >
> > > +syzkaller mailing list
> > >
> > > Hi Jason,
> > >
> > > Wonder if there is some systematic issue. Let me double check.
> >
> > By scanning bugs at:
> > https://syzkaller.appspot.com/upstream
> > https://syzkaller.appspot.com/upstream/fixed
> >
> > I found a significant number of bugs that I would qualify as "rdma-cm"
> > and that have reproducers.
> > Here is an incomplete list (I did not get to the end):
> >
> > https://syzkaller.appspot.com/bug?id=b8febdb3c7c8c1f1b606fb903cee66b21b2fd02f
> > https://syzkaller.appspot.com/bug?id=d5222b3e1659e0aea19df562c79f216515740daa
> > https://syzkaller.appspot.com/bug?id=c600e111223ce0a20e5f2fb4e9a4ebdff54d7fa6
> > https://syzkaller.appspot.com/bug?id=a9796acbdecc1b2ba927578917755899c63c48af
> > https://syzkaller.appspot.com/bug?id=95f89b8fb9fdc42e28ad586e657fea074e4e719b
> > https://syzkaller.appspot.com/bug?id=8dc0bcd9dd6ec915ba10b3354740eb420884acaa
> > https://syzkaller.appspot.com/bug?id=805ad726feb6910e35088ae7bbe61f4125e573b7
> > https://syzkaller.appspot.com/bug?id=56b60fb3340c5995373fe5b8eae9e8722a012fc4
> > https://syzkaller.appspot.com/bug?id=38d36d1b26b4299bf964d50af4d79688d39ab960
> > https://syzkaller.appspot.com/bug?id=25e00dd59f31783f233185cb60064b0ab645310f
> > https://syzkaller.appspot.com/bug?id=2f38d7e5312fdd0acc979c5e26ef2ef8f3370996
> >
> > Do you mean some specific subset of bugs by "rdma-cm"? If yes, what is
> > that subset?
>
> The race condition bugs never seem to get reproducers, I checked a few
> of the above and these are much more deterministic things.
>
> I think the recurrence rate for the races is probably too low?

Yes, it may well depend on probability. Whether a bug gets a reproducer
usually correlates significantly with the number of crashes. This bug
happened only once, which usually means either a very hard-to-trigger
race condition or memory corruption induced by an earlier bug.

For harder-to-trigger race conditions, KCSAN (the data race detector)
may help in the future. However, the kernel currently has too many data
races to report them all to mailing lists:
https://syzkaller.appspot.com/upstream?manager=ci2-upstream-kcsan-gce

That said, some race conditions are manageable to trigger, and I think
we have hundreds of race conditions with reproducers on the dashboard.
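
To make the probability argument concrete, here is a rough sketch. It
assumes each fuzzer run hits the race independently with the same fixed
probability p, which is an idealization and not something syzkaller
measures. The function names are made up for this illustration.

```python
import math


def reproduction_probability(p_per_run: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts hits the race.

    Assumption: every run triggers the race with the same probability
    `p_per_run`, independently of the others.
    """
    return 1.0 - (1.0 - p_per_run) ** runs


def runs_needed(p_per_run: float, confidence: float) -> int:
    """Smallest number of runs with at least `confidence` chance of one hit."""
    # Solve 1 - (1 - p)^n >= confidence for n; log1p keeps precision for tiny p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_per_run))


if __name__ == "__main__":
    # Hypothetical per-run hit rates, from "easy" to "seen once" territory.
    for p in (1e-2, 1e-4, 1e-6):
        print(f"p={p:g}: runs for 95% chance of a hit = {runs_needed(p, 0.95)}")
```

Under these assumptions a race that fires once per million runs needs
roughly three million attempts for a 95% chance of a single hit, which is
why a crash seen only once rarely ends up with a reproducer.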