Trond, I'm looking at a commit from 2005: 0f9dc2b16884b ("RPC: Clean up socket autodisconnect") Cancel autodisconnect requests inside xprt_transmit() in order to avoid races. Use more efficient del_singleshot_timer_sync() I'm working on adding a "shutdown" state to timers, making it required for freeing the timer. This is to address the numerous bugs we hit where timers get rearmed just before freeing and then cause a crash in the timer code, without knowing what timer it was that caused it. Having a specific shutdown state for timers will remove this problem because if something tries to rearm a shutdown timer, it will fail and a WARN_ON_ONCE() is triggered. See below in the "reply" part for a description of this effort. The reason for this email, is because that WARN_ON_ONCE() triggered on the mod_timer() from: static void xprt_schedule_autodisconnect(struct rpc_xprt *xprt) __must_hold(&xprt->transport_lock) { xprt->last_used = jiffies; if (RB_EMPTY_ROOT(&xprt->recv_queue) && xprt_has_timer(xprt)) mod_timer(&xprt->timer, xprt->last_used + xprt->idle_timeout); } That's because xptr->timer was shutdown due to: int xprt_request_enqueue_receive(struct rpc_task *task) { [..] /* Turn off autodisconnect */ del_singleshot_timer_sync(&xprt->timer); return 0; } Now singleshot means just that. It's a single shot and calling del_singleshot_timer_sync() will shut it down so that it can be freed. That also means that it can no longer be re-armed. I'm not sure what you meant by "Use more efficient del_singleshot_timer_sync()" but I'm guessing since that was written in 2005, it is no longer relevant, and del_timer_sync() should now be used. After replacing that with del_timer_sync(), the warning goes away. I just want to confirm that's OK with you. Thanks, -- Steve On Thu, 27 Oct 2022 11:05:25 -0400 Steven Rostedt <rostedt@xxxxxxxxxxx> wrote: > Back in April, I posted an RFC patch set to help mitigate a common issue > where a timer gets armed just before it is freed, and when the timer > goes off, it crashes in the timer code without any evidence of who the > culprit was. I got side tracked and never finished up on that patch set. > Since this type of crash is still our #1 crash we are seeing in the field, > it has become a priority again to finish it. > > This is v2 of that patch set. Thomas Gleixner posted an untested version > that makes timer->function NULL as the flag that it is shutdown. I took that > code, tested it (fixed it up), added more comments, and changed the > name to del_timer_shutdown() as Linus had asked. I also converted it to use > WARN_ON_ONCE() instead of just WARN_ON() as Linus asked for that too. > > (Thomas, you never added a SoB, so I only added a link to your email > in that commit. But as this will likely go through your tree anyway, > I'm sure you'll have your SoB on all these). > > I then created a trivial coccinelle script to find where del_timer*() > is called before being freed, and converted them all to del_timer_shutdown() > (There was a couple that still used del_timer() instead of del_timer_sync()). > > I also updated DEBUG_OBJECTS_TIMERS to check from where the timer is ever > armed, to calling of del_timer_shutdown(), and it will trigger if a timer > is freed in between. The current way is to only check if the timer is armed, > but that means it only triggers if the race condition is hit, and with > experience, it's not run on enough machines to catch all of them. By triggering > it from the time the timer is armed to the time it is shutdown, it catches > all potential cases even if the race condition is not hit. > > I went though the result of the cocinelle script, and updated the locations. > Some locations were caught by DEBUG_OBJECTS_TIMERS as the coccinelle script > only checked for timers being freed in the same function as the del_timer*(). > > V1 is found here: https://lore.kernel.org/all/20220407161745.7d6754b3@xxxxxxxxxxxxxxxxxx/ > > Here's the original text of that version: > > [ > This is an RFC patch. As we hit a few bugs were del_timer() is called > instead of del_timer_sync() before the timer is freed, and there could > be bugs where even del_timer_sync() is used, but the timer gets rearmed, > I decided to introduce a "del_timer_free()" function that can be used > instead. This will at least educate developers on what to call before they > free a structure that holds a timer. > > In this RFC, I modified hci_qca.c as a use case, even though that change > needs some work, because the workqueue could still rearm it (I'm looking > to see if I can trigger the warning). > > If this approach is acceptable, then I will remove the hci_qca.c portion > from this patch, and create a series of patches to use the > del_timer_free() in all the locations in the kernel that remove the timer > before freeing. > ] > > We are hitting a common bug were a timer is being triggered after it is > freed. This causes a corruption in the timer link list and crashes the > kernel. Unfortunately it is not easy to know what timer it was that was > freed. Looking at the code, it appears that there are several cases that > del_timer() is used when del_timer_sync() should have been. > > Add a del_timer_free() that not only does a del_timer_sync() but will mark > the timer as freed in case it gets rearmed, it will trigger a WARN_ON. The > del_timer_free() is more likely to be used by developers that are about to > free a timer, then using del_timer_sync() as the latter is not as obvious > to being needed for freeing. Having the word "free" in the name of the > function will hopefully help developers know that that function needs to > be called before freeing. > > The added bonus is the marking of the timer as being freed such that it > will trigger a warning if it gets rearmed. At least that way if the system > crashes on a freed timer, at least we may see which timer it was that was > freed. >