On Wed, 2021-04-28 at 08:39 +0200, Maurizio Lombardi wrote: > út 27. 4. 2021 v 22:02 odesílatel Martin Wilck <mwilck@xxxxxxxx> > napsal: > > The code doesn't use add_timer(), only mod_timer() and > > del_timer_sync(). And we didn't observe a crash upon add_timer(). > > What > > we observed was that a timer had been enqueued multiple times, and > > the > > kernel crashes in expire_timers()->detach_timer(), when it encounters > > an already detached entry in the timer list. > > How can a timer end up enqueued multiple times? I observed in a dump that this can happen. More precisely, I observed the following: [ 4389.978732] nvme nvme36: Successfully reconnected (1 attempt) ... [ 4441.445655] nvme nvme36: Successfully reconnected (1 attempt) ... [ 4510.891400] nvme nvme36: ANATT timeout, resetting controller. ... [ 4517.035411] general protection fault: 0000 [#1] SMP PTI (kernel crash) In the crash dump, I could see a anatt_timer belonging to nvme36 being pending in the timer list, with a remaining expiry time of 44s, suggesting it had been created around time 4441.4s. That means that at the time the ANATT timeout was seen (which corresponds to a timer added around 4389.9s), the timer must have been pending twice. (*) > It's safe to call mod_timer() against both an active or an inactive > timer > and mod_timer() is thread-safe also. > > IMO the problem is due to the fact that timer_setup() could end up > being called against > an active, pending timer. > timer_setup() doesn't take any lock and modifies the pprev pointer > and > the timer's flags Yes, that's what I think has happened. timer_setup() doesn't clear any pointers in the list of pending timers pointing to this entry. If the newly-initialized timer is then added with mod_timer(), it becomes linked in a second timer list. When the first one expires, the timer will be detached, but only from one of the lists it's pending in. In a scenario like the one we faced, this could actually happen multiple times. If the detached timer remains linked into a timer list, once that list is traversed, the kernel dereferences a pointer with value LIST_POISON2, and crashes. Anybody: correct me if this is nonsense. Best, Martin (*) Note that the crash had been caused by another wrongly linked anatt timer, not this one.