Apologies for the delayed reply, I was away for most of last week; On 16/04/24 17:01, Eric Dumazet wrote: > On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote: >> >> On 15/04/24 14:35, Eric Dumazet wrote: >> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote: >> >> v4 -> v5 >> >> ++++++++ >> >> >> >> o Rebased against latest Linus' tree >> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4 >> >> http://lore.kernel.org/r/20240411100536.224fa1e7@xxxxxxxxxx >> > >> > What was the issue again ? >> > >> > Please explain precisely why it was fundamentally tied to the use of >> > timers (and this was not possible to fix the issue without >> > adding work queues and more dependencies to TCP stack) >> >> In v4 I added the use of the ehash lock to serialize arming the timewait >> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()). >> >> Unfortunately, holding a lock both in a timer callback and in the context >> in which it is destroyed is invalid. AIUI the issue is as follows: >> >> CPUx CPUy >> spin_lock(foo); >> <timer fires> >> call_timer_fn() >> spin_lock(foo) // blocks >> timer_shutdown_sync() >> __timer_delete_sync() >> __try_to_del_timer_sync() // looped as long as timer is running >> <deadlock> >> >> In our case, we had in v4: >> >> inet_twsk_deschedule_put() >> spin_lock(ehash_lock); >> tw_timer_handler() >> inet_twsk_kill() >> spin_lock(ehash_lock); >> __inet_twsk_kill(); >> timer_shutdown_sync(&tw->tw_timer); >> >> The fix here is to move the timer deletion to a non-timer >> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue >> a work item, I converted it to a delayed_work. > > I do not like this delayed work approach. > > Adding more dependencies to the TCP stack is not very nice from a > maintainer point of view. > > Why couldn't you call timer_shutdown_sync() before grabbing the lock ? We need the timer_shutdown_sync() and mod_timer() of tw->tw_timer to be serialized in some way. If they aren't, we have the following race: tcp_time_wait() inet_twsk_hashdance() inet_twsk_deschedule_put() // Returns 0 because not pending, but prevents future arming timer_shutdown_sync() inet_twsk_schedule() // Returns 0 as if timer had been succesfully armed mod_timer() This means inet_twsk_deschedule_put() doesn't end up calling inet_twsk_kill() (because the timer wasn't pending when it got shutdown), but inet_twsk_schedule() doesn't arm it either despite the hashdance() having updated the refcounts. If we leave the deschedule as a del_timer_sync(), the timer ends up armed in inet_twsk_schedule(), but that means waiting for the timer to fire to clean up the resources despite having called inet_twsk_deschedule_put().