On 9/16/22 20:15, Alexey Klimov wrote:
Hi Wim/Guenter, For me it seems that there could be a potential race condition. I have to rely on watchdog_active(&streamlabs_wdt->wdt_dev) function which tests the WDOG_ACTIVE bit in struct watchdog_device->status member. The watchdog_dev changes the state of the device with ->start() or ->ping() and ->stop() methods and updates the WDOG_ACTIVE accordingly. In {pre,post}_reset methods here I have to change the state of the device from running to stopped and back to running conditionally, however WDOG_ACTIVE bit could be updated in between these callbacks execution or starting/stopping the device can race. For instance, I see the potential dangerous race like this: CPUX CPUY .. watchdog_stop() { .. if (wdd->ops->stop) { ... err = wdd->ops->stop(wdd) } usb_streamlabs_wdt_pre_reset() { if (watchdog_active()) stop_command(); /* WDOG_ACTIVE bit is still set ... here indicating that watchdog is } started, but ->stop() has already finished */ ... usb_streamlabs_wdt_post_reset() { if (watchdog_active()) start_command(); } ... /* WDOG_ACTIVE is updated here */ clear_bit(WDOG_ACTIVE, &wdd->status); } As a result, the watchdog subsystem "thinks" that watchdog is not active and should not be pinged. However, the driver observed using watchdog_active() that watchdog was active during {pre,post}_reset and restarted the device which will lead to unexpected reset. It is very unlikely race to happen but consequence is fatal. In other words, there are two independent paths leading to driver changing the state of the watchdog device and one path relies on status that can be changed by another path. Thinking about that I see the following approaches: 1. Introduce a variable in struct streamlabs_wdt that tracks the state of the watchdog device itself and checking/updating the state of a device happens under semaphore lock. Obviously, this "internal" to the driver state variable should be used in {pre,post}_reset. In case there will be other drivers (say, USB ones) they also need to implement this. or 2. The updates to wdd->status should happen under wd_data->lock. Currently, it is mutex-based. The acquiring and releasing the lock could be exported for the drivers to use. The mutex lock probably should be switched to a binary semaphore for that. In such case, in pre_reset() for example, I would need to do: static int pre_reset() { lock_wdd(); acquire_internal_driver_lock(); if (watchdog_active()) stop_command(); } static int post_reset() { if (watchdog_active()) start_command(); release_internal_driver_lock(); unlock_wdd(); } There should be an order that we have to acquire subsystem wdd lock first, then internal driver lock. Otherwise there could be deadlocks. This could be done if you think it's more wiser move. or 3. The {pre,post}_reset callbacks should execute watchdog_dev.c subsystem functions (not sure which functions exactly). Eventually, it will look similar to what is described in the previous point with respect to locks order. I meant something like this: static int pre_reset() { watchdog_dev_pre_reset_prepare(); } static int post_reset() { watchdog_dev_post_reset_done(); } In watchdog_dev.c: void watchdog_dev_pre_reset_prepare() { mutex_lock(&wd_data->lock); <-- should be changed to semaphore too? watchdog_stop(wdd); <-- without updating WDOG_ACTIVE bit? or there should be a way to indicate to watchdog_dev_post_reset_done() if watchdog should be started or not } void watchdog_dev_post_reset_done() { if (watchdog_active()) watchdog_start(wdd); mutex_unlock(&wd_data->lock); } I didn't really thought about point 3 yet. For me personally the point 2 seems the like right way to go but you have more experience with that. The exported locks could be re-used by other drivers if needed in future. In case of point 1 each USB driver should deal with {pre,post}_reset by themselves. Any thoughts?
Please go with 1). pre_reset/post_reset functionality is a first in the watchdog subsystem and the first to require locking outside the scope of a function or set of functions. I'd rather avoid having to deal with the potential consequences in the watchdog core. We can do that if/when it becomes more common and after we have a good understanding of the potential consequences. Thanks, Guenter
Thanks, Alexey