Re: [PATCH v6] watchdog: add driver for StreamLabs USB watchdog device

Alexey Klimov <klimov.linux@xxxxxxxxx> · Sat, 17 Sep 2022 04:15:02 +0100

Hi Wim/Guenter,

For me it seems that there could be a potential race condition. I have to rely
on watchdog_active(&streamlabs_wdt->wdt_dev) function which tests the WDOG_ACTIVE
bit in struct watchdog_device->status member.
The watchdog_dev changes the state of the device with ->start() or ->ping() and
->stop() methods and updates the WDOG_ACTIVE accordingly.
In {pre,post}_reset methods here I have to change the state of the device from
running to stopped and back to running conditionally, however WDOG_ACTIVE bit
could be updated in between these callbacks execution or starting/stopping
the device can race.
For instance, I see the potential dangerous race like this:

	CPUX					CPUY

	..				watchdog_stop() {
	..					if (wdd->ops->stop) {
							...
							err = wdd->ops->stop(wdd)
						}
usb_streamlabs_wdt_pre_reset() {
	if (watchdog_active())
		stop_command();			/* WDOG_ACTIVE bit is still set
	...					 here indicating that watchdog is
}						 started, but ->stop() has already
						 finished */
	...
usb_streamlabs_wdt_post_reset() {
	if (watchdog_active())
		start_command();
}
	...					/* WDOG_ACTIVE is updated here */
						clear_bit(WDOG_ACTIVE, &wdd->status);
					}

As a result, the watchdog subsystem "thinks" that watchdog is not active and should
not be pinged. However, the driver observed using watchdog_active() that watchdog
was active during {pre,post}_reset and restarted the device which will lead to
unexpected reset. It is very unlikely race to happen but consequence is fatal.
In other words, there are two independent paths leading to driver changing
the state of the watchdog device and one path relies on status that can be changed
by another path.

Thinking about that I see the following approaches:

1. Introduce a variable in struct streamlabs_wdt that tracks the state of the
watchdog device itself and checking/updating the state of a device happens under
semaphore lock.
Obviously, this "internal" to the driver state variable should be used in
{pre,post}_reset. In case there will be other drivers (say, USB ones) they also
need to implement this.

or

2. The updates to wdd->status should happen under wd_data->lock.
Currently, it is mutex-based. The acquiring and releasing the lock could be
exported for the drivers to use. The mutex lock probably should be switched
to a binary semaphore for that.

In such case, in pre_reset() for example, I would need to do:
static int pre_reset()
{
	lock_wdd();
	acquire_internal_driver_lock();

	if (watchdog_active())
		stop_command();
}

static int post_reset()
{

	if (watchdog_active())
		start_command();

	release_internal_driver_lock();
	unlock_wdd();
}

There should be an order that we have to acquire subsystem wdd lock first, then
internal driver lock. Otherwise there could be deadlocks.

This could be done if you think it's more wiser move.

or

3. The {pre,post}_reset callbacks should execute watchdog_dev.c subsystem functions
(not sure which functions exactly). Eventually, it will look similar to what is
described in the previous point with respect to locks order.
I meant something like this:

static int pre_reset()
{
	watchdog_dev_pre_reset_prepare();
}

static int post_reset()
{
	watchdog_dev_post_reset_done();
}

In watchdog_dev.c:
void watchdog_dev_pre_reset_prepare()
{
	mutex_lock(&wd_data->lock);	<-- should be changed to semaphore too?

	watchdog_stop(wdd);		<-- without updating WDOG_ACTIVE bit?
					 or there should be a way to indicate
					 to watchdog_dev_post_reset_done() if
					 watchdog should be started or not
}

void watchdog_dev_post_reset_done()
{
	if (watchdog_active())
		watchdog_start(wdd);

        mutex_unlock(&wd_data->lock);
}

I didn't really thought about point 3 yet. For me personally the point 2 seems
the like right way to go but you have more experience with that. The exported
locks could be re-used by other drivers if needed in future.
In case of point 1 each USB driver should deal with {pre,post}_reset by themselves.

Any thoughts?

Thanks,
Alexey