Re: [PATCH 1/2] HID: logitech-hidpp: Avoid hidpp_connect_event() running while probe() restarts IO

Hans de Goede <hdegoede@xxxxxxxxxx> · Fri, 6 Oct 2023 19:56:37 +0200

Hi Benjamin,

On 10/6/23 19:24, Hans de Goede wrote:
> Hi,
> 
> On 10/6/23 18:28, Benjamin Tissoires wrote:
>> On Oct 06 2023, Hans de Goede wrote:
> 
> <snip>
> 
>>>>> @@ -4207,36 +4208,39 @@ static void hidpp_connect_event(struct hidpp_device *hidpp)
>>>>>  		return;
>>>>>  	}
>>>>>  
>>>>> +	/* Avoid probe() restarting IO */
>>>>> +	mutex_lock(&hidpp->io_mutex);
>>>>
>>>> I'd put a `__must_hold(&hidpp->io_mutex);` here, not changing any return
>>>> path and forcing any caller to `hidpp_connect_event()` (which will
>>>> eventually only be the work struct) to take the lock.
>>>>
>>>> This should simplify the patch by a lot and also ensure someone doesn't
>>>> forget the `goto out_unlock`.
>>>
>>> Ok, I can add the __must_hold() here and make 
>>> delayed_Work_cb take the lock, but that would make it
>>> impossible to implement patch 2/2 in a clean manner and
>>> I do like patch 2/2 since it makes it clear that
>>> hidpp_connect_event must only run from the workqueue
>>> but I guess we could just add a comment for that
>>> instead.
>>
>> In 2/2, just rename this function to __do_hidpp_connect_event(), and
>> have hidpp_connect_event() being the worker, which takes the lock, and
>> calls __do_hidpp_connect_event().
> 
> Ok, will do for v2.
> 
> <snip>
> 
>>>>> @@ -4519,6 +4526,9 @@ static int hidpp_probe(struct hid_device *hdev, const struct hid_device_id *id)
>>>>>  	flush_work(&hidpp->work);
>>>>>  
>>>>>  	if (will_restart) {
>>>>> +		/* Avoid hidpp_connect_event() running while restarting */
>>>>> +		mutex_lock(&hidpp->io_mutex);
>>>>> +
>>>>>  		/* Reset the HID node state */
>>>>>  		hid_device_io_stop(hdev);
>>>>
>>>> That's the part that makes me raise an eyebrow. Because we lock, then
>>>> release the semaphore to get it back later. Can this induce a dead lock?
>>>>
>>>> Can't we solve that same scenario without a mutex, but forcing either
>>>> the workqueue to not run or to be finished at this point?
>>>
>>> I'm not sure what you are worried about after the mutex_lock
>>> the line above we are 100% guaranteed that hidpp_connect_event()
>>> is not running and since it is not running it will also not
>>> be holding any other locks, so it can not cause any problems.
>>
>> Agree, but my point is that you are not entirely solving the issue:
>> if now, between hid_device_io_stop() and hid_hw_close() we receive a
>> connect notification from the device, hid_input_report() will return
>> -EBUSY, and we will lose it (it will not be stacked in the workqueue).
>>
>> I was thinking at adding a flush_work(&hidpp->work) here, instead of
>> the mutex solution, but yours ensures that any connect event already
>> started will be handled properly, which is a plus.
>>
>> Still if between the mutex lock here we receive a connect event from the
>> device, we still get -EBUSY at the hid-core layer, and so we will lose
>> it. Maybe that's OK because we might re-ask for the device later (I
>> don't remember exactly the code), but my point is that because we add a
>> mutex doesn't mean we will solve all multi-thread problems. So finding a
>> non-mutex solution sometimes is better :)
>>
>> And the fact that we need to think through every preemption case often
>> means that there is something wrong *elsewhere*.
> 
> Right, I did consider seeing if we can get rid of the restart
> altogether, as the whole restarting thing is actually the problem
> here. AFAICT this is only really necessary in the WTP path since
> there where we need to know resolution before instantiating the
> input device.
> 
> But atm this is also done for all unifying devices, which seems
> unnecessary.
> 
> Buy we still need the restart anyways for the WTP case,
> so we need to make it work reliable anyways.
> 
> Now that I understand your concern about the missed connect
> packet, which I agree is a real concern I think I know how to
> fix this. I'll prepare a new version of this series tomorrow.
> 
> Hmm, thinking more about this, if we normally just create
> the input device right away even for unifying devices and
> we already always delay the creation for WTP even during
> the restart:
> 
>                 if (hidpp->quirks & HIDPP_QUIRK_DELAYED_INIT)
>                         connect_mask &= ~HID_CONNECT_HIDINPUT;
> 
> Then I wonder why do we even bother to do the restart
> thing for unifying devices. Do you know what this is based on ?
> 
> I guess this might have to do with ensuring the configure
> commands are send before creating the input + hidraw
> devices, but if the connect event comes later on then
> the configuration is already done later on after
> the input device has already been created ?
> 
> So maybe we should indeed just remove the whole restart
> thing entirely and also always rely on hidpp_connect_event
> to send the configuration commands, because currently
> those are send twice if the device is already connnected
> at probe() time.

The whole restart thing seems to have been added by you in:
91cf9a98ae4127 ("HID: logitech-hidpp: make .probe usbhid capable")

The commit message says that not starting all the HID
subdrivers right from the start is necessary because:

"It means that any configuration retrieved after the initial hid_hw_start
would not be exposed to user space without races."

I believe this refers to the filling of hdev->name
and hdev->uniq by hidpp_unifying_init()
(or other similar code paths for non unifying).

It seems that has been broken though by:

498ba20690357 ("HID: logitech-hidpp: Don't restart communication if not necessary")

Which causes input-dev registration to happen before
hidpp_unifying_init().

TL;DR:

1. There seems to be a good reason for the restart dance so
we need to fix it rather then remove it.

2. 498ba20690357 ("HID: logitech-hidpp: Don't restart communication if not necessary")
(from Bastien) re-introduces the race which the restart tries
to fix and should be reverted.

Regards,

Hans