Re: [PATCH] Bluetooth: Limit duration of Remote Name Resolve

Marcel Holtmann <marcel@xxxxxxxxxxxx> · Fri, 29 Oct 2021 12:11:19 +0200

Hi Archie,

>>>>>> While a remote name request is in progress, we cannot scan. Normally this
>>>>>> is OK since we can expect it to finish within a short amount of time.
>>>>>> However, it is possible to discover lots of devices that
>>>>>> (1) require Remote Name Resolve
>>>>>> (2) are unresponsive to Remote Name Resolve
>>>>>> When this happens, we are stuck doing Remote Name Resolve until all of
>>>>>> them are done before we can continue scanning.
>>>>>> 
>>>>>> This patch adds a time limit to stop us spending too long on remote
>>>>>> name request. The limit is increased for every iteration where we fail
>>>>>> to complete the RNR in order to eventually solve all names.
>>>>>> 
>>>>>> Signed-off-by: Archie Pusaka <apusaka@xxxxxxxxxxxx>
>>>>>> Reviewed-by: Miao-chen Chou <mcchou@xxxxxxxxxxxx>
>>>>>> 
>>>>>> ---
>>>>>> Hi maintainers, we found one instance where a test device spends ~90
>>>>>> seconds to do Remote Name Resolving, hence this patch.
>>>>>> I think it's better if we reset the time limit to the default value
>>>>>> at some point, but I don't have a good proposal where to do that, so
>>>>>> in the end I didn't.
>>>>> 
>>>>> do you have a btmon trace for this as well?
>>>>> 
>>> Yes, but only from the scanning device side. It's all lined up with
>>> your expectation (e.g. receiving Page Timeout in RNR Complete event).
>>> 
>>>>> The HCI Remote Name Request is essentially a paging procedure and then a few LMP messages. It is fundamentally a connection request inside BR/EDR and if you have a remote device that has page scan disabled, but inquiry scan enabled, then you get into this funky situation. Sadly, the BR/EDR parts don’t give you any hint on this weird combination. You can't configure BlueZ that way since it is a really stupid setup and I remember that GAP doesn’t have this case either, but it can happen. So we might want to check if that is what happens. And of course it needs to be a Bluetooth 2.0 device or a device that doesn’t support Secure Simple Pairing. There is a chance of really bad radio interference, but that is then just bad luck and is only going to happen every once in a blue moon.
>>>> 
>>> It might be the case. I don't know the peer device, but it looks like
>>> the user has a lot of these exact peer devices sitting in the same
>>> room.
>>> Or another possibility would be the user just turned bluetooth off for
>>> these devices just after we scan them, such that they don't answer the
>>> RNR.
>>> 
>>>> I wonder what the remote sets as Page_Scan_Repetition_Mode in the
>>>> Inquiry Result. It seems quite weird that the spec allows such a state
>>>> but doesn't have a value to represent it in the inquiry result. Anyway,
>>>> I guess changing that now wouldn't really make any difference given
>>>> such a device is probably never going to be updated.
>>>> 
>>> The page scan repetition mode is R1
>> 
>> not sure if this actually matters if your clock drifted too much apart.
>> 
>>>>> That said, you should receive a Page Timeout in the Remote Name Request Complete event for what you describe. Or you just use HCI Remote Name Request Cancel to abort the paging. If I remember correctly, the setting for Page Timeout is also applied to the Remote Name resolving procedure. So we could tweak that value. Actually, once we get the “sync” work merged, we could configure a different Page Timeout for connection requests and name resolving if that would help. Not sure if this is worth it, since we could just as simply cancel the request.
>>>> 
>>>> If I recall this correctly we used to have something like that back in
>>>> the days the daemon had control over the discovery, the logic was that
>>>> each round of discovery including the name resolving had a fixed time
>>>> e.g. 10 sec, so if not all device found had their name resolved we
>>>> would stop and proceed to the next round that way we avoid this
>>>> problem of devices not resolving and nothing being discovered either.
>>>> Luckily today there might not be many devices around without EIR
>>>> including their names, but still I think it would be better to limit
>>>> the amount of time we spend resolving names. Also, it looks like it sets
>>>> NAME_NOT_KNOWN when RNR fails and it never proceeds to request the
>>>> name again, so I wonder why it would be waiting ~90 seconds. We don't
>>>> seem to change the page timeout, so it should be using the default,
>>>> which is 5.12s, so I think there is something else at play.
>>>> 
>>> Yeah, we received the Page Timeout after 5s, but then we proceed to
>>> continue RNR with the next device, which takes another 5s, and so on.
>>> A couple of these devices can push waiting time over 90s.
>>> Looking at this, I don't think cancelling RNR would help much.
>>> This patch would like to reintroduce the time limit, but I decided to
>>> make the time limit grow, otherwise the bad RNR might take the whole
>>> time limit and we can't resolve any names.
>> 
>> I am wondering if we should add a new flag to Device Found that will indicate Name Resolving Failed after the first Page Timeout and then bluetoothd can decide via Confirm Name mgmt command to trigger the resolving or not. We can even add a 0x02 for Don’t Care About The Name.
>> 
> This is a great idea.
> However I checked that we remove the discovered device cache after
> every scan iteration.
> While I am not clear about the purpose of the cache cleanup, I had
> assumed that keeping a list of devices with bad RNR record would go
> against the intention of cleaning up the cache.
> 
> If we are to keep a list of bad devices, we might as well take
> this record into account when sorting the RNR queue, so the bad
> devices will be sent to the back of the queue regardless of how good the
> RSSI is.

the inquiry cache is solely for name resolving and connection requests, so that you are able to fill in the right values to speed up the paging.

I think it is enough to include a flag in the Device Found message that hints at the resolving failure. We are sending two Device Found events anyway on success, so now we get one on failure as well. Then let bluetoothd do all the caching if it wants to.
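
A rough sketch of that split, just to illustrate the idea. Everything here is hypothetical: the ex_ names, the flag bit values and the simplified event struct are made up for this example and are not existing mgmt definitions. The kernel side only tags the event on a Page Timeout, and the daemon side keeps its own failure count and decides whether to ask for the name again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits, for illustration only (not real mgmt values). */
#define EX_DEV_FOUND_CONFIRM_NAME        (1u << 0)
#define EX_DEV_FOUND_NAME_RESOLVE_FAILED (1u << 1)

/* Simplified, hypothetical stand-in for a Device Found event. */
struct ex_dev_found {
	uint8_t  bdaddr[6];
	int8_t   rssi;
	uint32_t flags;
};

/* Kernel side: build the event sent after one Remote Name Request ends.
 * On a Page Timeout the failure flag is set instead of attaching a name.
 */
static struct ex_dev_found ex_name_resolve_done(const uint8_t bdaddr[6],
						int8_t rssi, bool page_timeout)
{
	struct ex_dev_found ev;

	memset(&ev, 0, sizeof(ev));
	memcpy(ev.bdaddr, bdaddr, sizeof(ev.bdaddr));
	ev.rssi = rssi;
	if (page_timeout)
		ev.flags |= EX_DEV_FOUND_NAME_RESOLVE_FAILED;

	return ev;
}

/* Daemon side: bump the per-device failure count kept in userspace. */
static unsigned int ex_update_failures(const struct ex_dev_found *ev,
				       unsigned int failures)
{
	assert(ev != NULL);

	if (ev->flags & EX_DEV_FOUND_NAME_RESOLVE_FAILED)
		return failures + 1;

	return failures;
}

/* Daemon side: policy used when answering a later name confirmation.
 * Returning false means "do not bother resolving this name again".
 */
static bool ex_want_name(unsigned int failures, unsigned int max_failures)
{
	return failures < max_failures;
}
```

With max_failures set to 1, a device that hit a Page Timeout once would be skipped in the following rounds, so a room full of unresponsive devices costs one timeout each instead of one per discovery round. The caching policy stays entirely in the daemon.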

Regards

Marcel