Re: iio: iio-trig-hrtimer bug on suspend/resume when used with bmi160 and bmi323

Denis Benato <benato.denis96@xxxxxxxxx> · Sun, 9 Jun 2024 03:46:05 +0200



On 6/8/24 15:58, Jonathan Cameron wrote:
> On Sat, 8 Jun 2024 03:25:13 +0200
> Denis Benato <benato.denis96@xxxxxxxxx> wrote:
> 
>> On 6/2/24 18:08, Jonathan Cameron wrote:
>>> On Thu, 30 May 2024 16:07:22 +0200
>>> Denis Benato <benato.denis96@xxxxxxxxx> wrote:
>>>   
>>>> On 1/15/24 10:37, Jonathan Cameron wrote:  
>>>>> On Sun, 14 Jan 2024 21:04:12 +0100
>>>>> Denis Benato <benato.denis96@xxxxxxxxx> wrote:
>>>>>     
>>>>>> On 1/13/24 18:46, Jonathan Cameron wrote:    
>>>>>>> On Wed, 10 Jan 2024 23:35:01 +0100
>>>>>>> Denis Benato <benato.denis96@xxxxxxxxx> wrote:
>>>>>>>       
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> With this mail I am submitting bug report that is probably related to
>>>>>>>> iio-trig-hrtimer but there is also the possibility for it to be
>>>>>>>> specific to bmi160 and bmi323.
>>>>>>>>
>>>>>>>> The described problem have been reproduced on my handheld PC (Asus
>>>>>>>> RC71L) and in another handheld PC with two different gyroscope
>>>>>>>> drivers: bmi323 (backported by me on v6.7, on RC71L) and bmi160.
>>>>>>>>
>>>>>>>> My target hardware (RC71L that yeld to this discovery) has a bmi323
>>>>>>>> chip that does not have any interrupt pins reaching the CPU, yet I
>>>>>>>> need to fetch periodically data from said device, therefore I used
>>>>>>>> iio-trig-hrtimer: created a trigger, set the device and trigger
>>>>>>>> sampling frequencies, bound the trigger to the device and enabled
>>>>>>>> buffer: data is being read and available over /dev/iio:device0.
>>>>>>>>
>>>>>>>> While in this state if I suspend my handheld I receive (from dmesg)
>>>>>>>> the warning reported below and at resume data is not coming out of
>>>>>>>> the iio device and the hrtimer appears to not be working. If I create
>>>>>>>> a new trigger and bind the new trigger to said iio device and
>>>>>>>> re-enable buffer data does come out of /dev/iio:device0 once more,
>>>>>>>> until the next sleep.
>>>>>>>>
>>>>>>>> Since this is important to me I have taken the time to look at both
>>>>>>>> drivers and iio-trig-hrtimer and I have identified three possible
>>>>>>>> reasons:
>>>>>>>>
>>>>>>>> 1) iio-trig-hrtimer won't work after suspend regardless of how it is
>>>>>>>> used (this is what I believe is the cause)      
>>>>>>> me too.      
>>>>>> who and how should investigate this issue? Would putting a kprintf in the hrtimer callback be enough to check?    
>>>>> The warning you have pretty much points at this being the problem, but sure
>>>>> you could add a print there to be absolutely sure.
>>>>>     
>>>>>>
>>>>>>
>>>>>> Just to make sure I understood correctly: is this a separate issue from the warning I receive or are those linked?    
>>>>>
>>>>> I think it's all one issue.    
>>>> Hello,
>>>>
>>>> Sorry for the delay, I was able to find some time just recently.
>>>>
>>>> Sadly I don't think anymore this is just one issue:
>>>>
>>>> I have setup a proof of concept that at suspend sets a boolean to true and at resume sets the same boolean to false, and on trigger_handler returns IRQ_NONE if the device is sleeping (everything guarded by an additional mutex) and it solved the warning: no more regmap access are being performed after the device is put to sleep, but unfortunately the acquisition does not automatically resume after waking up the device.
>>>>
>>>> How shall we proceed?  
>>>
>>> It's been a while (and a busy day of reviewing) so I'm struggling
>>> a bit with the best away forwardHello,  
>>
>> Don't worry about it, I don't have much time available either. That stuff is as precious as gold! Thank you kindly for your time!
>>>
>>> A hackyish approach might be to mask the interrupt at device side
>>> (fake interrupt used as part of the tree from a trigger to a pollfunc).
>>> A tidy wrapped up version of this might be the best route forwards
>>> though it won't currently stop the hrtimer merrily poking interrupts
>>> at no oneI am sorry but I am not sure I understood this correctly: do you mean having something "in-between" iio-trig-hrtimer and the target iio device and use that to selectively poll the device? 
>>>
>>> As a hack, could you try having your suspend call disable_irq() on
>>> the irq found in iio_dev->poll_func->irq and reenable it in resume?
>>> That might be sufficient.  
>> I used disable_irq and enable_irq as suggested and it works perfectly: I tested it four times and all four times (in a row) data flow resumed after s2idle-suspending the device.
>>
> 
> Excellent.
Hello,

> 
>> I can do more testing, but this is already looking very good and I'm happy (I don't have knowledge about iio internals so I am unable to tell if what I did will introduce some race conditions).
> 
> On that note we definitely shouldn't directly mess with the IIO internals
> as that implementation has changed in the past and might change again (though
> it's been stable for over 10 years, so probably not ;). Hence we should
> wrap this up in something like iio_suspend_triggering()/ iio_resume_triggering() 
> which will then be called from the suspend and resume callbacks of
> drivers that need to ensure no accesses once suspended.
> 
I totally agree. I can try to draft something if you are ok with it. I have never done something like this to the kernel so guidance is very welcome.

>>
>> As a reference I leave here the github link to the branch I have been experimenting with: https://github.com/NeroReflex/linux/commits/bmi323-pollfunc/ (Please don't mind that proof-of-concept code being copied and partially commented from bmc150-accel).
>>
>> No errors are printed out and combining that info with the fact it's now working it can only means pollfunc and irq are both non-NULL and non-zero at both suspend and resume.
> 
> That's true in the case you care about, but if we were putting in place general infrastructure
> it will get a little more 'interesting'.
> 
Understood, thanks.

> For example if currently if the trigger has never been set for a device
> pollfunc->irq == 0, but we don't actually clear that value when we
> detach the trigger (previously it was never used after that detach so
> it's value was irrelevant.
> 
> So at minimum we need to fix that by resetting the irq value to 0 in
> iio_trigger_detach_poll_func(), and add the checks so that if there
> is no trigger we don't try to disable the irq.
> 
this will also minimize code path divergence for use-cases not affected: I like this.

> I'm nervous that it might be possible to suspend whilst also setting
> the trigger and hence we need locking. We can probably use the core
> state lock (iio_dev->mlock) for that but we'd need care as that's
> already taken in some of these paths.
> 
I'm nervous too, but using the same mutex that is used to set a trigger seems like the right thing to do.

>>>
>>> Check poll_func goes somewhere first though and that irq is > 0
>>> I think that will call iio_trig_subirq_mask() which should block
>>> any further interrupts until it's unmasked again.
>>>
>>> We'll need to ensure this doesn't race with pollfunc going away though
>>> which will make it more complex, but the above test should tell us
>>> if there is a fairly smooth path to making this work.
>>>
>>> I'll try and find time to do some testing myself, but it won't be
>>> for a few weeks :(
>>>
>>> Thanks,
>>>
>>> Jonathan  
>> Thank you again for your guidance,
> 
> No problem.
> 
> Jonathan
> 
Best regards,
Denis

>> Denis Benato
>>>
>>>   
>>>>>     
>>>>>>>       
>>>>>>>> 2) iio-trig-hrtimer is stopped by the -ESHTDOWN returned by the
>>>>>>>> function printing "Transfer while suspended", however that stack
>>>>>>>> trace does not include function calls related to iio-trig-hrtimer and
>>>>>>>> this seems less plausible 3) bmi160 and bmi323 appears to be similar
>>>>>>>> and maybe are sharing a common bug with suspend (this is also why I
>>>>>>>> have maintainers of those drivers in the recipient list)
>>>>>>>>
>>>>>>>> Thanks for your time, patience and understanding,      
>>>>>>>
>>>>>>> Hi Denis,
>>>>>>>
>>>>>>> I suspect this is the legacy of the platform I used to test the hrtimer
>>>>>>> and similar triggers on never had working suspend and resume (we ripped
>>>>>>> support for that board out of the kernel a while back now...)
>>>>>>> Hence those paths were never tested by me and others may not have cared
>>>>>>> about this particular case.
>>>>>>>
>>>>>>> Anyhow, so I think what is going on is fairly simple.
>>>>>>>
>>>>>>> There is no way for a driver to indicate to a trigger provided by a separate
>>>>>>> module / hardware device that it should stop triggering data capture.
>>>>>>> The driver in question doesn't block data capture when suspended, which
>>>>>>> would be easy enough to add but doesn't feel like the right solution.
>>>>>>>
>>>>>>> So my initial thought is that we should add suspend and resume callbacks to
>>>>>>> iio_trigger_ops and call them manually from iio device drivers in their
>>>>>>> suspend and resume callbacks.  These would simply pause whatever the
>>>>>>> trigger source was so that no attempts are made to trigger the use of
>>>>>>> the device when it is suspended.
>>>>>>>
>>>>>>> It gets a little messy though as triggers can be shared between
>>>>>>> multiple devices so we'd need to reference count suspend and resume
>>>>>>> for the trigger to make sure we only resume once all consumers of
>>>>>>> the trigger have said they are ready to cope with triggers again.
>>>>>>>
>>>>>>> As mentioned, the alternative would be to block the triggers at ingress
>>>>>>> to the bmi323 and bmi160 drivers.  There may be a helpful pm flag that could
>>>>>>> be used but if not, then setting an is_suspended flag under the data->mutex
>>>>>>> to avoid races. and reading it in the trigger_handler to decide whether
>>>>>>> to talk to the device should work.      
>>>>>> I was thinking of doing this too, but I don't know if adding a mutex to frequently invoked functions is going to introduce some timing problems and so I was waiting for some comments on that matter. If nothing bad is expected I can surely try it.    
>>>>>>>
>>>>>>> I'd kind of like the generic solution of actually letting the trigger
>>>>>>> know, but not sure how much work it would turn out to be.  Either way there
>>>>>>> are a lot of drivers to fix this problem in as in theory most triggers can
>>>>>>> be used with most drivers that support buffered data capture.
>>>>>>> There may also be fun corners where a hardware trigger from one IIO
>>>>>>> device A is being used by another B and the suspend timing is such that B
>>>>>>> finishing with the trigger results in A taking an action (in the try_reenable
>>>>>>> callback) that could result in bus traffic.
>>>>>>> That one is going to be much more fiddly to handle than the simpler case
>>>>>>> you have run into.      
>>>>>> Since more and more handheld PCs are coming and provided many vendors such as
>>>>>> asus tends to improve what they did built on previous generations I think a
>>>>>> general solution would be desirable.
>>>>>>
>>>>>> Plus there are handheld PCs that does not yet have a driver (bmi270) or are
>>>>>> using an existing one and it would be very difficult to fix it in every of
>>>>>> them as of now, in the future I fear it will become almost impossible or 
>>>>>> extremely time consuming as market expands.    
>>>>>
>>>>> Both solutions require specific calls to be added to every driver that might
>>>>> encounter this - so most drivers that support triggers other than the ones
>>>>> they supply.
>>>>>     
>>>>>>>
>>>>>>> Thanks for the detailed bug report btw.   To get it fixed a few
>>>>>>> questions:
>>>>>>> 1) Are you happy to test proposed fixes?
>>>>>>> 2) Do you want to have a go at fixing it yourself? (I'd suggest trying
>>>>>>>    the fix in the bmi323 driver first rather than looking at the other 
>>>>>>>    solution)
>>>>>>>    If we eventually implement the more general version, then a bmi323
>>>>>>>    additional protection against this problem would not be a problem.
>>>>>>>
>>>>>>> Clearly I'd like the answers to be yes to both questions, but up to you!
>>>>>>>
>>>>>>> Jonathan
>>>>>>>
>>>>>>>       
>>>>>> Hello Jonathan and thank you kindly for you answer,
>>>>>>
>>>>>> I am very interested in the iio ecosystem as in those aforementioned
>>>>>> handheld PCs the gyroscope plays the role as a mouse so it's a pretty
>>>>>> important input device.
>>>>>>
>>>>>> I am writing to lkml not just as a single developer, but as part of a
>>>>>>  larger community in this matter: this means we will be able to test
>>>>>> any solution and in more than just one hardware.
>>>>>>
>>>>>> To answers your questions:
>>>>>> 1) yes, we all will be very happy to
>>>>>> 2) as I am very busy right now I might be unable to do that immediately,
>>>>>> but I will surely do it if one general solution cannot be found or is impractical.
>>>>>>
>>>>>> As of my limited knowledge the number of drivers now existing that are affected
>>>>>> are 2, and the number of drivers that will be affected, once the driver is
>>>>>> written, will be at least 3.    
>>>>>
>>>>> The problem appears to be general unfortunately.
>>>>>
>>>>> I'll have to see if I can fire up something where I can actually test solutions
>>>>> and I'm afraid I also don't have a lot of time at the moment.
>>>>> Perhaps I can find time in the next few weeks to hack together a prototype
>>>>> for one of the drivers you can test.
>>>>>
>>>>> Jonathan
>>>>>     
>>>>>>
>>>>>> Thank you very much for your answer,
>>>>>> Denis
>>>>>>    
>>>>>     
>>>>
>>>> Thank you for you time and patience,
>>>> Denis  
>>>   
>>
>