Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Vikash Garodia <quic_vgarodia@xxxxxxxxxxx> · Mon, 3 Mar 2025 20:31:06 +0530



On 3/3/2025 8:26 PM, Bryan O'Donoghue wrote:
> On 03/03/2025 13:12, Vikash Garodia wrote:
>>
>> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
>>> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>>>
>>>>> The basic question : what is the lifetime of the data from RX interrupt to
>>>>> consumption by another system agent, DSP, userspace, whatever ?
>>>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>>>> can be considered as valid until next interrupt is raised, but with the rouge
>>>> firmware, data can get invalid during the second read and our intention is to
>>>> avoid out of bound access read because of such issues.
>>>
>>> This is definitely the part I don't compute.
>>>
>>> 1. RX interrupt
>>> 2. Frame#0 Some amount of time data is always valid
>> This is not correct. Its not the amount of time which determines the validity of
>> the data, its the possibility of rogue firmware which, if incase, puts up the
>> date in shared queue, would always be invalid, irrespective of time.
>>
>>> 3. RX interrupt - new data
>>> 4. Frame#1 new data delivered into a buffer
>>>
>>> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
>>>
>>> Why do we need to write code for rouge firmware anyway ?
>> It is a way to prevent any possibility of OOB, similar to how any API does check
>> for validity of any arguments passed to it, prior to processing.
>>>
>>> And the real question - if the data can be invalidated in the 1-3 window above
>>> when is the safe time to snapshot that data ?
>>>
>>> We seem to have alot of submissions to deal with 'rouge' firmware without I
>>> think properly describing the problem of the _expected_ data lifetime.
>>>
>>> So
>>>
>>> a) What is the expected data lifetime of an RX buffer between one
>>>     RX IRQ and the next ?
>>>     I hope the answer to this is - APSS owns the buffer.
>>>     This is BTW usually the case in these types of asymmetric setups
>>>     with a flag or some other kind of semaphore that indicates which
>>>     side of the data-exchange owns the buffer.
>>>
>>> b) In this rouge - buggy - firmware case what is the scope of the
>>>     potential race condition ?
>>>
>>>     What I'd really like to know here is why we have to seemingly
>>>     memcpy() again and again in seemingly incongrous and not
>>>     immediately obvious places in the code.
>>>
>>>     Would we not be better advised to do a memcpy() of the entire
>>>     RX frame in the RX IRQ handler path if as you appear to me
>>>     suggesting - the firmware can "race" with the APSS
>>>     i.e. the data-buffer ownership flag either doesn't work
>>>     or isn't respected by one side in the data-exchange.
>>>
>>> Can we please have a detailed description of the race condition here ?
>> Below is the report which the reporter reported leading to OOB, let me know if
>> you are unable to deduce the trail leading to OOB here.
>>
>> OOB read issue is in function event_seq_changed, please reference below code
>> snippet:
>>
>> Buggy code snippet:
>>
>> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
>>          struct hfi_msg_event_notify_pkt *pkt)
>> ...
>> num_properties_changed = pkt->event_data2; //num_properties_changed is from
>> message and is not validated.
>> ...
>> data_ptr = (u8 *)&pkt->ext_event_data[0];
>> do {
>>   ptype = *((u32 *)data_ptr);
>>   switch (ptype) {
>>   case HFI_PROPERTY_PARAM_FRAME_SIZE:
>>    data_ptr += sizeof(u32);
>>    frame_sz = (struct hfi_framesize *)data_ptr;
>>    event.width = frame_sz->width;
>> ...
>>   }
>>   num_properties_changed--;
>> } while (num_properties_changed > 0);
>> ```
>> There is no validation against `num_properties_changed = pkt->event_data2`, so
>> OOB read occurs.
>>>
>>> I don't doubt the new memcpy() makes sense to you but without this detailed
>>> understanding of the underlying problem its virtually impossible to debate the
>>> appropriate remediation - perhaps this patch you've submitted - or some other
>>> solution.
>>>
>>> Sorry to dig into my trench here but, way more detail is needed.
>>>
>>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>>>> b4a0aad1069e@xxxxxxxxxx/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>>>
>>>>> Why is it in this small specific window that the data can change but not
>>>>> later ? What is the mechanism the data can change and how do the changes you
>>>>> propose here address the data lifetime problem ?
>>>> Currently this issue has been discovered by external researchers at this
>>>> point, but if any such OOB issue is discovered at later point as well then we
>>>> shall fix them as well.
>>>
>>> Right but, I'm looking for a detailed description of the problem.
>>>
>>> Can you describe from RX interrupt again what the expected data lifetime of the
>>> RX frame is, which I hope we agree is until the next RX interrupt associated
>>> with a given buffer with an ownership flag shared between firmware and APSS -
>>> and then under what circumstances that "software contract" is being violated.
>>>
>>>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>>>> opinion, but atleast we can fix the out of bound issues.
>>>>>
>>>>> Without that context, I don't believe it is really possible to validate an
>>>>> additional memcpy() here and there in the code as fixing anything.
>>>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>>>> we are just trying to retain the length of the packet which was being read in
>>>> the first memcpy() to avoid the OOB read access.
>>>
>>> I can't make a suggestion because - personally speaking I still don't quite
>>> understand the data-race you are describing.
>> Go through the reports from the reporter, it was quite evident in leading upto
>> OOB case.
>> Putting up the sequence for you to go over the interrupt handling and message
>> queue parsing of the packets from firmware
>> 1.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
>> 2.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
>> 3. event handling (this particular case)
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
>> 4.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22
>>
>> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
>> shared queue.
>>
>>>
>>> I get that you say the firmware is breaking the contract but, without more
>>> detail on _how_ it breaks that contract I don't think it's really possible to
>>> validate your fix here, fixes anything.
>>>
>>> ---
>>> bod
>>
>> Regards,
>> Vikash
> 
> I'll go through all of these links given here, thanks.
I would request you to go through the description putup by the reporter of this
OOB as well, i added in my earlier response. It provided a good background of
how the firmware response can led to this particular OOB, atleast that was the
source of OOB info for us.
Regards,
Vikash
> 
> Whatever the result of the review, this detail needs to go into the commit log
> so that a reviewer can reasonably read the problem description and evaluate
> against submitted code as a fix.
> 
> ---
> bod