On 3/3/2025 8:26 PM, Bryan O'Donoghue wrote: > On 03/03/2025 13:12, Vikash Garodia wrote: >> >> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote: >>> On 02/03/2025 11:58, Vedang Nagar wrote: >>>>> >>>>> The basic question : what is the lifetime of the data from RX interrupt to >>>>> consumption by another system agent, DSP, userspace, whatever ? >>>> As mentioned in [1], With the regular firmware, after RX interrupt the data >>>> can be considered as valid until next interrupt is raised, but with the rouge >>>> firmware, data can get invalid during the second read and our intention is to >>>> avoid out of bound access read because of such issues. >>> >>> This is definitely the part I don't compute. >>> >>> 1. RX interrupt >>> 2. Frame#0 Some amount of time data is always valid >> This is not correct. Its not the amount of time which determines the validity of >> the data, its the possibility of rogue firmware which, if incase, puts up the >> date in shared queue, would always be invalid, irrespective of time. >> >>> 3. RX interrupt - new data >>> 4. Frame#1 new data delivered into a buffer >>> >>> Are you describing a case between RX interrupts 1-3 or a case after 1-4? >>> >>> Why do we need to write code for rouge firmware anyway ? >> It is a way to prevent any possibility of OOB, similar to how any API does check >> for validity of any arguments passed to it, prior to processing. >>> >>> And the real question - if the data can be invalidated in the 1-3 window above >>> when is the safe time to snapshot that data ? >>> >>> We seem to have alot of submissions to deal with 'rouge' firmware without I >>> think properly describing the problem of the _expected_ data lifetime. >>> >>> So >>> >>> a) What is the expected data lifetime of an RX buffer between one >>> RX IRQ and the next ? >>> I hope the answer to this is - APSS owns the buffer. >>> This is BTW usually the case in these types of asymmetric setups >>> with a flag or some other kind of semaphore that indicates which >>> side of the data-exchange owns the buffer. >>> >>> b) In this rouge - buggy - firmware case what is the scope of the >>> potential race condition ? >>> >>> What I'd really like to know here is why we have to seemingly >>> memcpy() again and again in seemingly incongrous and not >>> immediately obvious places in the code. >>> >>> Would we not be better advised to do a memcpy() of the entire >>> RX frame in the RX IRQ handler path if as you appear to me >>> suggesting - the firmware can "race" with the APSS >>> i.e. the data-buffer ownership flag either doesn't work >>> or isn't respected by one side in the data-exchange. >>> >>> Can we please have a detailed description of the race condition here ? >> Below is the report which the reporter reported leading to OOB, let me know if >> you are unable to deduce the trail leading to OOB here. >> >> OOB read issue is in function event_seq_changed, please reference below code >> snippet: >> >> Buggy code snippet: >> >> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst, >> struct hfi_msg_event_notify_pkt *pkt) >> ... >> num_properties_changed = pkt->event_data2; //num_properties_changed is from >> message and is not validated. >> ... >> data_ptr = (u8 *)&pkt->ext_event_data[0]; >> do { >> ptype = *((u32 *)data_ptr); >> switch (ptype) { >> case HFI_PROPERTY_PARAM_FRAME_SIZE: >> data_ptr += sizeof(u32); >> frame_sz = (struct hfi_framesize *)data_ptr; >> event.width = frame_sz->width; >> ... >> } >> num_properties_changed--; >> } while (num_properties_changed > 0); >> ``` >> There is no validation against `num_properties_changed = pkt->event_data2`, so >> OOB read occurs. >>> >>> I don't doubt the new memcpy() makes sense to you but without this detailed >>> understanding of the underlying problem its virtually impossible to debate the >>> appropriate remediation - perhaps this patch you've submitted - or some other >>> solution. >>> >>> Sorry to dig into my trench here but, way more detail is needed. >>> >>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2- >>>> b4a0aad1069e@xxxxxxxxxx/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619 >>>>> >>>>> Why is it in this small specific window that the data can change but not >>>>> later ? What is the mechanism the data can change and how do the changes you >>>>> propose here address the data lifetime problem ? >>>> Currently this issue has been discovered by external researchers at this >>>> point, but if any such OOB issue is discovered at later point as well then we >>>> shall fix them as well. >>> >>> Right but, I'm looking for a detailed description of the problem. >>> >>> Can you describe from RX interrupt again what the expected data lifetime of the >>> RX frame is, which I hope we agree is until the next RX interrupt associated >>> with a given buffer with an ownership flag shared between firmware and APSS - >>> and then under what circumstances that "software contract" is being violated. >>> >>>> Also, with rougue firmware we cannot fix the data lifetime problem in my >>>> opinion, but atleast we can fix the out of bound issues. >>>>> >>>>> Without that context, I don't believe it is really possible to validate an >>>>> additional memcpy() here and there in the code as fixing anything. >>>> There is no additional memcpy() now in the v2 patch, but as part of the fix, >>>> we are just trying to retain the length of the packet which was being read in >>>> the first memcpy() to avoid the OOB read access. >>> >>> I can't make a suggestion because - personally speaking I still don't quite >>> understand the data-race you are describing. >> Go through the reports from the reporter, it was quite evident in leading upto >> OOB case. >> Putting up the sequence for you to go over the interrupt handling and message >> queue parsing of the packets from firmware >> 1. >> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082 >> 2. >> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816 >> 3. event handling (this particular case) >> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658 >> 4. >> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22 >> >> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from >> shared queue. >> >>> >>> I get that you say the firmware is breaking the contract but, without more >>> detail on _how_ it breaks that contract I don't think it's really possible to >>> validate your fix here, fixes anything. >>> >>> --- >>> bod >> >> Regards, >> Vikash > > I'll go through all of these links given here, thanks. I would request you to go through the description putup by the reporter of this OOB as well, i added in my earlier response. It provided a good background of how the firmware response can led to this particular OOB, atleast that was the source of OOB info for us. Regards, Vikash > > Whatever the result of the review, this detail needs to go into the commit log > so that a reviewer can reasonably read the problem description and evaluate > against submitted code as a fix. > > --- > bod