Re: Crash in NVME tracing on 5.10LTS

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/13/24 2:34 AM, Greg KH wrote:
> On Thu, Jan 11, 2024 at 12:38:03PM -0700, Jens Axboe wrote:
>> On 1/11/24 10:00 AM, John Sperbeck wrote:
>>> On Thu, Jan 11, 2024 at 1:46?AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
>>>>> With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
>>>>> following tracing commands will trigger a crash due to a NULL pointer
>>>>> dereference:
>>>>>
>>>>> KDIR=/sys/kernel/debug/tracing
>>>>> echo 1 > $KDIR/tracing_on
>>>>> echo 1 > $KDIR/events/nvme/enable
>>>>> echo "Waiting for trace events..."
>>>>> cat $KDIR/trace_pipe
>>>>>
>>>>> The backtrace looks something like this:
>>>>>
>>>>> Call Trace:
>>>>>  <IRQ>
>>>>>  ? __die_body+0x6b/0xb0
>>>>>  ? __die+0x9e/0xb0
>>>>>  ? no_context+0x3eb/0x460
>>>>>  ? ttwu_do_activate+0xf0/0x120
>>>>>  ? __bad_area_nosemaphore+0x157/0x200
>>>>>  ? select_idle_sibling+0x2f/0x410
>>>>>  ? bad_area_nosemaphore+0x13/0x20
>>>>>  ? do_user_addr_fault+0x2ab/0x360
>>>>>  ? exc_page_fault+0x69/0x180
>>>>>  ? asm_exc_page_fault+0x1e/0x30
>>>>>  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
>>>>>  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
>>>>>  nvme_complete_rq+0x168/0x170
>>>>>  nvme_pci_complete_rq+0x16c/0x1f0
>>>>>  nvme_handle_cqe+0xde/0x190
>>>>>  nvme_irq+0x78/0x100
>>>>>  __handle_irq_event_percpu+0x77/0x1e0
>>>>>  handle_irq_event+0x54/0xb0
>>>>>  handle_edge_irq+0xdf/0x230
>>>>>  asm_call_irq_on_stack+0xf/0x20
>>>>>  </IRQ>
>>>>>  common_interrupt+0x9e/0x150
>>>>>  asm_common_interrupt+0x1e/0x40
>>>>>
>>>>> It looks to me like these two upstream commits were backported to 5.10:
>>>>>
>>>>> 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
>>>>> e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
>>>>>
>>>>> But they depend on this upstream commit to initialize the 'cmd' field in
>>>>> some cases:
>>>>>
>>>>> f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
>>>>>
>>>>> Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
>>>>>
>>>>
>>>> If you apply that commit, does it solve the issue for you?
>>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>
>>> The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
>>> upstream commit doesn't apply cleanly to 5.10LTS.  If I adjust it to
>>> fit, then the crash no longer occurs for me.
>>>
>>> A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in
>>> trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
>>>
>>> My leaning would be for a revert from 5.10LTS, but I think the
>>> maintainers would have better insight then me.  It's also possible
>>> that this isn't serious enough to worry about in general.  I don't
>>> really know.
>>
>> Either solution is fine with me, doesn't really matter. I was wondering
>> how this ended up in stable, and it looks like it was one of those
>> auto-selections... Those seem particularly dangerous the further back
>> you go.
> 
> Now reverted, thanks.  But note, that commit does say it fixes an issue
> this far back, which is why it was applied.

Yeah, looking into it, I can see how this was applied. And hard to know
not to, it's one of those boundary conditions that may happen for stable
backports and that only the ones intimate with the nvme code will spot.
Just one of those things, I think.

-- 
Jens Axboe





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux