Re: Reporting a use-after-free in amdgpu and null-ptr-deref

Joonkyo Jung <joonkyoj@xxxxxxxxxxxx> · Thu, 22 Feb 2024 11:42:56 +0900

Hi Vitaly,

Thank you for looking into this issue!

We have reproduced this issue with a Radeon RX 580 (Polaris 20) passthrough-ed to a QEMU (4.0.0) VM by VFIO.
All bugs were reproducible on the recent 6.8-rc4 Linux kernel (https://github.com/torvalds/linux/tree/v6.8-rc4), which I double checked right now with previous programs.

Below are the QEMU arguments used, in-VM lspci -vvv and /proc/cpuinfo.
Should you need any more information, please let us know.

QEMU arguments
qemu-system-x86_64
-m 2G \
-cpu host \
-kernel $KERNEL \
-append "console=ttyS0 root=/dev/sda earlyprintk=serial net.ifnames=0" \
-drive file=$DRIVE_FILE,format=qcow2 \
-enable-kvm \
-device vfio-pci,host=$PCI_ADDR,id=gpu,multifunction=on,x-vga=on \
-nographic

root@qemu:~# lspci -vvv
00:03.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 47)
        Subsystem: Gigabyte Technology Co., Ltd Ellesmere [Radeon RX 470/480/570/570X/580/580X/59]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <-
        Latency: 0
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at c000 [size=256]
        Region 5: Memory at feb80000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, Max1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPha-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee01004  Data: 0021
        Kernel driver in use: amdgpu

root@snapuzz:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 183
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
stepping        : 1
microcode       : 0x1
cpu MHz         : 2995.200
cache size      : 16384 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 31
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflushs
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexprioritg
bugs            : spectre_v1 spectre_v2 spec_store_bypass mds swapgs itlb_multihit mmio_unknown eb
bogomips        : 5990.40
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

On Thu, Feb 22, 2024 at 10:57 AM vitaly prosyak <vprosyak@xxxxxxx> wrote:

    Hi Joonkyo,
    Thanks for your reporting!
    I reproduced the first issue with 'amdgpu_gem_userptr_ioctl' when
      KAZAN enabled, but i could not reproduce the other two issues.
    Could you indicate what ASIC did you use to reproduce the the
      issue?
    Could you provide details of your system?
    Much appreciated for your responce.
    Vitaly

    I placed your findings below to keep the context and track the
      issues.

    ===========================================================================================================================

    Reporting a slab-use-after-free in amdgpu.eml
Subject: 
Reporting a slab-use-after-free in amdgpu
From: 
Joonkyo Jung <joonkyoj@xxxxxxxxxxxx>
Date: 
2024-02-16, 04:22
To: 
alexander.deucher@xxxxxxx, christian.koenig@xxxxxxx, Xinhui.Pan@xxxxxxx
CC: 
amd-gfx@xxxxxxxxxxxxxxxxxxxxx, Dokyung Song <dokyungs@xxxxxxxxxxxx>,  jisoo.jang@xxxxxxxxxxxx, yw9865@xxxxxxxxxxxx

Hello,

We
 would like to report a slab-use-after-free bug in the AMDGPU DRM driver
 in the linux kernel v6.8-rc4 that we found with our customized 
Syzkaller.
The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in succession.

In amdgpu_bo_move, struct ttm_resource *old_mem = bo->resource is assigned.
As you can see on the alloc & free stack calls, on the same function amdgpu_bo_move,
amdgpu_move_blit in the end frees bo->resource at ttm_bo_move_accel_cleanup with ttm_bo_wait_free_node(bo, man->use_tt).
But
 amdgpu_bo_move continues after that, reaching trace_amdgpu_bo_move(abo,
 new_mem->mem_type, old_mem->mem_type) at the end, causing the 
use-after-free bug.

Steps to reproduce are as below.
union drm_amdgpu_gem_create *arg1;

arg1 = malloc(sizeof(union drm_amdgpu_gem_create));
arg1->in.bo_size = 0x8;
arg1->in.alignment = 0x0;
arg1->in.domains = 0x4;
arg1->in.domain_flags = 0x9;
ioctl(fd, 0xc0206440, arg1);

arg1->in.bo_size = 0x7fffffff;
arg1->in.alignment = 0x0;
arg1->in.domains = 0x4;
arg1->in.domain_flags = 0x9;
ioctl(fd, 0xc0206440, arg1);

The KASAN report is as follows:
==================================================================
BUG: KASAN: slab-use-after-free in amdgpu_bo_move+0x1479/0x1550
Read of size 4 at addr ffff88800f5bee80 by task syz-executor/219
Call Trace:
 <TASK>
 amdgpu_bo_move+0x1479/0x1550
 ttm_bo_handle_move_mem+0x4d0/0x700
 ttm_mem_evict_first+0x945/0x1230
 ttm_bo_mem_space+0x6c7/0x940
 ttm_bo_validate+0x286/0x650
 ttm_bo_init_reserved+0x34c/0x490
 amdgpu_bo_create+0x94b/0x1610
 amdgpu_bo_create_user+0xa3/0x130
 amdgpu_gem_create_ioctl+0x4bc/0xc10
 drm_ioctl_kernel+0x300/0x410
 drm_ioctl+0x648/0xb30
 amdgpu_drm_ioctl+0xc8/0x160
 </TASK>

Allocated by task 219:
 kmalloc_trace+0x211/0x390
 amdgpu_vram_mgr_new+0x1d6/0xbe0
 ttm_resource_alloc+0xfd/0x1e0
 ttm_bo_mem_space+0x255/0x940
 ttm_bo_validate+0x286/0x650
 ttm_bo_init_reserved+0x34c/0x490
 amdgpu_bo_create+0x94b/0x1610
 amdgpu_bo_create_user+0xa3/0x130
 amdgpu_gem_create_ioctl+0x4bc/0xc10
 drm_ioctl_kernel+0x300/0x410
 drm_ioctl+0x648/0xb30
 amdgpu_drm_ioctl+0xc8/0x160

Freed by task 219:
 kfree+0x111/0x2d0
 ttm_resource_free+0x17e/0x1e0
 ttm_bo_move_accel_cleanup+0x77e/0x9b0
 amdgpu_move_blit+0x3db/0x670
 amdgpu_bo_move+0xfa2/0x1550
 ttm_bo_handle_move_mem+0x4d0/0x700
 ttm_mem_evict_first+0x945/0x1230
 ttm_bo_mem_space+0x6c7/0x940
 ttm_bo_validate+0x286/0x650
 ttm_bo_init_reserved+0x34c/0x490
 amdgpu_bo_create+0x94b/0x1610
 amdgpu_bo_create_user+0xa3/0x130
 amdgpu_gem_create_ioctl+0x4bc/0xc10
 drm_ioctl_kernel+0x300/0x410
 drm_ioctl+0x648/0xb30
 amdgpu_drm_ioctl+0xc8/0x160

The buggy address belongs to the object at ffff88800f5bee70
 which belongs to the cache kmalloc-96 of size 96
The buggy address is located 16 bytes inside of
 freed 96-byte region [ffff88800f5bee70, ffff88800f5beed0)

Should you need any more information, please do not hesitate to contact us.

Best regards,
Joonkyo Jung

Reporting a null-ptr-deref in amdgpu.eml
Subject: 
Reporting a null-ptr-deref in amdgpu
From: 
Joonkyo Jung <joonkyoj@xxxxxxxxxxxx>
Date: 
2024-02-16, 04:20
To: 
alexander.deucher@xxxxxxx, christian.koenig@xxxxxxx, Xinhui.Pan@xxxxxxx
CC: 
Dokyung Song <dokyungs@xxxxxxxxxxxx>, jisoo.jang@xxxxxxxxxxxx, yw9865@xxxxxxxxxxxx,  amd-gfx@xxxxxxxxxxxxxxxxxxxxx

Hello,

We
 would like to report a null-ptr-deref bug in the AMDGPU DRM driver in 
the linux kernel v6.8-rc4 that we found with our customized Syzkaller.
The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in succession.

The first ioctl amdgpu_ctx_ioctl will create a ctx, and return ctx_id = 1 to the userspace.
Second
 ioctl, actually any ioctl that will eventually call 
amdgpu_ctx_get_entity, carries this ctx_id and passes the context check.
Here, for example, drm_amdgpu_wait_cs.
Validations in amdgpu_ctx_get_entity can also be passed by the user-provided values from the ioctl arguments.
This
 eventually leads to drm_sched_entity_init, where the null-ptr-deref 
will trigger on RCU_INIT_POINTER(entity->last_scheduled, NULL);

Steps to reproduce are as below.
union drm_amdgpu_ctx *arg1;
union drm_amdgpu_wait_cs *arg2;

arg1 = malloc(sizeof(union drm_amdgpu_ctx));
arg2 = malloc(sizeof(union drm_amdgpu_wait_cs));

arg1->in.op = 0x1;
ioctl(AMDGPU_renderD128_DEVICE_FILE, 0x140106442, arg1);

arg2->in.handle = 0x0;
arg2->in.timeout = 0x2000000000000;
arg2->in.ip_type = 0x9;
arg2->in.ip_instance = 0x0;
arg2->in.ring = 0x0;
arg2->in.ctx_id = 0x1;
ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc0206449, arg2);

The KASAN report is as follows:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
Call Trace:
 <TASK>
 ? drm_sched_entity_init+0x16e/0x650
 ? drm_sched_entity_init+0x208/0x650
 amdgpu_ctx_get_entity+0x944/0xc30
 amdgpu_cs_wait_ioctl+0x13d/0x3f0
 drm_ioctl_kernel+0x300/0x410
 drm_ioctl+0x648/0xb30
 amdgpu_drm_ioctl+0xc8/0x160
 </TASK>

Should you need any more information, please do not hesitate to contact us.

Best regards,
Joonkyo Jung

Reporting a use-after-free in amdgpu.eml
Subject: 
Reporting a use-after-free in amdgpu
From: 
정준교 <joonkyoj@xxxxxxxxxxxx>
Date: 
2024-02-14, 21:34
To: 
alexander.deucher@xxxxxxx, christian.koenig@xxxxxxx, Xinhui.Pan@xxxxxxx
CC: 
amd-gfx@xxxxxxxxxxxxxxxxxxxxx, Dokyung Song <dokyungs@xxxxxxxxxxxx>,  jisoo.jang@xxxxxxxxxxxx, yw9865@xxxxxxxxxxxx

Hello,

We
 would like to report a use-after-free bug in the AMDGPU DRM driver in 
the linux kernel 6.2 that we found with our customized Syzkaller.
The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver, with invalid addr and size. 
We have tested that this bug can still be triggered in the latest RC kernel, v6.8-rc4.

Steps to reproduce are as below.

struct drm_amdgpu_gem_userptr *arg;
arg = malloc(sizeof(struct drm_amdgpu_gem_userptr));
arg->addr = 0xffffffffffff0000;
arg->size = 0x80000000;
arg->flags = 0x7;
ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc1186451, arg);

The KASAN report is as follows:
==================================================================
BUG: KASAN: use-after-free in switch_mm_irqs_off+0x89d/0xb70
Read of size 8 at addr ffff88801f17bc00 by task syz-executor/386
Call Trace:
<TASK>
switch_mm_irqs_off+0x89d/0xb70
__schedule+0xa62/0x2630
preempt_schedule_common+0x45/0xd0
vfree+0x4d/0x60
ttm_tt_fini+0xdf/0x1c0
amdgpu_ttm_backend_destroy+0x9f/0xe0
ttm_bo_cleanup_memtype_use+0x142/0x1f0
ttm_bo_release+0x67d/0xc00
ttm_bo_put+0x7c/0xa0
amdgpu_bo_unref+0x3b/0x80
amdgpu_gem_object_free+0x7f/0xc0
drm_gem_object_free+0x5d/0x90
amdgpu_gem_userptr_ioctl+0x452/0x7e0
drm_ioctl_kernel+0x284/0x500
drm_ioctl+0x55e/0xa50
amdgpu_drm_ioctl+0xe3/0x1d0
</TASK>

Allocated by task 385:
kmem_cache_alloc+0x174/0x300
copy_process+0x32d1/0x6640
kernel_clone+0xcd/0x690

Freed by task 386:
kmem_cache_free+0x13b/0x550
mmu_interval_notifier_remove+0x4c8/0x610
amdgpu_hmm_unregister+0x47/0x90
amdgpu_gem_object_free+0x75/0xc0
drm_gem_object_free+0x5d/0x90
amdgpu_gem_userptr_ioctl+0x452/0x7e0
drm_ioctl_kernel+0x284/0x500
drm_ioctl+0x55e/0xa50
amdgpu_drm_ioctl+0xe3/0x1d0

The buggy address belongs to the object at ffff88801f17bb80
which belongs to the cache mm_struct of size 2016
The buggy address is located 128 bytes inside of
2016-byte region [ffff88801f17bb80, ffff88801f17c360)

The buggy address belongs to the physical page:
page:000000002c2a61bd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1f178
head:000000002c2a61bd order:3 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
memcg:ffff8880141aa301
flags: 0x100000000010200(slab|head|node=0|zone=1)
raw: 0100000000010200 ffff88800a44fc80 ffffea00006ca400 dead000000000004
raw: 0000000000000000 00000000800f000f 00000001ffffffff ffff8880141aa301
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88801f17bb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88801f17bb80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88801f17bc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88801f17bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88801f17bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================