Accessing emulated CXL memory is unstable

Hello folks,

I experienced strange application crashes and internal KVM errors
while playing with emulated CXL type 3 memory. I would like to know
whether this is a real issue or I missed something during setup.

TL;DR: applications crash when accessing emulated CXL memory, and
stressing the VM subsystem (via stress-ng --bigheap) causes a KVM
internal error.

[ Details ]
qemu version is 8.1.50,
kernel version is 6.5.0,
ndctl version is 78.

The QEMU command line is:
qemu-system-x86_64 \
       -accel kvm \
        -cpu host \
        -smp 1 \
        -M q35,cxl=on \
        -m 4G,maxmem=8G,slots=4 \
        -object memory-backend-ram,id=vmem0,share=on,size=4G \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
        -kernel ../linux/arch/x86/boot/bzImage \
        -append 'console=ttyS0 nokaslr root=/dev/vda5' \
        -netdev user,id=n1 \
        -device virtio-net,netdev=n1,bus=pcie.0 \
        -blockdev driver=qcow2,file.filename=./rocky.qcow2,file.driver=file,node-name=blk1 \
        -device virtio-blk,drive=blk1,bus=pcie.0 \
        -nographic
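
As a sanity check in the guest, something like the following should show
the emulated memdev and the root decoder before any region is created
(I'm sketching the cxl-list flags from memory, so adjust as needed):

# cxl list -M -u
# cxl list -D -u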

[ Before creating a CXL RAM Region ]

# free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       275Mi       3.6Gi       8.0Mi       120Mi       3.6Gi
Swap:             0B          0B          0B
# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0
node 0 size: 3923 MB
node 0 free: 3713 MB
node distances:
node   0
  0:  10

[ After creating a CXL RAM Region ]

# cxl enable-memdev mem0
# cxl create-region -t ram -m mem0 -w 1 -d decoder0.0

I enabled DEV_DAX_CXL=y and DEV_DAX_KMEM=y, so when I create a CXL RAM
region it is onlined right away.
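
(If those options were built as modules instead, I believe the dax
device would have to be reconfigured by hand, roughly as below; dax0.0
is just a placeholder for whichever device the region ends up creating:)

# daxctl list
# daxctl reconfigure-device --mode=system-ram dax0.0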

# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       389Mi       7.5Gi       8.0Mi       122Mi       7.5Gi
Swap:             0B          0B          0B

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 3923 MB
node 0 free: 3631 MB
node 1 cpus:
node 1 size: 4096 MB
node 1 free: 4096 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

So far, everything looks fine,
until an application actually accesses this onlined CXL memory.

# numactl --membind=1 --show
[  139.495065] traps: numactl[493] trap invalid opcode ip:7f58d44b8f64
sp:7ffc2a273128 error:0 in libc.so.6[7f58d4428000+175000]
Illegal instruction (core dumped)

Hmm... it crashed with an 'invalid opcode'.
Is this because the instruction fetched by the CPU differs from what
was written to memory during exec()?

I don't know yet, but let's just continue.
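
A more controlled way to rule out the instruction-fetch angle would be
to touch node 1 as data only, for example with memhog from the numactl
package (just a sketch I haven't run here; the 64m size is arbitrary):

# numactl --membind=1 memhog 64m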

Let's stress the VM subsystem by allocating lots of heap memory:
# stress-ng --bigheap 1

stress-ng: info:  [496] defaulting to a 86400 second (1 day, 0.00
secs) run per stressor
stress-ng: info:  [496] dispatching hogs: 1 bigheap

^C
KVM internal error. Suberror: 1
extra data[0]: 0x0000000000000001
extra data[1]: 0x41c03127ae0f480f
extra data[2]: 0x8b4865ca010fc489
extra data[3]: 0x0000000000000031
extra data[4]: 0x0000000000000000
extra data[5]: 0x0000000000000000
extra data[6]: 0x0000000000000000
extra data[7]: 0x0000000000000000
emulation failure
RAX=0000000000000207 RBX=ffff88810cfb6a80 RCX=0000000000000000
RDX=0000000000000000
RSI=0000000000000207 RDI=00007f526d41b540 RBP=00007f526d41b540
RSP=ffffc90000a23d90
R8 =ffffc90000a23e88 R9 =0000000008000000 R10=0000000000000000
R11=0000000000000001
R12=0000000000000207 R13=00007f526d41b540 R14=0000000000000000
R15=ffff88810cfb5e80
RIP=ffffffff8103e626 RFL=00050246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007f526d41c740 ffffffff 00c00000
GS =0000 ffff88817bc00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00007f526d41b740 CR3=0000000104e62000 CR4=00750ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=5a 14 00 00 0f 01 cb 4c 89 e2 48 89 ef 44 89 e0 48 c1 ea 20 <48>
0f ae 27 31 c0 41 89 c4 0f 01 ca 65 48 8b 04 25 c0 ac 02 00 83 a8 54
0a 00 00 01 be 00

This time I pressed Ctrl+C, which caused a KVM internal error with suberror 1.
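
To narrow down whether the CXL-backed node is what triggers this, one
experiment would be to pin the stressor to each node in turn and compare
(a sketch, not something I have run; --timeout is only there to bound the run):

# numactl --membind=0 stress-ng --bigheap 1 --timeout 60s
# numactl --membind=1 stress-ng --bigheap 1 --timeout 60s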

On another try I increased the number of stressors and just waited,
and then a KVM internal error with suberror 3 occurred.

# stress-ng --bigheap 12
stress-ng: info:  [484] defaulting to a 86400 second (1 day, 0.00
secs) run per stressor
stress-ng: info:  [484] dispatching hogs: 12 bigheap

[   15.408360] traps: setroubleshootd[251] trap invalid opcode
ip:7fc7e92ff3eb sp:7ffc760f98a0 error:0 in
libpython3.9.so.1.0[7fc7e925a000+1b5000]
qemu-system-x86_64: virtio: bogus descriptor or out of resources
[   16.526518] virtio_blk virtio1: [vda] new size: 209715200 512-byte
logical blocks (107 GB/100 GiB)
[   16.611852] traps: systemd-udevd[131] trap invalid opcode
ip:7fe3c84bc58f sp:7ffefca00588 error:0 in
libc.so.6[7fe3c8428000+175000]
[   16.625903] traps: systemd[1] trap invalid opcode ip:7fe89c6ba1c0
sp:7ffdd9331b88 error:0 in libc.so.6[7fe89c628000+175000]

KVM internal error. Suberror: 3
extra data[0]: 0x0000000080000b0e
extra data[1]: 0x0000000000000031
extra data[2]: 0x0000000000000d82
extra data[3]: 0x0000000418442c90
extra data[4]: 0x0000000000000008
RAX=aaaaaaaaaaaaaaab RBX=0000000000000000 RCX=000000000000000c
RDX=0000000000000533
RSI=ffffffff828c80a0 RDI=ffffc900008bf018 RBP=ffffffff81f5bd54
RSP=ffffc900008bf000
R8 =ffffffff81f5bd50 R9 =0000000000000000 R10=0000000000000000
R11=0000000000000000
R12=000000000000000e R13=ffffffff81f5bd54 R14=ffffc900008bf0c8
R15=0000000000000001
RIP=ffffffff815dfbe8 RFL=00010096 [--S-AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007fe89c80cb40 ffffffff 00c00000
GS =0000 ffff88817bc00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=ffffffff815dfbe8 CR3=0000000418442000 CR4=00750ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 57 41 56 <41>
55 41 54 55 53 48 83 ec 08 4c 89 04 24 48 85 d2 74 4d 49 89 ff 49 89
f5 48 89 d3 49 89

--
Hyeonggon


