Gregory Price wrote:
> On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> > Summary:
> > --------
> >
> > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > regions, and more routinely, assembling a region from an existing
> > configuration established by platform-firmware. The latter is motivated
> > by CXL memory RAS (Reliability, Availability and Serviceability)
> > support, that requires associating device events with System Physical
> > Address ranges and vice versa.
> >
>
> Ok, I simplified down my tests and reverted a bunch of stuff; figured I
> should report this before I dive further in.
>
> Earlier I was carrying the DOE patches and others; I've dropped most of
> that to make sure I could replicate on the base kernel and QEMU images.
>
> QEMU branch:
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> This is a little out of date at this point, I think, but it shouldn't
> matter; the results are the same regardless of what else I pull in.
>
> Kernel branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

Note that I acted on this feedback from Greg to break out a fix and
merge it for v6.2-final:

http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@xxxxxxxxx

...i.e. you are missing at least the passthrough decoder fix, but that
would show up as a region creation failure, not a QEMU crash. So I
would move to testing cxl/next.

[..]
> Let's attempt to use the memory:
>
> [root@fedora ~]# numactl --membind=1 python
> KVM internal error.
> Suberror: 3
> extra data[0]: 0x0000000080000b0e
> extra data[1]: 0x0000000000000031
> extra data[2]: 0x0000000000000d81
> extra data[3]: 0x0000000390074ac0
> extra data[4]: 0x0000000000000010
> RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
> RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
> R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
> RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 0000000000000000 ffffffff 00c00000
> GS =0000 ffff99ec3bc00000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
> GDT=     fffffe1d13133000 0000007f
> IDT=     fffffe0000000000 00000fff
> CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000fffe0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01

At first glance that looks like a QEMU issue, but I would capture a
"cxl list -vvv" before attempting to use the memory just to verify the
decoder setup looks sane.

> I also tested lowering the RAM sizes (2GB RAM, 1GB "CXL") to see if
> there's something going on with the PCI hole or something, but no,
> same results.
>
> I double-checked whether there was an issue using a single root port,
> so I registered a second one - same results.
>
> In prior tests I accessed the memory directly via devmem2.
>
> This still works when mapping the memory manually:
>
> [root@fedora map] ./map_memory.sh
> echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
> echo 1 > /sys/bus/cxl/devices/region0/commit
>
> [root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
> /dev/mem opened.
> Memory mapped at address 0x7fb4d4ed3000.
> Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
> Written 0x12345678; readback 0x12345678

Likely it is sensitive to crossing an interleave threshold.

> This kind of implies there's a disagreement about the state of memory
> between Linux and QEMU.
>
> But even just onlining a region produces memory usage:
>
> [root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
> Node 1 MemTotal:  1048576 kB
> Node 1 MemFree:   1048112 kB
> Node 1 MemUsed:       464 kB
>
> Which I would expect to set off some fireworks.
>
> Maybe an issue at the NUMA level? I just... I have no idea.
>
> I will need to dig through the email chains to figure out what others
> have been doing that I'm missing. Everything *looks* nominal, but the
> reactors are exploding so... ¯\_(ツ)_/¯
>
> I'm not sure where to start here, but I'll bash my face on the keyboard
> for a bit until I have some ideas.

Not ruling out the driver yet, but Fan's tests with hardware have me
leaning more towards QEMU.
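As an aside on the "crossing an interleave threshold" theory: the CXL
spec's simple HDM decoder math for power-of-two interleave can be
sketched in a few lines. This is my own illustration, not driver code
(the function name is mine, and the XOR-based interleave variants are
not modeled): for a region with a given interleave_ways and
interleave_granularity, an HPA offset decodes to a target position and
a device-local DPA offset. With interleave_ways=1, as in the region
configured above, the decode is the identity, which is why a single
devmem2 poke can succeed even when broader use of the region falls
over at a ways/granularity boundary.

```python
def interleave_decode(hpa_offset, ways, granularity):
    """Decode an HPA offset within a region into
    (target position, device-local DPA offset).

    A sketch of the simple modulo decode for power-of-two
    interleave_ways/interleave_granularity; CXL's XOR-based
    interleave variants are not modeled here.
    """
    assert ways > 0 and granularity > 0
    # Which target in the interleave set services this offset.
    position = (hpa_offset // granularity) % ways
    # Offset within that target's contribution: whole rounds of the
    # interleave set collapse to one granule per target, plus the
    # remainder within the current granule.
    dpa_offset = (hpa_offset // (granularity * ways)) * granularity \
                 + hpa_offset % granularity
    return position, dpa_offset

# x1 region (ways=1): identity mapping, every offset hits target 0.
print(interleave_decode(0x1000, ways=1, granularity=4096))  # -> (0, 4096)
# x2 region: targets alternate at every 4KiB granularity boundary.
print(interleave_decode(0x0000, ways=2, granularity=4096))  # -> (0, 0)
print(interleave_decode(0x1000, ways=2, granularity=4096))  # -> (1, 0)
print(interleave_decode(0x2000, ways=2, granularity=4096))  # -> (0, 4096)
```

A write that lands entirely within one granule (like the devmem2 test)
only exercises one target; an access pattern that streams across
granule boundaries is what would expose a decoder/QEMU disagreement.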