Micron Confidential

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Sent: Wednesday, November 27, 2024 10:05 PM
>
> On Thu, 21 Nov 2024 10:18:41 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > The CXL specification release 3.2 is now available under a click
> > through at
> > https://computeexpresslink.org/cxl-specification/ and it brings new
> > shiny toys.
>
> If anyone wants to play, basic emulation on my CXL QEMU staging tree
> https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9

This is interesting. We will definitely try this and let you know how it goes.
>
> Branch with a few other things on top is:
> https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
>
> Note that this currently doesn't produce real data. I have a plan /
> initial PoC / hack to hook that up via an addition to the QEMU cache
> plugin and an external tool to emulate the hotness tracker counting
> hardware. Will be a little while before I get that finished, so in the
> meantime the above exercises the driver.
>
> Jonathan
>
> >
> > RFC reason
> > - Whilst trace capture with a particular configuration is potentially useful
> >   the intent is that CXL HMU units will be used to drive various forms of
> >   hotpage migration for memory tiering setups. This driver doesn't do this
> >   (yet), but rather provides data capture etc for experimentation and
> >   for working out how to mostly put the allocations in the right place to
> >   start with by tuning applications.
> >
> > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The
> > intent of this is to provide a way to establish which units of memory
> > (typically pages or larger) in CXL attached memory are hot. The
> > implementation details and algorithm are all implementation defined.
> > The specification simply describes the 'interface' which takes the
> > form of a ring buffer of hotness records in a PCI BAR and defined
> > capability, configuration and status registers.
> >
> > The hardware may have constraints on what it can track, granularity
> > etc and on how accurately it tracks (e.g. counter exhaustion,
> > inaccurate trackers). Some of these constraints are discoverable from
> > the hardware registers, others such as loss of accuracy have no
> > universally accepted measures as they are typically access pattern
> > dependent. Sadly it is very unlikely any hardware will implement a
> > truly precise tracker given the large resource requirements for
> > tracking at a useful granularity.
> >
> > There are two fundamental operation modes:
> >
> > * Epoch based. Counters are checked after a period of time (Epoch) and
> >   if over a threshold added to the hotlist.
> > * Always on. Counters run until a threshold is reached, after that the
> >   hot unit is added to the hotlist and the counter released.
> >
> > Counting can be filtered on:
> >
> > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > * Type of access - Trusted and non-trusted or non-trusted only, R/W/RW
> >
> > Sampling can be modified by:
> >
> > * Downsampling including potentially randomized downsampling.
> >
> > The driver presented here is intended to be useful in its own right
> > but also to act as the first step of a possible path towards hotness
> > monitoring based hot page migration. Those steps might look like:
> >
> > 1. Gather data - drivers provide telemetry like solutions to get that
> >    data.
> >    May be enhanced, for example in this driver by providing the
> >    HPA address rather than DPA Unit Address. Userspace can access enough
> >    information to do this so maybe not.
> > 2. Userspace algorithm development, possibly combined with userspace
> >    triggered migration by PA. Working out how to use different levels
> >    of constrained hardware resources will be challenging.
> > 3. Move those algorithms in kernel. Will require generalization across
> >    different hotpage trackers etc.
> >
> > So far this driver just gives access to the raw data. I will probably
> > kick off a longer discussion on how to do the adaptive sampling needed
> > to actually use these units for tiering etc, sometime soon (if no one
> > else beats me to it). There is a follow up topic of how to
> > virtualize this stuff for memory stranding cases (VM gets a fixed
> > mixture of fast and slow memory and should do its own tiering).
> >
> > More details in the Documentation patch but typical commands are:
> >
> > $perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> >  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> >  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> >  hotness_granual=12
> >
> > $perf report --dump-raw-traces
> >
> > Example output. With a counter_width of 16 (0x10) the least
> > significant 2 bytes (4 hex digits) are the counter value and the unit
> > index is bits 16-63.
> > Here all units are over the threshold and the indexes are 0,1,2 etc.
> >
> > . ...
> > CXL_HMU data: size 33512 bytes
> > Header 0: units: 29c counter_width 10
> > Header 1 : deadbeef
> > 0000000000000283
> > 0000000000010364
> > 0000000000020366
> > 000000000003033c
> > 0000000000040343
> > 00000000000502ff
> > 000000000006030d
> > 000000000007031a
> >
> > Which will produce a list of hotness entries.
> > Bits[N-1:0] counter value
> > Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> >            config to get to a Host Physical Address)
> >
> > Specific RFC questions.
> > - What should be in the header added to the aux buffer.
> >   Currently just the minimum is provided. Number of records
> >   and the counter width needed to decode them.
> > - Should we reset the counters when doing sampling "-F X"
> >   If the frequency is higher than the epoch we never see any hot units.
> >   If so, when should we reset them?
> >
> > Note testing has been light and on emulation only + as perf tool is a
> > pain to build on a stripped back VM, build testing has all been on
> > arm64 so far. The driver loads though on both arm64 and x86 so any
> > problems are likely in the perf tool arch specific code which is build
> > tested (on wrong machine).
> >
> > The QEMU emulation needs some cleanup, but I should be able to post
> > that shortly to let people actually play with this. There are lots of
> > open questions there on how 'right' we want the emulation to be and
> > what counting uarch to emulate.
> >
> > Jonathan Cameron (4):
> >   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> >   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> >   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> >   hwtrace: Document CXL Hotness Monitoring Unit driver
> >
> >  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> >  Documentation/trace/index.rst       |   1 +
> >  drivers/cxl/Kconfig                 |   6 +
> >  drivers/cxl/Makefile                |   3 +
> >  drivers/cxl/core/Makefile           |   1 +
> >  drivers/cxl/core/core.h             |   1 +
> >  drivers/cxl/core/hmu.c              |  64 ++
> >  drivers/cxl/core/port.c             |   2 +
> >  drivers/cxl/core/regs.c             |  14 +
> >  drivers/cxl/cxl.h                   |   5 +
> >  drivers/cxl/cxlpci.h                |   1 +
> >  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> >  drivers/cxl/hmu.h                   |  23 +
> >  drivers/cxl/pci.c                   |  26 +-
> >  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> >  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> >  tools/perf/util/Build               |   1 +
> >  tools/perf/util/auxtrace.c          |   4 +
> >  tools/perf/util/auxtrace.h          |   1 +
> >  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> >  tools/perf/util/cxl-hmu.h           |  18 +
> >  21 files changed, 1748 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/trace/cxl-hmu.rst
> >  create mode 100644 drivers/cxl/core/hmu.c
> >  create mode 100644 drivers/cxl/hmu.c
> >  create mode 100644 drivers/cxl/hmu.h
> >  create mode 100644 tools/perf/util/cxl-hmu.c
> >  create mode 100644 tools/perf/util/cxl-hmu.h
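
For anyone decoding the raw dump by hand, here is a minimal Python sketch of the record layout described in the cover letter (Bits[N-1:0] counter value, Bits[63:N] Unit ID, with N being the counter_width from the header). The names decode_record and unit_to_dpa are hypothetical helpers, not part of the driver or perf tool. The DPA calculation assumes the unit size is a power of two given as a shift (for example 12 for 4KiB units) and that units are numbered contiguously from a DPA base; neither assumption is confirmed by the posted patches. Translating DPA to HPA also needs the HDM decoder configuration, which is not attempted here.

```python
from typing import NamedTuple


class HotnessRecord(NamedTuple):
    unit_id: int   # Bits[63:N] of the raw record
    counter: int   # Bits[N-1:0] of the raw record


def decode_record(raw: int, counter_width: int) -> HotnessRecord:
    """Split one 64-bit hotness record into unit ID and counter value."""
    if not 0 < counter_width < 64:
        raise ValueError("counter_width must be between 1 and 63")
    mask = (1 << counter_width) - 1
    return HotnessRecord(unit_id=raw >> counter_width, counter=raw & mask)


def unit_to_dpa(unit_id: int, unit_shift: int, dpa_base: int = 0) -> int:
    """Assumed mapping: DPA = base + unit_id * unit size (2**unit_shift)."""
    return dpa_base + (unit_id << unit_shift)


if __name__ == "__main__":
    # First three records from the example output, counter_width = 0x10.
    for raw in (0x0000000000000283, 0x0000000000010364, 0x0000000000020366):
        rec = decode_record(raw, counter_width=16)
        dpa = unit_to_dpa(rec.unit_id, unit_shift=12)
        print(f"unit {rec.unit_id} dpa {dpa:#x} count {rec.counter}")
```

With the example records this gives unit indexes 0, 1, 2 and counter values 0x283, 0x364, 0x366, matching the description that the indexes simply count up.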