On Wed, Jun 21, 2023 at 8:24 PM Sebastian Wick <sebastian.wick@xxxxxxxxxx> wrote:
>
> On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty
> <aravind.iddamsetty@xxxxxxxxx> wrote:
> >
> > Our hardware supports RAS(Reliability, Availability, Serviceability) by
> > exposing a set of error counters which can be used by observability
> > tools to take corrective actions or repairs. Traditionally there were
> > being exposed via PMU (for relative counters) and sysfs interface (for
> > absolute value) in our internal branch. But, due to the limitations in
> > this approach to use two interfaces and also not able to have an event
> > based reporting or configurability, an alternative approach to try
> > netlink was suggested by community for drm subsystem wide UAPI for RAS
> > and telemetry as discussed in [1].
> >
> > This [1] is the inspiration to this series. It uses the generic
> > netlink(genl) family subsystem and exposes a set of commands that can
> > be used by every drm driver, the framework provides a means to have
> > custom commands too. Each drm driver instance in this example xe driver
> > instance registers a family and operations to the genl subsystem through
> > which it enumerates and reports the error counters. An event based
> > notification is also supported to which userpace can subscribe to and
> > be notified when any error occurs and read the error counter this avoids
> > continuous polling on error counter. This can also be extended to
> > threshold based notification.
>
> Be aware that netlink can be quite awkward in user space because it's
> attached to the netns while the device is in the mount ns and there
> are special rules for netlink regarding namespacing.

I agree, we need to be sure this works in all common deployments,
mainly Docker and Kubernetes, before deciding to go down this path.
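
To make the namespace concern concrete, here is a rough sketch (not part
of the series) of how a tool could check whether it shares a network
namespace with another process. It relies only on the documented fact
that two /proc/<pid>/ns/net entries refer to the same namespace when
their device and inode numbers match. The helper names are made up for
illustration, and comparing against PID 1 only says something about the
host when the caller also sits in the host PID namespace.

```python
import os


def same_namespace(path_a: str, path_b: str) -> bool:
    """Return True if both paths resolve to the same kernel object.

    For /proc/<pid>/ns/* entries, a matching (st_dev, st_ino) pair means
    the two processes are in the same namespace.
    """
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def shares_netns_with(pid: int) -> bool:
    """Hypothetical helper: compare our netns with that of `pid`."""
    return same_namespace("/proc/self/ns/net", f"/proc/{pid}/ns/net")


if __name__ == "__main__":
    try:
        # Assumption: PID 1 stands in for "the host"; inside a container
        # with its own PID namespace this is the container's init instead.
        print("same netns as PID 1:", shares_netns_with(1))
    except OSError as exc:
        # Reading another process's ns links may need extra privileges.
        print("could not inspect namespaces:", exc)
```

A container that can open /dev/dri/card1 through a bind mount but reports
a different netns here is exactly the case that needs testing.
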
Oded

>
> > [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> >
> > this series is on top of https://patchwork.freedesktop.org/series/116181/
> >
> > Below is an example tool drm_ras which demonstrates the use of the
> > supported commands. The tool will be sent to ML with the subject
> > "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> >
> > read single error counter:
> >
> > $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> > counter value 0
> >
> > read all error counters:
> >
> > $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> > name                                    config-id               counter
> >
> > error-gt0-correctable-guc               0x0000000000000001      0
> > error-gt0-correctable-slm               0x0000000000000003      0
> > error-gt0-correctable-eu-ic             0x0000000000000004      0
> > error-gt0-correctable-eu-grf            0x0000000000000005      0
> > error-gt0-fatal-guc                     0x0000000000000009      0
> > error-gt0-fatal-slm                     0x000000000000000d      0
> > error-gt0-fatal-eu-grf                  0x000000000000000f      0
> > error-gt0-fatal-fpu                     0x0000000000000010      0
> > error-gt0-fatal-tlb                     0x0000000000000011      0
> > error-gt0-fatal-l3-fabric               0x0000000000000012      0
> > error-gt0-correctable-subslice          0x0000000000000013      0
> > error-gt0-correctable-l3bank            0x0000000000000014      0
> > error-gt0-fatal-subslice                0x0000000000000015      0
> > error-gt0-fatal-l3bank                  0x0000000000000016      0
> > error-gt0-sgunit-correctable            0x0000000000000017      0
> > error-gt0-sgunit-nonfatal               0x0000000000000018      0
> > error-gt0-sgunit-fatal                  0x0000000000000019      0
> > error-gt0-soc-fatal-psf-csc-0           0x000000000000001a      0
> > error-gt0-soc-fatal-psf-csc-1           0x000000000000001b      0
> > error-gt0-soc-fatal-psf-csc-2           0x000000000000001c      0
> > error-gt0-soc-fatal-punit               0x000000000000001d      0
> > error-gt0-soc-fatal-psf-0               0x000000000000001e      0
> > error-gt0-soc-fatal-psf-1               0x000000000000001f      0
> > error-gt0-soc-fatal-psf-2               0x0000000000000020      0
> > error-gt0-soc-fatal-cd0                 0x0000000000000021      0
> > error-gt0-soc-fatal-cd0-mdfi            0x0000000000000022      0
> > error-gt0-soc-fatal-mdfi-east           0x0000000000000023      0
> > error-gt0-soc-fatal-mdfi-south          0x0000000000000024      0
> > error-gt0-soc-fatal-hbm-ss0-0           0x0000000000000025      0
> > error-gt0-soc-fatal-hbm-ss0-1           0x0000000000000026      0
> > error-gt0-soc-fatal-hbm-ss0-2           0x0000000000000027      0
> > error-gt0-soc-fatal-hbm-ss0-3           0x0000000000000028      0
> > error-gt0-soc-fatal-hbm-ss0-4           0x0000000000000029      0
> > error-gt0-soc-fatal-hbm-ss0-5           0x000000000000002a      0
> > error-gt0-soc-fatal-hbm-ss0-6           0x000000000000002b      0
> > error-gt0-soc-fatal-hbm-ss0-7           0x000000000000002c      0
> > error-gt0-soc-fatal-hbm-ss1-0           0x000000000000002d      0
> > error-gt0-soc-fatal-hbm-ss1-1           0x000000000000002e      0
> > error-gt0-soc-fatal-hbm-ss1-2           0x000000000000002f      0
> > error-gt0-soc-fatal-hbm-ss1-3           0x0000000000000030      0
> > error-gt0-soc-fatal-hbm-ss1-4           0x0000000000000031      0
> > error-gt0-soc-fatal-hbm-ss1-5           0x0000000000000032      0
> > error-gt0-soc-fatal-hbm-ss1-6           0x0000000000000033      0
> > error-gt0-soc-fatal-hbm-ss1-7           0x0000000000000034      0
> > error-gt0-soc-fatal-hbm-ss2-0           0x0000000000000035      0
> > error-gt0-soc-fatal-hbm-ss2-1           0x0000000000000036      0
> > error-gt0-soc-fatal-hbm-ss2-2           0x0000000000000037      0
> > error-gt0-soc-fatal-hbm-ss2-3           0x0000000000000038      0
> > error-gt0-soc-fatal-hbm-ss2-4           0x0000000000000039      0
> > error-gt0-soc-fatal-hbm-ss2-5           0x000000000000003a      0
> > error-gt0-soc-fatal-hbm-ss2-6           0x000000000000003b      0
> > error-gt0-soc-fatal-hbm-ss2-7           0x000000000000003c      0
> > error-gt0-soc-fatal-hbm-ss3-0           0x000000000000003d      0
> > error-gt0-soc-fatal-hbm-ss3-1           0x000000000000003e      0
> > error-gt0-soc-fatal-hbm-ss3-2           0x000000000000003f      0
> > error-gt0-soc-fatal-hbm-ss3-3           0x0000000000000040      0
> > error-gt0-soc-fatal-hbm-ss3-4           0x0000000000000041      0
> > error-gt0-soc-fatal-hbm-ss3-5           0x0000000000000042      0
> > error-gt0-soc-fatal-hbm-ss3-6           0x0000000000000043      0
> > error-gt0-soc-fatal-hbm-ss3-7           0x0000000000000044      0
> > error-gt0-gsc-correctable-sram-ecc      0x0000000000000045      0
> > error-gt0-gsc-nonfatal-mia-shutdown     0x0000000000000046      0
> > error-gt0-gsc-nonfatal-mia-int          0x0000000000000047      0
> > error-gt0-gsc-nonfatal-sram-ecc         0x0000000000000048      0
> > error-gt0-gsc-nonfatal-wdg-timeout      0x0000000000000049      0
> > error-gt0-gsc-nonfatal-rom-parity       0x000000000000004a      0
> > error-gt0-gsc-nonfatal-ucode-parity     0x000000000000004b      0
> > error-gt0-gsc-nonfatal-glitch-det       0x000000000000004c      0
> > error-gt0-gsc-nonfatal-fuse-pull        0x000000000000004d      0
> > error-gt0-gsc-nonfatal-fuse-crc-check   0x000000000000004e      0
> > error-gt0-gsc-nonfatal-selfmbist        0x000000000000004f      0
> > error-gt0-gsc-nonfatal-aon-parity       0x0000000000000050      0
> > error-gt1-correctable-guc               0x1000000000000001      0
> > error-gt1-correctable-slm               0x1000000000000003      0
> > error-gt1-correctable-eu-ic             0x1000000000000004      0
> > error-gt1-correctable-eu-grf            0x1000000000000005      0
> > error-gt1-fatal-guc                     0x1000000000000009      0
> > error-gt1-fatal-slm                     0x100000000000000d      0
> > error-gt1-fatal-eu-grf                  0x100000000000000f      0
> > error-gt1-fatal-fpu                     0x1000000000000010      0
> > error-gt1-fatal-tlb                     0x1000000000000011      0
> > error-gt1-fatal-l3-fabric               0x1000000000000012      0
> > error-gt1-correctable-subslice          0x1000000000000013      0
> > error-gt1-correctable-l3bank            0x1000000000000014      0
> > error-gt1-fatal-subslice                0x1000000000000015      0
> > error-gt1-fatal-l3bank                  0x1000000000000016      0
> > error-gt1-sgunit-correctable            0x1000000000000017      0
> > error-gt1-sgunit-nonfatal               0x1000000000000018      0
> > error-gt1-sgunit-fatal                  0x1000000000000019      0
> > error-gt1-soc-fatal-psf-csc-0           0x100000000000001a      0
> > error-gt1-soc-fatal-psf-csc-1           0x100000000000001b      0
> > error-gt1-soc-fatal-psf-csc-2           0x100000000000001c      0
> > error-gt1-soc-fatal-punit               0x100000000000001d      0
> > error-gt1-soc-fatal-psf-0               0x100000000000001e      0
> > error-gt1-soc-fatal-psf-1               0x100000000000001f      0
> > error-gt1-soc-fatal-psf-2               0x1000000000000020      0
> > error-gt1-soc-fatal-cd0                 0x1000000000000021      0
> > error-gt1-soc-fatal-cd0-mdfi            0x1000000000000022      0
> > error-gt1-soc-fatal-mdfi-east           0x1000000000000023      0
> > error-gt1-soc-fatal-mdfi-south          0x1000000000000024      0
> > error-gt1-soc-fatal-hbm-ss0-0           0x1000000000000025      0
> > error-gt1-soc-fatal-hbm-ss0-1           0x1000000000000026      0
> > error-gt1-soc-fatal-hbm-ss0-2           0x1000000000000027      0
> > error-gt1-soc-fatal-hbm-ss0-3           0x1000000000000028      0
> > error-gt1-soc-fatal-hbm-ss0-4           0x1000000000000029      0
> > error-gt1-soc-fatal-hbm-ss0-5           0x100000000000002a      0
> > error-gt1-soc-fatal-hbm-ss0-6           0x100000000000002b      0
> > error-gt1-soc-fatal-hbm-ss0-7           0x100000000000002c      0
> > error-gt1-soc-fatal-hbm-ss1-0           0x100000000000002d      0
> > error-gt1-soc-fatal-hbm-ss1-1           0x100000000000002e      0
> > error-gt1-soc-fatal-hbm-ss1-2           0x100000000000002f      0
> > error-gt1-soc-fatal-hbm-ss1-3           0x1000000000000030      0
> > error-gt1-soc-fatal-hbm-ss1-4           0x1000000000000031      0
> > error-gt1-soc-fatal-hbm-ss1-5           0x1000000000000032      0
> > error-gt1-soc-fatal-hbm-ss1-6           0x1000000000000033      0
> > error-gt1-soc-fatal-hbm-ss1-7           0x1000000000000034      0
> > error-gt1-soc-fatal-hbm-ss2-0           0x1000000000000035      0
> > error-gt1-soc-fatal-hbm-ss2-1           0x1000000000000036      0
> > error-gt1-soc-fatal-hbm-ss2-2           0x1000000000000037      0
> > error-gt1-soc-fatal-hbm-ss2-3           0x1000000000000038      0
> > error-gt1-soc-fatal-hbm-ss2-4           0x1000000000000039      0
> > error-gt1-soc-fatal-hbm-ss2-5           0x100000000000003a      0
> > error-gt1-soc-fatal-hbm-ss2-6           0x100000000000003b      0
> > error-gt1-soc-fatal-hbm-ss2-7           0x100000000000003c      0
> > error-gt1-soc-fatal-hbm-ss3-0           0x100000000000003d      0
> > error-gt1-soc-fatal-hbm-ss3-1           0x100000000000003e      0
> > error-gt1-soc-fatal-hbm-ss3-2           0x100000000000003f      0
> > error-gt1-soc-fatal-hbm-ss3-3           0x1000000000000040      0
> > error-gt1-soc-fatal-hbm-ss3-4           0x1000000000000041      0
> > error-gt1-soc-fatal-hbm-ss3-5           0x1000000000000042      0
> > error-gt1-soc-fatal-hbm-ss3-6           0x1000000000000043      0
> > error-gt1-soc-fatal-hbm-ss3-7           0x1000000000000044      0
> >
> > wait on a error event:
> >
> > $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> > waiting for error event
> > error event received
> > counter value 0
> >
> > list all errors:
> >
> > $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> > name                                    config-id
> >
> > error-gt0-correctable-guc               0x0000000000000001
> > error-gt0-correctable-slm               0x0000000000000003
> > error-gt0-correctable-eu-ic             0x0000000000000004
> > error-gt0-correctable-eu-grf            0x0000000000000005
> > error-gt0-fatal-guc                     0x0000000000000009
> > error-gt0-fatal-slm                     0x000000000000000d
> > error-gt0-fatal-eu-grf                  0x000000000000000f
> > error-gt0-fatal-fpu                     0x0000000000000010
> > error-gt0-fatal-tlb                     0x0000000000000011
> > error-gt0-fatal-l3-fabric               0x0000000000000012
> > error-gt0-correctable-subslice          0x0000000000000013
> > error-gt0-correctable-l3bank            0x0000000000000014
> > error-gt0-fatal-subslice                0x0000000000000015
> > error-gt0-fatal-l3bank                  0x0000000000000016
> > error-gt0-sgunit-correctable            0x0000000000000017
> > error-gt0-sgunit-nonfatal               0x0000000000000018
> > error-gt0-sgunit-fatal                  0x0000000000000019
> > error-gt0-soc-fatal-psf-csc-0           0x000000000000001a
> > error-gt0-soc-fatal-psf-csc-1           0x000000000000001b
> > error-gt0-soc-fatal-psf-csc-2           0x000000000000001c
> > error-gt0-soc-fatal-punit               0x000000000000001d
> > error-gt0-soc-fatal-psf-0               0x000000000000001e
> > error-gt0-soc-fatal-psf-1               0x000000000000001f
> > error-gt0-soc-fatal-psf-2               0x0000000000000020
> > error-gt0-soc-fatal-cd0                 0x0000000000000021
> > error-gt0-soc-fatal-cd0-mdfi            0x0000000000000022
> > error-gt0-soc-fatal-mdfi-east           0x0000000000000023
> > error-gt0-soc-fatal-mdfi-south          0x0000000000000024
> > error-gt0-soc-fatal-hbm-ss0-0           0x0000000000000025
> > error-gt0-soc-fatal-hbm-ss0-1           0x0000000000000026
> > error-gt0-soc-fatal-hbm-ss0-2           0x0000000000000027
> > error-gt0-soc-fatal-hbm-ss0-3           0x0000000000000028
> > error-gt0-soc-fatal-hbm-ss0-4           0x0000000000000029
> > error-gt0-soc-fatal-hbm-ss0-5           0x000000000000002a
> > error-gt0-soc-fatal-hbm-ss0-6           0x000000000000002b
> > error-gt0-soc-fatal-hbm-ss0-7           0x000000000000002c
> > error-gt0-soc-fatal-hbm-ss1-0           0x000000000000002d
> > error-gt0-soc-fatal-hbm-ss1-1           0x000000000000002e
> > error-gt0-soc-fatal-hbm-ss1-2           0x000000000000002f
> > error-gt0-soc-fatal-hbm-ss1-3           0x0000000000000030
> > error-gt0-soc-fatal-hbm-ss1-4           0x0000000000000031
> > error-gt0-soc-fatal-hbm-ss1-5           0x0000000000000032
> > error-gt0-soc-fatal-hbm-ss1-6           0x0000000000000033
> > error-gt0-soc-fatal-hbm-ss1-7           0x0000000000000034
> > error-gt0-soc-fatal-hbm-ss2-0           0x0000000000000035
> > error-gt0-soc-fatal-hbm-ss2-1           0x0000000000000036
> > error-gt0-soc-fatal-hbm-ss2-2           0x0000000000000037
> > error-gt0-soc-fatal-hbm-ss2-3           0x0000000000000038
> > error-gt0-soc-fatal-hbm-ss2-4           0x0000000000000039
> > error-gt0-soc-fatal-hbm-ss2-5           0x000000000000003a
> > error-gt0-soc-fatal-hbm-ss2-6           0x000000000000003b
> > error-gt0-soc-fatal-hbm-ss2-7           0x000000000000003c
> > error-gt0-soc-fatal-hbm-ss3-0           0x000000000000003d
> > error-gt0-soc-fatal-hbm-ss3-1           0x000000000000003e
> > error-gt0-soc-fatal-hbm-ss3-2           0x000000000000003f
> > error-gt0-soc-fatal-hbm-ss3-3           0x0000000000000040
> > error-gt0-soc-fatal-hbm-ss3-4           0x0000000000000041
> > error-gt0-soc-fatal-hbm-ss3-5           0x0000000000000042
> > error-gt0-soc-fatal-hbm-ss3-6           0x0000000000000043
> > error-gt0-soc-fatal-hbm-ss3-7           0x0000000000000044
> > error-gt0-gsc-correctable-sram-ecc      0x0000000000000045
> > error-gt0-gsc-nonfatal-mia-shutdown     0x0000000000000046
> > error-gt0-gsc-nonfatal-mia-int          0x0000000000000047
> > error-gt0-gsc-nonfatal-sram-ecc         0x0000000000000048
> > error-gt0-gsc-nonfatal-wdg-timeout      0x0000000000000049
> > error-gt0-gsc-nonfatal-rom-parity       0x000000000000004a
> > error-gt0-gsc-nonfatal-ucode-parity     0x000000000000004b
> > error-gt0-gsc-nonfatal-glitch-det       0x000000000000004c
> > error-gt0-gsc-nonfatal-fuse-pull        0x000000000000004d
> > error-gt0-gsc-nonfatal-fuse-crc-check   0x000000000000004e
> > error-gt0-gsc-nonfatal-selfmbist        0x000000000000004f
> > error-gt0-gsc-nonfatal-aon-parity       0x0000000000000050
> > error-gt1-correctable-guc               0x1000000000000001
> > error-gt1-correctable-slm               0x1000000000000003
> > error-gt1-correctable-eu-ic             0x1000000000000004
> > error-gt1-correctable-eu-grf            0x1000000000000005
> > error-gt1-fatal-guc                     0x1000000000000009
> > error-gt1-fatal-slm                     0x100000000000000d
> > error-gt1-fatal-eu-grf                  0x100000000000000f
> > error-gt1-fatal-fpu                     0x1000000000000010
> > error-gt1-fatal-tlb                     0x1000000000000011
> > error-gt1-fatal-l3-fabric               0x1000000000000012
> > error-gt1-correctable-subslice          0x1000000000000013
> > error-gt1-correctable-l3bank            0x1000000000000014
> > error-gt1-fatal-subslice                0x1000000000000015
> > error-gt1-fatal-l3bank                  0x1000000000000016
> > error-gt1-sgunit-correctable            0x1000000000000017
> > error-gt1-sgunit-nonfatal               0x1000000000000018
> > error-gt1-sgunit-fatal                  0x1000000000000019
> > error-gt1-soc-fatal-psf-csc-0           0x100000000000001a
> > error-gt1-soc-fatal-psf-csc-1           0x100000000000001b
> > error-gt1-soc-fatal-psf-csc-2           0x100000000000001c
> > error-gt1-soc-fatal-punit               0x100000000000001d
> > error-gt1-soc-fatal-psf-0               0x100000000000001e
> > error-gt1-soc-fatal-psf-1               0x100000000000001f
> > error-gt1-soc-fatal-psf-2               0x1000000000000020
> > error-gt1-soc-fatal-cd0                 0x1000000000000021
> > error-gt1-soc-fatal-cd0-mdfi            0x1000000000000022
> > error-gt1-soc-fatal-mdfi-east           0x1000000000000023
> > error-gt1-soc-fatal-mdfi-south          0x1000000000000024
> > error-gt1-soc-fatal-hbm-ss0-0           0x1000000000000025
> > error-gt1-soc-fatal-hbm-ss0-1           0x1000000000000026
> > error-gt1-soc-fatal-hbm-ss0-2           0x1000000000000027
> > error-gt1-soc-fatal-hbm-ss0-3           0x1000000000000028
> > error-gt1-soc-fatal-hbm-ss0-4           0x1000000000000029
> > error-gt1-soc-fatal-hbm-ss0-5           0x100000000000002a
> > error-gt1-soc-fatal-hbm-ss0-6           0x100000000000002b
> > error-gt1-soc-fatal-hbm-ss0-7           0x100000000000002c
> > error-gt1-soc-fatal-hbm-ss1-0           0x100000000000002d
> > error-gt1-soc-fatal-hbm-ss1-1           0x100000000000002e
> > error-gt1-soc-fatal-hbm-ss1-2           0x100000000000002f
> > error-gt1-soc-fatal-hbm-ss1-3           0x1000000000000030
> > error-gt1-soc-fatal-hbm-ss1-4           0x1000000000000031
> > error-gt1-soc-fatal-hbm-ss1-5           0x1000000000000032
> > error-gt1-soc-fatal-hbm-ss1-6           0x1000000000000033
> > error-gt1-soc-fatal-hbm-ss1-7           0x1000000000000034
> > error-gt1-soc-fatal-hbm-ss2-0           0x1000000000000035
> > error-gt1-soc-fatal-hbm-ss2-1           0x1000000000000036
> > error-gt1-soc-fatal-hbm-ss2-2           0x1000000000000037
> > error-gt1-soc-fatal-hbm-ss2-3           0x1000000000000038
> > error-gt1-soc-fatal-hbm-ss2-4           0x1000000000000039
> > error-gt1-soc-fatal-hbm-ss2-5           0x100000000000003a
> > error-gt1-soc-fatal-hbm-ss2-6           0x100000000000003b
> > error-gt1-soc-fatal-hbm-ss2-7           0x100000000000003c
> > error-gt1-soc-fatal-hbm-ss3-0           0x100000000000003d
> > error-gt1-soc-fatal-hbm-ss3-1           0x100000000000003e
> > error-gt1-soc-fatal-hbm-ss3-2           0x100000000000003f
> > error-gt1-soc-fatal-hbm-ss3-3           0x1000000000000040
> > error-gt1-soc-fatal-hbm-ss3-4           0x1000000000000041
> > error-gt1-soc-fatal-hbm-ss3-5           0x1000000000000042
> > error-gt1-soc-fatal-hbm-ss3-6           0x1000000000000043
> > error-gt1-soc-fatal-hbm-ss3-7           0x1000000000000044
> >
> > Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > Cc: David Airlie <airlied@xxxxxxxxx>
> > Cc: Daniel Vetter <daniel@xxxxxxxx>
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> > Cc: Oded Gabbay <ogabbay@xxxxxxxxxx>
> >
> >
> > Aravind Iddamsetty (5):
> >   drm/netlink: Add netlink infrastructure
> >   drm/xe/RAS: Register a genl netlink family
> >   drm/xe/RAS: Expose the error counters
> >   drm/netlink: define multicast groups
> >   drm/xe/RAS: send multicast event on occurrence of an error
> >
> >  drivers/gpu/drm/xe/Makefile          |   1 +
> >  drivers/gpu/drm/xe/xe_device.c       |   3 +
> >  drivers/gpu/drm/xe/xe_device_types.h |   2 +
> >  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
> >  drivers/gpu/drm/xe/xe_module.c       |   2 +
> >  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
> >  include/uapi/drm/drm_netlink.h       |  81 +++++
> >  include/uapi/drm/xe_drm.h            |  64 ++++
> >  9 files changed, 725 insertions(+)
> >  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
> >  create mode 100644 include/uapi/drm/drm_netlink.h
> >
> > --
> > 2.25.1
> >
>
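
A side note on the config-id column in the quoted listings: the gt1
entries differ from their gt0 counterparts only in the top hex digit
(0x1000000000000005 vs 0x0000000000000005). The sketch below assumes,
purely from that observation, that the GT index sits in the top 4 bits
and the error id in the remaining 60. The authoritative layout is
whatever the series defines in include/uapi/drm/xe_drm.h, which is not
shown in this mail, so GT_SHIFT and both function names are
illustrative only.

```python
from typing import Tuple

# Assumption inferred from the quoted listing, not from the uapi header:
# bits 63..60 carry the GT index, bits 59..0 carry the error id.
GT_SHIFT = 60
ERROR_ID_MASK = (1 << GT_SHIFT) - 1


def decode_config_id(config_id: int) -> Tuple[int, int]:
    """Split a 64-bit config-id into (gt_index, error_id)."""
    return config_id >> GT_SHIFT, config_id & ERROR_ID_MASK


def encode_config_id(gt_index: int, error_id: int) -> int:
    """Inverse of decode_config_id under the same assumed layout."""
    return (gt_index << GT_SHIFT) | (error_id & ERROR_ID_MASK)


if __name__ == "__main__":
    # error-gt1-correctable-eu-grf from the quoted READ_ALL output
    gt, err = decode_config_id(0x1000000000000005)
    print(f"gt={gt} error_id={err:#x}")
```

If the layout holds, a tool could take the error id from one GT's list
and build the matching --error_id for another GT without a second
LIST_ERRORS call.
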