On Mon, Dec 30, 2019 at 3:37 PM Harry Pan <harry.pan@xxxxxxxxx> wrote: > > RAPL MMIO support depends on RAPL common driver, in case a new generation > of CPU is booting but not in the RAPL support list, the processor_thermal > driver invokes CPU hotplug API to enforce RAPL common driver adding new > RAPL domain which would cause kernel crash by null pointer dereference > because the internal RAPL domain resource mapping is not initialized after > the common init. > > Add error handling to detect non initialized RAPL domain resource mapping > and return error code to the caller; such that, it avoids early crash for > new CPU and leave error messages through processor_thermal driver. > > Before: > [ 4.188566] BUG: kernel NULL pointer dereference, address: 0000000000000020 > ...snip... > [ 4.189555] RIP: 0010:rapl_add_package+0x223/0x574 > [ 4.189555] Code: b5 a0 31 c0 49 8b 4d 78 48 01 d9 48 8b 0c c1 49 89 4c c6 10 48 ff c0 48 83 f8 05 75 e7 49 83 ff 03 75 15 48 8b 05 09 bc 18 01 <8b> 70 20 41 89 b6 0c 05 00 00 85 f6 75 1a 49 81 c6 18 9 > [ 4.189555] RSP: 0000:ffffb3adc00b3d90 EFLAGS: 00010246 > [ 4.189555] RAX: 0000000000000000 RBX: 0000000000000098 RCX: 0000000000000000 > [ 4.267161] usb 1-1: New USB device found, idVendor=2109, idProduct=2812, bcdDevice= b.e0 > [ 4.189555] RDX: 0000000000001000 RSI: 0000000000000000 RDI: ffff9340caafd000 > [ 4.189555] RBP: ffffb3adc00b3df8 R08: ffffffffa0246e28 R09: ffff9340caafc000 > [ 4.189555] R10: 000000000000024a R11: ffffffff9ff1f6f2 R12: 00000000ffffffed > [ 4.189555] R13: ffff9340caa94800 R14: ffff9340caafc518 R15: 0000000000000003 > [ 4.189555] FS: 0000000000000000(0000) GS:ffff9340ce200000(0000) knlGS:0000000000000000 > [ 4.189555] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 4.189555] CR2: 0000000000000020 CR3: 0000000302c14001 CR4: 00000000003606f0 > [ 4.189555] Call Trace: > [ 4.189555] ? __switch_to_asm+0x40/0x70 > [ 4.189555] rapl_mmio_cpu_online+0x47/0x64 > [ 4.189555] ? rapl_mmio_write_raw+0x33/0x33 > [ 4.281059] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 > [ 4.189555] cpuhp_invoke_callback+0x29f/0x66f > [ 4.189555] ? __schedule+0x46d/0x6a0 > [ 4.189555] cpuhp_thread_fun+0xb9/0x11c > [ 4.189555] smpboot_thread_fn+0x17d/0x22f > [ 4.297006] usb 1-1: Product: USB2.0 Hub > [ 4.189555] ? cpu_report_death+0x43/0x43 > [ 4.189555] kthread+0x137/0x13f > [ 4.189555] ? cpu_report_death+0x43/0x43 > [ 4.189555] ? kthread_blkcg+0x2e/0x2e > [ 4.312951] usb 1-1: Manufacturer: VIA Labs, Inc. > [ 4.189555] ret_from_fork+0x1f/0x40 > [ 4.189555] Modules linked in: > [ 4.189555] CR2: 0000000000000020 > [ 4.189555] ---[ end trace 01bb812aabc791f4 ]--- > > After: > [ 0.787125] intel_rapl_common: driver does not support CPU family 6 model 166 > ...snip... > [ 4.245273] proc_thermal 0000:00:04.0: failed to add RAPL MMIO interface > > Note: > This example above is on a v5.4 branch without below two CML commits yet: > commit f84fdcbc8ec0 ("powercap/intel_rapl: add support for Cometlake desktop") > commit cae478114fbe ("powercap/intel_rapl: add support for CometLake Mobile") > > Fixes: 555c45fe0d04 ("int340X/processor_thermal_device: add support for MMIO RAPL") > > Cc: <stable@xxxxxxxxxxxxxxx> # v5.3+ > Signed-off-by: Harry Pan <harry.pan@xxxxxxxxx> Applied as a fix for 5.5-rc with rewritten subject and changelog (new subject: "powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()"). > --- > > drivers/powercap/intel_rapl_common.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c > index 318d023a6a11..aa0a8de413b1 100644 > --- a/drivers/powercap/intel_rapl_common.c > +++ b/drivers/powercap/intel_rapl_common.c > @@ -1294,6 +1294,9 @@ struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) > struct cpuinfo_x86 *c = &cpu_data(cpu); > int ret; > > + if (!rapl_defaults) > + return ERR_PTR(-ENODEV); > + > rp = kzalloc(sizeof(struct rapl_package), GFP_KERNEL); > if (!rp) > return ERR_PTR(-ENOMEM); > -- > 2.24.1 >