Hi On Mon, 2023-02-27 09:16:21 +0000, Uwe Kleine-König wrote: > > Hello, > > On Sun, Feb 26, 2023 at 06:48:30PM -0800, Munehisa Kamata wrote: > > On Sun, 2023-02-26 09:17:52 +0000, Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx> wrote: > > > On Sat, Feb 25, 2023 at 05:37:21PM -0800, Munehisa Kamata wrote: > > > > Zero-initialize the on-stack structure to avoid unexpected behaviors. Some > > > > drivers may not set or initialize all the values in pwm_state through their > > > > .get_state() callback and therefore some random values may remain there and > > > > be set into pwm->state eventually. > > > > > > > > This actually caused regression on ODROID-N2+ as reported in [1]; kernel > > > > fails to boot due to random panic or hang-up. > > > > > > > > [1] https://forum.odroid.com/viewtopic.php?f=177&t=46360 > > > > > > Looking through the report I wonder what actually made the machine fail > > > to boot. Doesn't this paper over a problem that should be fixed (also) > > > somewhere else? > > > > Yes, you're right and I think the commit message could have described more > > details. This patch is for ensuring all drivers see zeroed state same as > > before, but I still don't fully understand how it ends up such random-ish > > crashes. There could be another or bigger problem that should be fixed. > > > > > Which driver is the one that the problem occur for? > > > > It's pwm-meson and seems to be caused by random polarity value somehow. If > > meson_pwm_get_state() sets polarity to zero instead, I don't see the > > problem. According the comment, looks like the driver does not set polarity > > by design. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pwm/pwm-meson.c?h=v6.2&id=c9c3395d5e3dcc6daee66c6908354d47bf98cb0c#n9 > > > > Before commit c73a3107624d, the memory was kcalloc'ed and always zeroed, > > but I don't know if the driver was (is) assuming that. I'm adding Meson SoC > > people to CC. > > > > Apart from how polarity should be handled in the driver, I'm very puzzled > > by the crashes I've observed so far. There seems to be some patterns, but > > they don't seem obviously related to PWM. > > > > [ 1.360542] soc soc0: Amlogic Meson G12B (S922X) Revision 29:c (40:2) Detected > > [ 1.363906] Insufficient stack space to handle exception! > > [ 1.363913] ESR: 0x0000000096000047 -- DABT (current EL) > > [ 1.363917] FAR: 0xffff800009a47ff0 > > [ 1.363920] Task stack: [0xffff800009a48000..0xffff800009a4c000] > > [ 1.363923] IRQ stack: [0xffff8000099a8000..0xffff8000099ac000] > > [ 1.363927] Overflow stack: [0xffff000077b76100..0xffff000077b77100] > > [ 1.363931] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.2.0-odroid-arm64 #47 > > [ 1.363938] Hardware name: Hardkernel ODROID-N2Plus (DT) > > [ 1.363941] pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > [ 1.363947] pc : __do_kernel_fault+0x4/0x180 > > [ 1.363961] lr : do_page_fault+0xd0/0x3d0 > > [ 1.363968] sp : ffff800009a48020 > > [ 1.363970] x29: ffff800009a48020 x28: ffff000002948000 x27: 0000000000000000 > > [ 1.363980] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 > > [ 1.363987] x23: 0000000096000004 x22: ffff800009a48110 x21: 00000000ffffffff > > [ 1.363994] x20: ffff800009a48110 x19: 00000000ffffffff x18: 000000000000001c > > [ 1.364001] x17: 00000000863047f6 x16: ffff800009860a80 x15: 0000000000000003 > > [ 1.364008] x14: 00000000000003ef x13: 0000000000000000 x12: 0000000000000279 > > [ 1.364015] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400 > > [ 1.364021] x8 : ffff000077b7fc40 x7 : ffff000077b7fbc0 x6 : ffff8000094f7990 > > [ 1.364028] x5 : 0000000000000000 x4 : ffff000002948000 x3 : 0001000000000000 > > [ 1.364035] x2 : ffff800009a48110 x1 : 0000000096000004 x0 : 00000000ffffffff > > [ 1.364043] Kernel panic - not syncing: kernel stack overflow > > [ 1.364047] SMP: stopping secondary CPUs > > > > Another example: > > > > [ 1.360997] soc soc0: Amlogic Meson G12B (S922X) Revision 29:c (40:2) Detected > > [ 1.364333] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 > > [ 1.364435] Unable to handle kernel paging request at virtual address 0000000008003388 > > [ 1.367470] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP > > [ 1.367483] Modules linked in: > > [ 1.367490] CPU: 2 PID: 66 Comm: kworker/u12:1 Not tainted 6.2.0-odroid-arm64 #47 > > [ 1.367498] Hardware name: Hardkernel ODROID-N2Plus (DT) > > [ 1.367502] Workqueue: events_unbound async_run_entry_fn > > [ 1.367516] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > [ 1.367523] pc : arch_timer_handler_phys+0x0/0x44 > > [ 1.367535] lr : handle_percpu_devid_irq+0x84/0x140 > > [ 1.367543] sp : ffff8000099abf70 > > [ 1.367546] x29: ffff8000099abf70 x28: ffff000002a40000 x27: 0000000000000006 > > [ 1.367556] x26: ffff800009d208a4 x25: ffff800009d20b44 x24: 0000000000000008 > > [ 1.367565] x23: ffff8000094f8b50 x22: 000000000000000b x21: ffff000002829000 > > [ 1.367573] x20: ffff800008f89fb0 x19: ffff00000282a600 x18: ffff0000021792e8 > > [ 1.367581] x17: ffff80006e685000 x16: ffff8000099ac000 x15: 0000000000004000 > > [ 1.367589] x14: ffff800009d27bde x13: ffff800009d2887b x12: 0000000000000308 > > [ 1.367597] x11: 0000000000000a92 x10: 0000000000000068 x9 : ef01a5948f440b31 > > [ 1.367606] x8 : 0000000000003273 x7 : ffff800009d262a8 x6 : 000000003754d375 > > [ 1.367614] x5 : ffff80006e685000 x4 : ffff8000099abf70 x3 : ffff80006e685000 > > [ 1.367622] x2 : ffff800008c26c30 x1 : ffff000077b83a00 x0 : 000000000000000b > > [ 1.367630] Call trace: > > [ 1.367633] arch_timer_handler_phys+0x0/0x44 > > [ 1.367641] generic_handle_domain_irq+0x2c/0x44 > > [ 1.367650] gic_handle_irq+0x44/0xc4 > > [ 1.367659] call_on_irq_stack+0x2c/0x5c > > [ 1.367666] do_interrupt_handler+0x80/0x84 > > [ 1.367672] el1_interrupt+0x34/0x70 > > [ 1.367682] el1h_64_irq_handler+0x18/0x2c > > [ 1.367686] el1h_64_irq+0x64/0x68 > > [ 1.367690] HUF_decompress4X1_usingDTable_internal_default+0x2fc/0xd60 > > [ 1.367702] HUF_decompress4X_hufOnly_wksp_bmi2+0xec/0x140 > > [ 1.367711] ZSTD_decodeLiteralsBlock+0x580/0x630 > > [ 1.367717] ZSTD_decompressBlock_internal.part.0+0x5c/0x1b4 > > [ 1.367723] ZSTD_decompressBlock_internal+0x1c/0x30 > > [ 1.367729] ZSTD_decompressContinue.part.0+0x364/0x444 > > [ 1.367734] ZSTD_decompressContinueStream+0x98/0x180 > > [ 1.367738] ZSTD_decompressStream+0x5b0/0x8c0 > > [ 1.367743] zstd_decompress_stream+0x10/0x20 > > [ 1.367751] unzstd+0x290/0x37c > > [ 1.367760] unpack_to_rootfs+0x174/0x298 > > [ 1.367767] do_populate_rootfs+0x84/0x1dc > > [ 1.367773] async_run_entry_fn+0x34/0x150 > > [ 1.367778] process_one_work+0x1d0/0x320 > > [ 1.367785] worker_thread+0x14c/0x444 > > > > Perhaps, the driver or the PWM core could do something wrong based on the > > invalid polarity and corrupt certain memory location, but I'm still not > > able to relate the random value to these crashes. > > I can imagine that the compiler creates code (a jump table) that relies > on .polarity being either PWM_POLARITY_NORMAL or PWM_POLARITY_INVERSED > and that exibits UB if that isn't true. (I think the compiler is allowed > to do that.) > > Looking at the driver, there are several problems: > > - The driver does just > > if (state->polarity == PWM_POLARITY_INVERSED) > duty = period - duty; > > which is broken because each period is supposed to start with the > active part. (You could argue this being ridiculous and I'd agree. > But that's what we have and just doing it differently in a driver is > wrong.) The fix is to check if the hardware actually emits normal or > inversed polarity and refuse the other one. Then in .get_state() set > the appropriate polarity. Looking at the comment in the source, I'm not sure if such checking is possible with this hardware. >From pwm-meson.c: * The hardware has no "polarity" setting. This driver reverses the period * cycles (the low length is inverted with the high length) for * PWM_POLARITY_INVERSED. This means that .get_state cannot read the polarity * from the hardware. As far as I see, there seems to be some drivers that hard-code polarity in .get_state() without getting it from hardware. Perhaps, we could do that instead? Honestly, I don't understand how it's a bad idea here. > - In meson_pwm_calc() we have: > > unsigned int duty, ...; > > duty = state->duty_cycle; > > which is wrong for state->duty_cycle > U32_MAX. The same for period. > > - After period is fixed to be proper u64, fin_freq * (u64)period might > overflow. > > - harmless: The check > > duty_cnt = div64_u64(fin_freq * (u64)duty, > NSEC_PER_SEC * (pre_div + 1)); > if (duty_cnt > 0xffff) > ... > > never triggers, as duty <= period (after fixing the first issue) and > we already know that > > div64_u64(fin_freq * (u64)period, NSEC_PER_SEC * (pre_div + 1)) <= 0xffff. Ah, I believe this requires a separate fix. Regards, Munehisa