Hello Rafael,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 11, 2023 10:19 PM
> To: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> <Suresh.Kumar.Kurmi@xxxxxxxxx>; Saarinen, Jani <jani.saarinen@xxxxxxxxx>
> Subject: RE: Regression in linux-next
>
> Hello Rafael,
>
> > -----Original Message-----
> > From: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
> > Sent: Wednesday, October 11, 2023 9:44 PM
> > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
> > Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> > <suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani
> > <jani.saarinen@xxxxxxxxx>
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> >
> > On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > > Hello Rafael,
> > >
> > >> -----Original Message-----
> > >> From: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
> > >> Sent: Tuesday, October 10, 2023 12:54 AM
> > >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
> > >> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> > >> <suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani
> > >> <jani.saarinen@xxxxxxxxx>
> > >> Subject: Re: Regression in linux-next
> > >>
> > >> Hi,
> > >>
> > >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael
> > >>>
> > >>>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>>> If that is correct, it should be straightforward to fix.
> > >>>> I'll take care of this early next week.
> > >>>> Thanks!
> > >>> Thank you for your response. Please let us know when a fix is available.
> > >> It should be fixed in linux-next from today, by this commit:
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> > >>
> > >> Thanks!
> > > Thanks a lot for the fix.
> > > This seems to have fixed the issue in most of the machines but we are
> > > still seeing a similar problem in few of the machines.
> >
> > Thanks for reporting this!
> >
> >
> > > This has a different call stack but seems to be from the same
> > > thermal subsystem. Full logs in [1]
> > >
> > > <4>[ 4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > > <4>[ 4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > > <4>[ 4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > > <4>[ 4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > > <4>[ 4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > > <4>[ 4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > > <4>[ 4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> > > <4>[ 4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> > > <4>[ 4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> > > <4>[ 4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> > > <4>[ 4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> > > <4>[ 4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> > > <4>[ 4.392084] FS: 00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> > > <4>[ 4.392087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[ 4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> > > <4>[ 4.392091]
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > <4>[ 4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > <4>[ 4.392095] Call Trace:
> > > <4>[ 4.392097] <TASK>
> > > <4>[ 4.392100] ? __warn+0x7f/0x170
> > > <4>[ 4.392104] ? thermal_zone_trip_id+0x61/0x70
> > > <4>[ 4.392109] ? report_bug+0x1f8/0x200
> > > <4>[ 4.392116] ? handle_bug+0x3c/0x70
> > > <4>[ 4.392119] ? exc_invalid_op+0x18/0x70
> > > <4>[ 4.392123] ? asm_exc_invalid_op+0x1a/0x20
> > > <4>[ 4.392133] ? thermal_zone_trip_id+0x61/0x70
> > > <4>[ 4.392137] ? thermal_zone_trip_id+0x5d/0x70
> > > <4>[ 4.392141] trip_point_show+0x18/0x40
> > > <4>[ 4.392145] dev_attr_show+0x15/0x60
> > > <4>[ 4.392149] sysfs_kf_seq_show+0xb5/0x100
> > > <4>[ 4.392154] seq_read_iter+0x111/0x450
> > > <4>[ 4.392158] ? check_object+0x133/0x320
> > > <4>[ 4.392164] vfs_read+0x20d/0x300
> > > <4>[ 4.392175] ksys_read+0x64/0xe0
> > > <4>[ 4.392180] do_syscall_64+0x3c/0x90
> > > <4>[ 4.392183] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > <4>[ 4.392187] RIP: 0033:0x7f1f0e193392
> > >
> > > Can you please check what could be the reason for this issue?
> >
> > Well, one more unuseful lockdep assertion has been added recently to
> > the thermal core, sorry about that.
> >
> > This commit
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> >
> > that will be merged into linux-next tomorrow if all goes well, should
> > address this.
>
> Thank you for the fix. We will wait for it to get merged in linux-next.
>

Happy to let you know that we did not see these issues in the latest linux-next run.
Thanks a lot for your quick resolutions.

Regards

Chaitanya

> Regards
>
> Chaitanya
>
> >
> > Thanks!
> >
> > > [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
> > >
> > > Regards
> > >
> > > Chaitanya
> > >
> > >>
> > >>> From: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
> > >>> Sent: Saturday, October 7, 2023 2:01 AM
> > >>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
> > >>> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
> > >>> <suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani
> > >>> <jani.saarinen@xxxxxxxxx>
> > >>> Subject: Re: Regression in linux-next
> > >>>
> > >>> Hi,
> > >>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael,
> > >>>
> > >>> Hope you are doing well. I am Chaitanya from the linux graphics
> > >>> team in Intel.
> > >>> This mail is regarding a regression we are seeing in our CI
> > >>> runs[1] on linux-next repository.
> > >>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>> If that is correct, it should be straightforward to fix.
> > >>> I'll take care of this early next week.
> > >>> Thanks!
> > >>>
> > >>> On next-20231003 [2], we are seeing the following error
> > >>>
> > >>> ```````````````````````````````````````````````````````````````````````````````
> > >>> <4>[ 14.093075] ------------[ cut here ]------------
> > >>> <4>[ 14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> > >>> <4>[ 14.106977] Modules linked in:
> > >>> <4>[ 14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> > >>> <4>[ 14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> > >>> <4>[ 14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> > >>> <4>[ 14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> > >>>
> > >>> Details log can be found in [3].
> > >>>
> > >>> After bisecting the tree, the following patch [4] seems to be
> > >>> causing the regression.
> > >>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> > >>> Author: Rafael J. Wysocki mailto:rafael.j.wysocki@xxxxxxxxx
> > >>> Date: Thu Sep 21 20:02:59 2023 +0200
> > >>>
> > >>> ACPI: thermal: Do not use trip indices for cooling device binding
> > >>>
> > >>> Rearrange the ACPI thermal driver's callback functions used for cooling
> > >>> device binding and unbinding, acpi_thermal_bind_cooling_device() and
> > >>> acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
> > >>> pointers instead of trip indices which is more straightforward and allows
> > >>> the driver to become independent of the ordering of trips in the thermal
> > >>> zone structure.
> > >>>
> > >>> The general functionality is not expected to be changed.
> > >>>
> > >>> Signed-off-by: Rafael J. Wysocki mailto:rafael.j.wysocki@xxxxxxxxx
> > >>> Reviewed-by: Daniel Lezcano mailto:daniel.lezcano@xxxxxxxxxx
> > >>>
> > >>> We also verified by moving the head of the tree to the previous commit.
> > >>>
> > >>> Could you please check why this patch causes the regression and if
> > >>> we can find a solution for it soon?
> > >>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> > >>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> > >>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> > >>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb