Hi,
On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
Hello Rafael,
-----Original Message-----
From: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
Sent: Tuesday, October 10, 2023 12:54 AM
To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
<suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani <jani.saarinen@xxxxxxxxx>
Subject: Re: Regression in linux-next
Hi,
On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
Hello Rafael
Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!
Thank you for your response. Please let us know when a fix is available.
It should be fixed in linux-next from today, by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
Thanks!
Thanks a lot for the fix. It seems to have resolved the issue on most of the machines, but we are still seeing a similar problem on a few of them.
Thanks for reporting this!
The call stack is different, but it seems to come from the same thermal subsystem. Full logs are in [1].
<4>[ 4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
<4>[ 4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
<4>[ 4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
<4>[ 4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[ 4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
<4>[ 4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
<4>[ 4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
<4>[ 4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
<4>[ 4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
<4>[ 4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
<4>[ 4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
<4>[ 4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
<4>[ 4.392084] FS: 00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
<4>[ 4.392087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
<4>[ 4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 4.392095] Call Trace:
<4>[ 4.392097] <TASK>
<4>[ 4.392100] ? __warn+0x7f/0x170
<4>[ 4.392104] ? thermal_zone_trip_id+0x61/0x70
<4>[ 4.392109] ? report_bug+0x1f8/0x200
<4>[ 4.392116] ? handle_bug+0x3c/0x70
<4>[ 4.392119] ? exc_invalid_op+0x18/0x70
<4>[ 4.392123] ? asm_exc_invalid_op+0x1a/0x20
<4>[ 4.392133] ? thermal_zone_trip_id+0x61/0x70
<4>[ 4.392137] ? thermal_zone_trip_id+0x5d/0x70
<4>[ 4.392141] trip_point_show+0x18/0x40
<4>[ 4.392145] dev_attr_show+0x15/0x60
<4>[ 4.392149] sysfs_kf_seq_show+0xb5/0x100
<4>[ 4.392154] seq_read_iter+0x111/0x450
<4>[ 4.392158] ? check_object+0x133/0x320
<4>[ 4.392164] vfs_read+0x20d/0x300
<4>[ 4.392175] ksys_read+0x64/0xe0
<4>[ 4.392180] do_syscall_64+0x3c/0x90
<4>[ 4.392183] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[ 4.392187] RIP: 0033:0x7f1f0e193392
Can you please check what could be the reason for this issue?
Well, one more unnecessary lockdep assertion was recently added to the
thermal core, sorry about that.
This commit
https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
that will be merged into linux-next tomorrow if all goes well, should
address this.
Thanks!
[1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
Regards
Chaitanya
From: Wysocki, Rafael J <rafael.j.wysocki@xxxxxxxxx>
Sent: Saturday, October 7, 2023 2:01 AM
To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
<suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani
<jani.saarinen@xxxxxxxxx>
Subject: Re: Regression in linux-next
Hi,
On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
Hello Rafael,
Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.
Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!
On next-20231003 [2], we are seeing the following error:
<4>[ 14.093075] ------------[ cut here ]------------
<4>[ 14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[ 14.106977] Modules linked in:
<4>[ 14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[ 14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[ 14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[ 14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Detailed logs can be found in [3].
After bisecting the tree, the following patch [4] seems to be causing the
regression.
commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Date: Thu Sep 21 20:02:59 2023 +0200
ACPI: thermal: Do not use trip indices for cooling device binding
Rearrange the ACPI thermal driver's callback functions used for cooling
device binding and unbinding, acpi_thermal_bind_cooling_device() and
acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
pointers instead of trip indices which is more straightforward and allows
the driver to become independent of the ordering of trips in the thermal
zone structure.
The general functionality is not expected to be changed.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Reviewed-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
We also verified by moving the head of the tree to the previous commit.
Could you please check why this patch causes the regression and if we can
find a solution for it soon?
[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb