Re: Regression on linux-next (next-20241120) and drm-tip

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Dec 03, 2024 at 05:24:27PM +0000, Furquim, Ulisses wrote:
+Rodrigo
________________________________
From: Intel-gfx <intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Saarinen, Jani <jani.saarinen@xxxxxxxxx>
Sent: Tuesday, December 3, 2024 8:22 AM
To: Thomas Weißschuh <linux@xxxxxxxxxxxxxx>; Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>; Krogerus, Heikki <heikki.krogerus@xxxxxxxxx>
Cc: Rafael J. Wysocki <rafael@xxxxxxxxxx>; Kurmi, Suresh Kumar <suresh.kumar.kurmi@xxxxxxxxx>; Coelho, Luciano <luciano.coelho@xxxxxxxxx>; Nikula, Jani <jani.nikula@xxxxxxxxx>; De Marchi, Lucas <lucas.demarchi@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx <intel-gfx@xxxxxxxxxxxxxxxxxxxxx>; intel-xe@xxxxxxxxxxxxxxxxxxxxx <intel-xe@xxxxxxxxxxxxxxxxxxxxx>; linux-pm@xxxxxxxxxxxxxxx <linux-pm@xxxxxxxxxxxxxxx>; Sebastian Reichel <sebastian.reichel@xxxxxxxxxxxxx>
Subject: RE: Regression on linux-next (next-20241120) and drm-tip

+@Krogerus, Heikki

-----Original Message-----
From: Thomas Weißschuh <linux@xxxxxxxxxxxxxx>
Sent: Tuesday, 3 December 2024 18.08
To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>
Cc: Rafael J. Wysocki <rafael@xxxxxxxxxx>; Kurmi, Suresh Kumar
<suresh.kumar.kurmi@xxxxxxxxx>; Coelho, Luciano <luciano.coelho@xxxxxxxxx>;
Saarinen, Jani <jani.saarinen@xxxxxxxxx>; Nikula, Jani <jani.nikula@xxxxxxxxx>;
De Marchi, Lucas <lucas.demarchi@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx;
intel-xe@xxxxxxxxxxxxxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx; Sebastian Reichel
<sebastian.reichel@xxxxxxxxxxxxx>
Subject: Re: Regression on linux-next (next-20241120) and drm-tip

On 2024-12-03 16:57:21+0100, Thomas Weißschuh wrote:
> On 2024-12-03 15:42:23+0000, Borah, Chaitanya Kumar wrote:
> >
> >
> > > -----Original Message-----
> > > From: Thomas Weißschuh <linux@xxxxxxxxxxxxxx>
> > > Sent: Tuesday, December 3, 2024 8:20 PM
> > > To: Rafael J. Wysocki <rafael@xxxxxxxxxx>
> > > Cc: Borah, Chaitanya Kumar <chaitanya.kumar.borah@xxxxxxxxx>;
> > > Kurmi, Suresh Kumar <suresh.kumar.kurmi@xxxxxxxxx>; Coelho,
> > > Luciano <luciano.coelho@xxxxxxxxx>; Saarinen, Jani
> > > <jani.saarinen@xxxxxxxxx>; Nikula, Jani <jani.nikula@xxxxxxxxx>;
> > > De Marchi, Lucas <lucas.demarchi@xxxxxxxxx>;
> > > intel-gfx@xxxxxxxxxxxxxxxxxxxxx; intel- xe@xxxxxxxxxxxxxxxxxxxxx;
> > > linux-pm@xxxxxxxxxxxxxxx; Sebastian Reichel
> > > <sebastian.reichel@xxxxxxxxxxxxx>
> > > Subject: Re: Regression on linux-next (next-20241120) and drm-tip
> > >
> > > On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote:
> > > > On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh
> > > <linux@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote:
> > > > > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh
> > > <linux@xxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > (+Cc Sebastian)
> > > > > > >
> > > > > > > Hi Chaitanya,
> > > > > > >
> > > > > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote:
> > > > > > > > Hope you are doing well. I am Chaitanya from the linux
> > > > > > > > graphics team
> > > in Intel.
> > > > > > > >
> > > > > > > > This mail is regarding a regression we are seeing in our
> > > > > > > > CI runs[1] on
> > > linux-next repository.
> > > > > > >
> > > > > > > Thanks for the report.
> > > > > > >
> > > > > > > > Since the version next-20241120 [2], we are seeing the
> > > > > > > > following regression
> > > > > > > >
> > > > > > > > `````````````````````````````````````````````````````````````````````````````````
> > > > > > > > <4>[   19.990743] Oops: general protection fault, probably for non-
> > > canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > > <4>[   19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus-
> > > node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+
> > > #1
> > > > > > > > <4>[   19.990771] Hardware name: Intel Corporation Arrow Lake
> > > Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS
> > > MTLSFWI1.R00.4400.D85.2410100007 10/10/2024
> > > > > > > > <4>[   19.990782] RIP:
0010:power_supply_get_property+0x3e/0xe0
> > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > `````` ``````````````````` Details log can be found in
> > > > > > > > [3].
> > > > > > > >
> > > > > > > > After bisecting the tree, the following patch [4] seems
> > > > > > > > to be the first
> > > "bad"
> > > > > > > > commit
> > > > > > > >
> > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > `````` ```````````````````````````````````````````
> > > > > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2
> > > > > > > > Author:     Thomas Weißschuh <mailto:linux@xxxxxxxxxxxxxx>
> > > > > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200
> > > > > > > > Commit:     Sebastian Reichel
> > > <mailto:sebastian.reichel@xxxxxxxxxxxxx>
> > > > > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200
> > > > > > > >     power: supply: core: add wakeup source inhibit by
> > > > > > > > power_supply_config
> > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > `````` ```````````````````````````````````````````
> > > > > > > >
> > > > > > > > This is now seen in our drm-tip runs as well. [5]
> > > > > > > >
> > > > > > > > Could you please check why the patch causes this
> > > > > > > > regression and
> > > provide a fix if necessary?
> > > > > > >
> > > > > > > I don't see how this patch can lead to this error.
> > > > > >
> > > > > > It looks like the cfg->no_wakeup_source access reaches
> > > > > > beyond the struct boundary for some reason.
> > > > >
> > > > > But the access to this field is only done in __power_supply_register().
> > > > > The error reports however don't show this function at all,
> > > > > they come from power_supply_uevent() and
> > > > > power_supply_get_property() by which time the call to
__power_supply_register() is long over.
> > > > >
> > > > > FWIW there is an uninitialized 'struct power_supply_config' in
> > > > > drivers/hid/hid-corsair-void.c. But I highly doubt the test
> > > > > machines are using that. (I'll send a patch later for it)
> > > >
> > > > So the only way I can think about in which the commit in
> > > > question may lead to the reported issues is that changing the
> > > > size of struct power_supply_config or its alignment makes an
> > > > unexpected functional difference somewhere.
> > >
> > > Indeed. I'd really like to see this issue reproduced with KASAN.
> > >
> > > > AFAICS, this commit cannot be reverted by itself, so which
> > > > commits on top of it need to be reverted in order to revert it cleanly?
> > >
> > > All the other patches from this series:
> > > https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-sourc
> > > e-v1-
> > > 0-1d62bf9bcb1d@xxxxxxxxxxxxxx/
> > >
> > > Could you point me to the full boot log in the drm-tip CI?
> >
> > Here is the log for drm-tip CI
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_8136/bat-arls-5/boot0.t
> > xt
>
> Thanks!
>
> > I carried out another bisect and it points to the following commit
> >
> > commit 226ff2e681d006eada59a9693aa1976d4c15a7d4
> > Author: Heikki Krogerus <heikki.krogerus@xxxxxxxxxxxxxxx>
> > Date:   Wed Nov 6 17:06:05 2024 +0200
> >
> >     usb: typec: ucsi: Convert connector specific commands to bitmaps
> >
> >     That allows the fields in those command data structures to
> >     be easily validated. If an unsupported field is accessed, a
> >     warning is generated.
>
> Suspicous: The bitmaps introduced in this commit are right before the
> psy and psy_desc members of 'struct ucsi_connector'.
> So any out-of-bounds writes into these members would corrupt those
> fields.
> A corrupted power_supply_desc would fit both reported stacktraces.

struct ucsi_connector {
       ...

        struct typec_capability typec_cap;

       /* Cached command responses. */
       DECLARE_BITMAP(cap, UCSI_GET_CONNECTOR_CAPABILITY_SIZE);
       DECLARE_BITMAP(status, UCSI_GET_CONNECTOR_STATUS_SIZE);

DECLARE_BITMAP() takes the size in number of *bits*

        struct power_supply *psy;
        struct power_supply_desc psy_desc;
        u32 rdo;

       ...
}

static int ucsi_get_connector_status(struct ucsi_connector *con, bool conn_ack) {
       u64 command = UCSI_GET_CONNECTOR_STATUS |
UCSI_CONNECTOR_NUMBER(con->num);
       size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE,
UCSI_MAX_DATA_LENGTH(con->ucsi));
       int ret;

       ret = ucsi_send_command_common(con->ucsi, command, &con->status,
size, conn_ack);

ucsi_send_command_common() takes the size in number of *bytes*.
This call corrupts psy and psy_desc in con.

       return ret < 0 ? ret : 0;
}

>
> > Reverting it seems to help locally. However, to confirm I have sent out a patch
to our "try-bot"


Yeah, I can confirm it fixes it for me too - I reverted both
"usb: typec: ucsi: Convert connector specific commands to bitmaps" "usb: typec: ucsi: Fix a missing bits to bytes conversion in ucsi_init()"

Looking at the patch and what the second commit is fixing it looks
suspicious.... adding some additional warnings shows e.g.:
ucsi_get_connector_status(). It's passing con->status as arg, but
size can be up to 152.

Considering UCSI_GET_CONNECTOR_STATUS_SIZE is used for DECLARE_BITMAP,
that's also bit-sized.  This fixes it for me, although it could also use
the BITS_TO_BYTES:

-----------8<-------------
diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
index c435c0835744a..69caae84879e9 100644
--- a/drivers/usb/typec/ucsi/ucsi.c
+++ b/drivers/usb/typec/ucsi/ucsi.c
@@ -651,7 +651,8 @@ static void ucsi_unregister_altmodes(struct ucsi_connector *con, u8 recipient)
 static int ucsi_get_connector_status(struct ucsi_connector *con, bool conn_ack)
 {
 	u64 command = UCSI_GET_CONNECTOR_STATUS | UCSI_CONNECTOR_NUMBER(con->num);
-	size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE, UCSI_MAX_DATA_LENGTH(con->ucsi));
+	size_t size = min(sizeof(con->status),
+			  UCSI_MAX_DATA_LENGTH(con->ucsi));
 	int ret;
ret = ucsi_send_command_common(con->ucsi, command, &con->status, size, conn_ack);
-----------8<-------------

Let me send that as a patch to the right mailing list.

Lucas De Marchi



[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux