On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
<Mario.Limonciello@xxxxxxx> wrote:
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Karol Herbst <kherbst@xxxxxxxxxx>
> > Sent: Thursday, June 1, 2023 12:19 PM
> > To: Limonciello, Mario <Mario.Limonciello@xxxxxxx>
> > Cc: Nick Hastings <nicholaschastings@xxxxxxxxx>; Lyude Paul
> > <lyude@xxxxxxxxxx>; Lukas Wunner <lukas@xxxxxxxxx>; Salvatore
> > Bonaccorso <carnil@xxxxxxxxxx>; 1036530@xxxxxxxxxxxxxxx; Rafael J.
> > Wysocki <rafael@xxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>;
> > linux-acpi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > regressions@xxxxxxxxxxxxxxx
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > <Mario.Limonciello@xxxxxxx> wrote:
> > >
> > > [AMD Official Use Only - General]
> > >
> > > > -----Original Message-----
> > > > From: Karol Herbst <kherbst@xxxxxxxxxx>
> > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > To: Limonciello, Mario <Mario.Limonciello@xxxxxxx>
> > > > Cc: Nick Hastings <nicholaschastings@xxxxxxxxx>; Lyude Paul
> > > > <lyude@xxxxxxxxxx>; Lukas Wunner <lukas@xxxxxxxxx>; Salvatore
> > > > Bonaccorso <carnil@xxxxxxxxxx>; 1036530@xxxxxxxxxxxxxxx; Rafael J.
> > > > Wysocki <rafael@xxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>;
> > > > linux-acpi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > > regressions@xxxxxxxxxxxxxxx
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"?
> > > > (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > <mario.limonciello@xxxxxxx> wrote:
> > > > >
> > > > > +Lyude, Lukas, Karol
> > > > >
> > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > > > Hi,
> > > > > >
> > > > > > * Nick Hastings <nicholaschastings@xxxxxxxxx> [230530 16:01]:
> > > > > >> * Mario Limonciello <mario.limonciello@xxxxxxx> [230530 13:00]:
> > > > > > <snip>
> > > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > > >>> the kernel command line?
> > > > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > > > >> which is what I prefer. Checking my
> > > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > > > >> I see:
> > > > > >>
> > > > > >> blacklist nvidia
> > > > > >> blacklist nvidia-drm
> > > > > >> blacklist nvidia-modeset
> > > > > >> blacklist nvidia-uvm
> > > > > >> blacklist ipmi_msghandler
> > > > > >> blacklist ipmi_devintf
> > > > > >>
> > > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > > > >> want to use it maybe it is better to check if the lock up occurs with
> > > > > >> nouveau blacklisted. I will try that now.
> > > > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > > > % uname -a
> > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > > > >
> > > > > > It has been running without problems for nearly two days now:
> > > > > > % uptime
> > > > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Nick.
> > > > >
> > > > > Thanks, that makes a lot more sense now.
> > > > >
> > > > > Nick, Can you please test if nouveau works with runtime PM in the
> > > > > latest 6.4-rc?
> > > > >
> > > > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > > > to be backported to 6.1 LTS.
> > > > >
> > > > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > > > >
> > > > > https://gitlab.freedesktop.org/drm/nouveau/
> > > > >
> > > > >
> > > > > Lyude, Lukas, Karol
> > > > >
> > > > > This thread is in relation to this commit:
> > > > >
> > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > >
> > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > >
> > > >
> > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > workaround:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > >
> > > > And I suspect there might be one or two more IDs we'll have to add
> > > > there. Do we have any logs?
> > >
> > > There's some archived onto the distro bug. Search this page for "journalctl.log.gz"
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > >
> >
> > interesting.. It seems to be the same controller used here. I wonder
> > if the pci topology is different or if the workaround is applied at
> > all.
>
> I didn't see the message in the log about the workaround being applied
> in that log, so I guess PCI topology difference is a likely suspect.
>

yeah, but I also couldn't see a log with the usual nouveau messages, so
it's kinda weird.

Anyway, the output of `lspci -tvnn` would help

> >
> > But yeah, I'd kinda love for somebody with better knowledge on all of
> > this to figure out what exactly is going wrong, but every time this
> > gets investigated Intel says "our hardware has no bugs", the ACPI
> > folks dig for months and find nothing and I end up figuring out some
> > weirdo workaround I don't understand.
> > And apparently also nobody is
> > able to hand out docs explaining in detail how that runtime
> > suspend/resume stuff is supposed to work.
> >
> > I have a Dell XPS 9560 where the added workaround in nouveau fixed the
> > problem and I know it's fixed on a bunch of other systems. So if
> > anybody is willing to publish docs and/or actually debug it with
> > domain knowledge, please go ahead.
> >
> > > > And could anybody test if adding the
> > > > controller in play here does resolve the problem?
> > > >
> > > > > If you recall we did 24867516f06d because 5775b843a619 was
> > > > > supposed to have fixed it.
> > > > >
> > >