On Thu, 18 Nov 2021 14:09:13 -0700 Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > On Thu, 18 Nov 2021 20:51:41 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > > On 11/17/2021 11:23 PM, Alex Williamson wrote: > > Thanks Alex for checking this series and providing your inputs. > > > > > If we're transitioning a device to D3cold rather than D3hot as > > > requested by userspace, isn't that a user visible change? > > > > For most of the driver, in linux kernel, the D3hot vs D3cold > > state will be decided at PCI core layer. In the PCI core layer, > > pci_target_state() determines which D3 state to choose. It checks > > for platform_pci_power_manageable() and then it calls > > platform_pci_choose_state() to find the target state. > > In VM, the platform_pci_power_manageable() check will fail if the > > guest is linux OS. So, it uses, D3hot state. > > Right, but my statement is really more that the device PM registers > cannot be used to put the device into D3cold, so the write of the PM > register that we're trapping was the user/guest's intention to put the > device into D3hot. We therefore need to be careful about differences > in the resulting device state when it comes out of D3cold vs D3hot. > > > But there are few drivers which does not use the PCI framework > > generic power related routines during runtime suspend/system suspend > > and set the PCI power state directly with D3hot. > > Current vfio-pci being one of those ;) > > > Also, the guest can be non-Linux OS also and, in that case, > > it will be difficult to know the behavior. So, it may impact > > these cases. > > That's what I'm worried about. > > > > For instance, a device may report NoSoftRst- indicating that the device > > > does not do a soft reset on D3hot->D0 transition. If we're instead > > > putting the device in D3cold, then a transition back to D0 has very > > > much undergone a reset. On one hand we should at least virtualize the > > > NoSoftRst bit to allow the guest to restore the device, but I wonder if > > > that's really safe. Is a better option to prevent entering D3cold if > > > the device isn't natively reporting NoSoftRst-? > > > > > > > You mean to say NoSoftRst+ instead of NoSoftRst- as visible in > > Oops yes. The concern is if the user/guest is not expecting a soft > reset when using D3hot, but we transparently promote D3hot to D3cold > which will always implies a device reset. > > > the lspci output. For NoSoftRst- case, we do a soft reset on > > D3hot->D0 transition. But, will this case not be handled internally > > in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend, > > we check for pci_dev->state_saved flag and do pci_save_state() > > irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars() > > will be called in pci_raw_set_power_state() which will reinitialize device > > for D3hot/D3cold-> D0 case. Once the device is initialized in the host, > > then for guest, it should work without re-initializing again in the > > guest side. I am not sure, if my understanding is correct. > > The soft reset is not limited to the state that the PCI subsystem can > save and restore. Device specific state that the user/guest may > legitimately expect to be retained may be reset as well. > > [PCIe v5 5.3.1.4] > Functional context is required to be maintained by Functions in > the D3 hot state if the No_Soft_Reset field in the PMCSR is Set. > > Unfortunately I don't see a specific definition of "functional > context", but I interpret that to include device specific state. For > example, if a GPU contains specific frame buffer data and reports > NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data > to be retained on D3hot->D0 transition? > > > > We're also essentially making a policy decision on behalf of > > > userspace that favors power saving over latency. Is that > > > universally the correct trade-off? > > > > For most drivers, the D3hot vs D3cold should not be favored due > > to latency reasons. In the linux kernel side, I am seeing, the > > PCI framework try to use D3cold state if platform and device > > supports that. But its correct that covertly replacing D3hot with > > D3cold may be concern for some drivers. > > > > > I can imagine this could be desirable for many use cases, > > > but if we're going to covertly replace D3hot with D3cold, it seems > > > like there should be an opt-in. Is co-opting the PM capability for > > > this even really acceptable or should there be a device ioctl to > > > request D3cold and plumbing through QEMU such that a VM guest can > > > make informed choices regarding device power management? > > > > > > > Making IOCTL is also an option but that case, this support needs to > > be added in all hypervisors and user must pass this information > > explicitly for each device. Another option could be to use > > module parameter to explicitly enable D3cold support. If module > > parameter is not set, then we can call pci_d3cold_disable() and > > in that case, runtime PM should not use D3cold state. > > > > Also, I was checking we can pass this information though some > > virtualized register bit which will be only defined for passing > > the information between guest and host. In the guest side if the > > target state is being decided with pci_target_state(), then > > the D3cold vs D3hot should not matter for the driver running > > in the guest side and in that case, it depends upon platform support. > > We can set this virtualize bit to 1. But, if driver is either > > setting D3hot state explicitly or has called pci_d3cold_disable() or > > similar API available in the guest OS, then set this bit to 0 and > > in that case, the D3cold state can be disabled in the host side. > > But don't know if is possible to use some non PCI defined > > virtualized register bit. > > If you're suggesting a device config space register, that's troublesome > because we can't guarantee that simply because a range of config space > isn't within a capability that it doesn't have some device specific > purpose. However, we could certainly implement virtual registers in > the hypervisor that implement the ACPI support that an OS would use on > bare metal to implement D3cold. Those could trigger this ioctl through > the vfio device. > > > I am not sure what should be best option to make choice > > regarding d3cold but if we can have some option by which this > > can be done without involvement of user, then it will benefit > > for lot of cases. Currently, the D3cold is supported only in > > very few desktops/servers but in future, we will see on > > most of the platforms. > > I tend to see it as an interesting hack to promote D3hot to D3cold, and > potentially very useful. However, we're also introducing potentially > unexpected device behavior, so I think it would probably need to be an > opt-in. Possibly if the device reports NoSoftRst- we could use it by > default, but even virtualizing the NoSoftRst suggests that there's an > expectation that the guest driver has that support available. > > > > Also if the device is not responsive to config space due to the user > > > placing it in D3 now, I'd expect there are other ioctl paths that > > > need to be blocked, maybe even MMIO paths that might be a gap for > > > existing D3hot support. Thanks, > > > > I was in assumption that most of IOCTL code will be called by the > > hypervisor before guest OS boot and during that time, the device > > will be always in D0. But, if we have paths where IOCTL can be > > called when the device has been suspended by guest OS, then can we > > use runtime_get/put API’s there also ? > > It's more a matter of preventing user actions that can cause harm > rather than expecting certain operations only in specific states. We > could chose to either resume the device for those operations or fail > the operation. We should probably also leverage the memory-disable > support to fault mmap access to MMIO when the device is in D3* as well. It also occurred to me last night that a guest triggering D3hot via the PM registers must be a synchronous power state change, we can't use auto-suspend. This is necessary for nested assignment where the guest might use a D3hot->D0 power state transition with NoSoftRst- devices in order to perform a reset of the device. With auto-suspend, the guest would return the device to D0 before the physical device ever timed out to enter a D3 state. Thanks, Alex