Re: [PATCH] ata,scsi: do not issue START STOP UNIT on resume

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8/26/23 02:09, Rodrigo Vivi wrote:
>>> So, maybe we have some kind of disks/configuration out there where this
>>> start upon resume is needed? Maybe it is just a matter of timming to
>>> ensure some firmware underneath is up and back to life?
>>
>> I do not think so. Suspend will issue a start stop unit command to put the drive
>> to sleep and resume will reset the port (which should wake up the drive) and
>> then issue an IDENTIFY command (which will also wake up the drive) and other
>> read logs etc to rescan the drive.
>> In both cases, if the commands do not complete, we would see errors/timeout and
>> likely port reset/drive gone events. So I think this is likely another subtle
>> race between scsi suspend and ata suspend that is causing a deadlock.
>>
>> The main issue I think is that there is no direct ancestry between the ata port
>> (device) and scsi device, so the change to scsi async pm ops made a mess of the
>> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
>> should be first, then ata port device (parent). For resume, the reverse order is
>> needed. PM normally ensures that parent/child ordering, but we lack that
>> parent/child relationship. I am working on fixing that but it is very slow
>> progress because I have been so far enable to recreate any of the issues that
>> have been reported. I am patching "blind"...
> 
> I believe your suspicious makes sense. And on these lines, that patch you
> attached earlier would fix that. However my initial tries of that didn't
> help. I'm going to run more tests and get back to you.

Rodrigo,

I pushed the resume-v2 branch to libata tree:

git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/dlemoal/libata
(or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git)

This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and
other device shutdown issues. The first 4 patches are the main ones to fix
suspend resume. I tested that on 2 different machines with different drives and
with qemu. All seems fine.

Could you try to run this through your CI ? I am very interested in seeing if it
survives your suspend/resume tests.

If you can confirm that all issues are fixed, I will rebase this on for-next and
post.

Thanks !


-- 
Damien Le Moal
Western Digital Research




[Index of Archives]     [Linux Filesystems]     [Linux SCSI]     [Linux RAID]     [Git]     [Kernel Newbies]     [Linux Newbie]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Samba]     [Device Mapper]

  Powered by Linux