Re: imx6 sata cdrom driver issue

Fabio Estevam <festevam@xxxxxxxxx> · Thu, 28 May 2015 16:16:49 -0300

On Tue, May 19, 2015 at 4:18 PM, Kevin Groeneveld
<KGroeneveld@xxxxxxxxxxxx> wrote:
> -----Original Message-----
> From: Jonathan Bagg
> Sent: May-19-15 3:14 PM
>
>>Adding my co-worker Kevin who has done a few experiments in the kernel
>
> I am one of Jonathan's co-workers who has been helping to diagnose and
> fix the SATA issue we are experiencing.  Most of my testing has been
> with an optical drive connected directly to the Freescale Sabre board.
>
> Using the debug patch I wrote that Jon posted earlier I can see that the
> response to the ATAPI 0x4A command (Get Event Status Notification)
> during initial boot is always random. For example:
>
> [    2.237783] CDB rbuf: [4a 01 00 00 10 00 00 00 08]
> [    2.237789] CDB rbuf: 00000000: 7e 31 f7 fd ff f5 7d fd
> ~1....}.
>
> Running the same test on an iMX53 EVK board the response is always:
>
> [    2.452303] CDB rbuf: [4a 01 00 00 10 00 00 00 08]
> [    2.452308] CDB rbuf: 00000000: 00 06 04 56 02 02 00 00
> ...V....
>
> We have rented a Lecroy STX-231 SATA protocol analyzer. Using this tool
> I have verified that the response from the optical drive is always
> correct in this case but for some reason the data is corrupt by the time
> the debug code I added to atapi_qc_complete dumps the response.
>
> I have added debug code to dump the AHCI command header and table before
> issuing a command. I have not seen anything out of the ordinary here. I
> also added debug code to dump the received FIS on an interrupt. I
> haven't studied the received FIS extensively but I did not see anything
> out of the ordinary here either.
>
> One thing I suspected and confirmed with the Lecroy analyzer is that the
> response to the previous command during boot (ATAPI command 0x5A - Mode
> Sense) is always 64 bytes even though the kernel is requesting 128
> bytes. This exact behaviour is probably drive dependant.  On a hunch I
> modified the get_capabilities function in drivers/scsi/sr.c to only
> request 64 bytes. With that change the response to the Get Event Status
> Notification command during boot is always valid.  Although running
> cdparanoia or trying to keep using the system in other ways still
> results in issues.
>
> Using the Lecroy analyzer I found some other commands where cdparanoia
> requests lengths longer than the drive responds with.  Comparing the
> debug dumps in the kernel to the data captured by the analyzer I can see
> problems always start after this happens.
>
> To try and characterize this further I started messing around with the
> request lengths.  I noticed in the AHCI command header dumps I added
> that many of the ATAPI commands have a second SG DMA buffer. I found the
> second buffer is added when atapi_drain_needed returns true in
> libata-scsi.c.  I started hacking the DMA buffer sizes by manipulating
> the lengths written to the AHCI command table in the ahci_fill_sg
> function in libata.c.  I also hacked atap_drain_needed to sometimes
> return false to force only one DMA buffer for ATAPI commands.
>
> What I found is that if there are two DMA buffers and the response
> length is less than the first buffer size then the response to the next
> command is always corrupted.  Every other scenario I tried worked okay.
> Filling the first buffer and under, over or exactly using the second
> buffer does not trigger the issue.  If there is only one buffer under,
> over or exactly using the one buffer does not trigger the issue.
>
> One other thing I noticed is that if the response under runs the first
> buffer and the second ATAPI drain buffer is smaller than the next
> command response then the response to the next command is missing the
> begging of the response equal to the length of the second buffer from
> the previous command. I decided to add some debug code to print out the
> DMA buffer used for the previous command. What I found is that the
> missing part of the response is written to the DMA buffer from the
> previous command.  This is very bad as that memory could be dynamically
> allocated and subsequently freed and now in use by something else!
>
> Since overruns never seem to cause an issue in this setup I decided to
> simply change the atap_drain_needed function to always return false.
> With this change I can run cdparanoia successfully and rip audio tracks.
> I can also add a port multiplier back to the setup and connect both an
> HDD and the optical drive and experience no readily apparent issues.
>
> So it seems I have found a workaround for the most severe symptoms we
> were experiencing. But this still leaves me with several questions and
> concerns:
>
> 1. Is the root cause some obscure bug in the kernel I have not found
> yet? Or is it a hardware bug in the iMX6 AHCI SATA host? Or maybe
> something else silly we have been doing?
>
> 2. When cdparanoia is ripping the actual data it uses large reads which
> use multiple SG DMA buffers. What happens if the response is short in
> this case? My atapi_drain_needed hack won't help in this case.
> cdparanoia could probably be modified to only do smaller reads that fit
> in one page to avoid the issue.
>
> 3. Can the issue happen with non ATAPI commands?  If for any reason an
> ATA command to a HDD has a short response could all future commands (and
> possibly even system memory) be corrupted similar to what I saw with
> ATAPI commands? I wonder if the kernel could be easily modified to limit
> the size of ATA commands to avoid this scenario.
>
> 4. The Lecroy analyzer can also emulate a device. Using this mode I
> tried to trigger the issue with ATA commands. I couldn't figure out how
> to trigger the exact scenario I wanted.  I was able to cause the
> response to be too short, just not short enough to underrun anything but
> the last DMA buffer which doesn't seem to cause a problem.  What I did
> notice though was that this didn't seem to trigger an error anywhere in
> the kernel or user space, the data would just simply be corrupted. I
> would expect the system should flag the response as being invalid and an
> IO error propagated all the way to user space.

Just tested this and I could reproduce this bug on a mx6 board running
kernel 4.0.4.

Adding some more folks in case anyone has some ideas about this DMA
buffer corruption.

Regards,

Fabio Estevam
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html