Re: SCSI error indicating misalignment on part of Linux scsi or block layer?

Damien Le Moal <dlemoal@xxxxxxxxxx> · Wed, 17 Jul 2024 08:07:18 +0900

On 7/17/24 04:55, David Howells wrote:
> Hi James,
> 
> I'm wondering if I'm seeing a problem with DIO writes through Ext4 or XFS
> manifesting as SCSI misalignment errors.  This has occurred with two different
> drives.  I saw it first with v6.10-rc6, I think, but I haven't tried
> cachefiles for a while.  It does happen with v6.10.
> 
> ata1.00: exception Emask 0x60 SAct 0x1 SErr 0x800 action 0x6 frozen
> ata1.00: irq_stat 0x20000000, host bus error

A host bus error is a serious error...
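
For what it is worth, the Emask value can be decoded by hand. A minimal sketch,
assuming the AC_ERR_* bit assignments in include/linux/libata.h are as listed
below (the table is retyped from memory rather than copied from the tree, and
decode_emask is only an illustrative helper, not a kernel function):

```python
# Assumed bit layout of libata's err_mask (AC_ERR_* in include/linux/libata.h).
AC_ERR_BITS = {
    0: "AC_ERR_DEV",         # device reported error
    1: "AC_ERR_HSM",         # host state machine violation
    2: "AC_ERR_TIMEOUT",     # timeout
    3: "AC_ERR_MEDIA",       # media error
    4: "AC_ERR_ATA_BUS",     # ATA bus error
    5: "AC_ERR_HOST_BUS",    # host bus error
    6: "AC_ERR_SYSTEM",      # system error
    7: "AC_ERR_INVALID",     # invalid argument
    8: "AC_ERR_OTHER",       # unknown
    9: "AC_ERR_NODEV_HINT",  # polling device detection hint
    10: "AC_ERR_NCQ",        # marker for offending NCQ command
}


def decode_emask(emask):
    """Return the names of all bits set in an 'Emask 0x..' value."""
    return [name for bit, name in sorted(AC_ERR_BITS.items())
            if emask & (1 << bit)]


print(decode_emask(0x60))
```

Under that assumption, 0x60 has the host bus and system error bits set and no
device error bit, which fits an error raised on the host side rather than one
reported by the drive.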

> ata1: SError: { HostInt }
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/68:00:b0:93:34/00:00:02:00:00/40 tag 0 ncq dma 53248 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x60 (host bus error)
> ata1.00: status: { DRDY }
> ata1: hard resetting link
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

That is a very low link speed... Is this old hardware?
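
The SStatus and SControl values printed above are hex and can be split into
their SATA fields. A small sketch, assuming the usual SStatus layout (DET in
bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that SControl uses the same
bit positions; decode_sata_reg is an illustrative name:

```python
# Negotiated speed encodings for the SPD field of SStatus.
SPEEDS = {
    0: "no negotiated speed",
    1: "Gen1 (1.5 Gbps)",
    2: "Gen2 (3.0 Gbps)",
    3: "Gen3 (6.0 Gbps)",
}


def decode_sata_reg(value):
    """Split an SStatus/SControl value into its DET, SPD and IPM nibbles."""
    return {
        "det": value & 0xF,         # device detection
        "spd": (value >> 4) & 0xF,  # current speed (SStatus) or limit (SControl)
        "ipm": (value >> 8) & 0xF,  # interface power management
    }


sstatus = decode_sata_reg(0x113)
scontrol = decode_sata_reg(0x300)
print(sstatus, SPEEDS.get(sstatus["spd"], "reserved"))
print(scontrol)
```

SStatus decodes to SPD=1, i.e. Gen1, while the SPD field of SControl is 0,
which should mean the host is not imposing a speed limit. So the 1.5 Gbps
looks like what the link negotiated after the reset.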

> ata1.00: configured for UDMA/133
> sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
> sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current] 
> sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command

That is likely the result of the automatic generation of sense data for failed
commands, which is based on the ATA status and error fields and defaults to
this when nothing else matches. (Yeah, I know, that is not pretty. But the SAT
specs in that area are a nightmare, and following them actually ends up with
this asc/ascq. I will try to do something about it.)

The host bus error is the real issue. I am not sure what triggers it, though.
Which adapter model are you using?

> sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 02 34 93 b0 00 00 68 00
> I/O error, dev sda, sector 37000112 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
> ata1: EH complete
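
As a sanity check that the write itself is aligned, the CDB and the ATA
taskfile from the log can be decoded and compared. A rough sketch, assuming
512-byte logical blocks and 4096-byte pages, and assuming the "cmd" line
prints command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
hob_lbam:hob_lbah/device (with the NCQ sector count carried in the feature
fields). The variable names are illustrative:

```python
import struct

LOGICAL_BLOCK = 512  # assumption: 512-byte logical blocks
PAGE_SIZE = 4096     # assumption: 4 KiB pages

# Write(10) CDB: opcode, flags, 32-bit LBA, group, 16-bit length, control.
cdb = bytes.fromhex("2a 00 02 34 93 b0 00 00 68 00")
opcode, flags, lba, group, nblocks, control = struct.unpack(">BBIBHB", cdb)
nbytes = nblocks * LOGICAL_BLOCK

# ATA taskfile as printed on the "cmd" line of the log.
cmd, low, hob, dev = "61/68:00:b0:93:34/00:00:02:00:00/40".split("/")
feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
hfeature, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in hob.split(":"))
tf_lba = (hlbah << 40 | hlbam << 32 | hlbal << 24 |
          lbah << 16 | lbam << 8 | lbal)
tf_count = hfeature << 8 | feature

print(hex(opcode), lba, nblocks, nbytes)
print(tf_lba == lba, tf_count == nblocks)
print((lba * LOGICAL_BLOCK) % PAGE_SIZE, nbytes % PAGE_SIZE)
```

Both decodings give LBA 37000112 and 104 sectors (53248 bytes, 13 pages), and
both the start offset and the length are multiples of 4096. So the command is
aligned, and the "Unaligned write command" sense should not be taken literally.
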
> 
> For reference, I made it dump the result of the READ CAPACITY 16 command:
> 
> sd 0:0:0:0: [sda] RC16 000000003a38602f000002000000000000000000000000000000000000000000
> 
> The drive says it has 512-byte logical and physical block sizes.
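
That matches the RC16 dump. A short sketch of the decoding, assuming the
standard READ CAPACITY (16) parameter data layout (bytes 0-7 last LBA, bytes
8-11 logical block length, low nibble of byte 13 the logical-blocks-per-
physical-block exponent); only the first 14 bytes of the dump are used:

```python
import struct

rc16_hex = "000000003a38602f000002000000000000000000000000000000000000000000"
head = bytes.fromhex(rc16_hex[:28])  # first 14 bytes of the parameter data

# Big-endian 64-bit last LBA followed by 32-bit logical block length.
last_lba, block_len = struct.unpack(">QI", head[:12])
phys_exp = head[13] & 0x0F           # log2(logical blocks per physical block)

capacity_bytes = (last_lba + 1) * block_len
physical_block = block_len << phys_exp

print(last_lba, block_len, physical_block, capacity_bytes)
```

This gives a last LBA of 976773167, 512-byte logical and physical blocks, and
a capacity of 500107862016 bytes, so the failing write at sector 37000112 is
also well inside the device.
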
> 
> The DIO writes are being generated by cachefiles and are all
> PAGE_SIZED-aligned in terms of file offset and request length.
> 
> I also saw this:
> 
> 	CacheFiles: I/O Error: Trunc-to-dio-size failed -95 [o=000001cb]
> 
> which indicates that ext4/xfs returned EOPNOTSUPP to vfs_truncate() and thence
> to cachefiles.  I'm not sure why it would do that.
> 
> Any idea what might cause this or how to investigate it further?  Is it
> possible it's some sort of hardware error in the I/O bridge or IOMMU?
> 
> Thanks,
> David
> 
> 

-- 
Damien Le Moal
Western Digital Research