On 7/17/24 04:55, David Howells wrote:
> Hi James,
>
> I'm wondering if I'm seeing a problem with DIO writes through Ext4 or XFS
> manifesting as SCSI misalignment errors. This has occurred with two different
> drives. I saw it first with v6.10-rc6, I think, but I haven't tried
> cachefiles for a while. It does happen with v6.10.
>
>     ata1.00: exception Emask 0x60 SAct 0x1 SErr 0x800 action 0x6 frozen
>     ata1.00: irq_stat 0x20000000, host bus error

Bus error is a serious error...

>     ata1: SError: { HostInt }
>     ata1.00: failed command: WRITE FPDMA QUEUED
>     ata1.00: cmd 61/68:00:b0:93:34/00:00:02:00:00/40 tag 0 ncq dma 53248 out
>              res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x60 (host bus error)
>     ata1.00: status: { DRDY }
>     ata1: hard resetting link
>     ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

That is very low... Old hardware?

>     ata1.00: configured for UDMA/133
>     sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
>     sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current]
>     sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command

That is likely the result of the automatic generation of sense data for
failed commands from the ATA status and error fields, which defaults to this
ASC/ASCQ when nothing else matches. (Yeah, I know, that is not pretty. But
the SAT specs in that area are a nightmare and following them actually ends
up with this ASC/ASCQ. I will try to do something about it.)

The host bus error is the issue. Not sure what triggers it, though.
Which adapter model are you using?

>     sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 02 34 93 b0 00 00 68 00
>     I/O error, dev sda, sector 37000112 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
>     ata1: EH complete
>
> For reference, I made it dump the result of the READ CAPACITY 16 command:
>
>     sd 0:0:0:0: [sda] RC16 000000003a38602f000002000000000000000000000000000000000000000000
>
> The drive says it has 512-byte logical and physical block sizes.
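As a cross-check on the quoted log, here is a minimal sketch that decodes the failed WRITE(10) CDB and the RC16 dump, assuming only the standard SBC field layouts (WRITE(10): opcode 0x2A, LBA in bytes 2-5, transfer length in bytes 7-8; READ CAPACITY(16) data: last LBA in bytes 0-7, logical block length in bytes 8-11, logical-blocks-per-physical-block exponent in the low nibble of byte 13). The helper names `decode_write10`, `decode_rc16` and `is_aligned` are hypothetical and exist only for this illustration; they are not kernel or sg3_utils APIs.

```python
import struct

# Hypothetical helpers for decoding the values quoted in this thread.
# Field offsets follow the SBC layouts for WRITE(10) and READ CAPACITY(16)
# parameter data; nothing here is a kernel API.


def decode_write10(cdb: bytes) -> tuple[int, int]:
    """Return (lba, transfer_length_in_blocks) from a WRITE(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    (lba,) = struct.unpack(">I", cdb[2:6])      # bytes 2-5: LBA, big-endian
    (nblocks,) = struct.unpack(">H", cdb[7:9])  # bytes 7-8: TRANSFER LENGTH
    return lba, nblocks


def decode_rc16(buf: bytes) -> dict:
    """Decode the leading fields of READ CAPACITY(16) parameter data."""
    if len(buf) < 14:
        raise ValueError("RC16 data too short")
    last_lba, block_len = struct.unpack(">QI", buf[0:12])  # bytes 0-7, 8-11
    exponent = buf[13] & 0x0F  # LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT
    return {
        "last_lba": last_lba,
        "logical_block_size": block_len,
        "physical_block_size": block_len << exponent,
        "capacity_bytes": (last_lba + 1) * block_len,
    }


def is_aligned(lba: int, nblocks: int, block_size: int, align: int) -> bool:
    """True if both the byte offset and the byte length are multiples of align."""
    return (lba * block_size) % align == 0 and (nblocks * block_size) % align == 0


if __name__ == "__main__":
    lba, nblocks = decode_write10(bytes.fromhex("2a00023493b000006800"))
    # Every RC16 byte after byte 11 is zero in the log, so pad to 32 bytes.
    rc16 = decode_rc16(bytes.fromhex("000000003a38602f00000200").ljust(32, b"\x00"))
    bs = rc16["logical_block_size"]
    print(f"WRITE(10): lba={lba} blocks={nblocks} bytes={nblocks * bs}")
    print(f"RC16: {rc16}")
    print(f"4 KiB aligned: {is_aligned(lba, nblocks, bs, 4096)}")
```

With the logged values this gives LBA 37000112 and 104 blocks (53248 bytes, matching "ncq dma 53248 out"), a last LBA of 976773167 with 512-byte logical and physical blocks, and a write that is aligned to 4 KiB in both offset and length. That is consistent with the "Unaligned write command" sense being a generic fallback rather than a real misalignment of the request.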
>
> The DIO writes are being generated by cachefiles and are all
> PAGE_SIZED-aligned in terms of file offset and request length.
>
> I also saw this:
>
>     CacheFiles: I/O Error: Trunc-to-dio-size failed -95 [o=000001cb]
>
> which indicates that ext4/xfs returned EOPNOTSUPP to vfs_truncate() and thence
> to cachefiles. I'm not sure why it would do that.
>
> Any idea what might cause this or how to investigate it further? Is it
> possible it's some sort of hardware error in the I/O bridge or IOMMU?
>
> Thanks,
> David

-- 
Damien Le Moal
Western Digital Research