Re: [Bug 200917] 4.18 regression: I/O error on external icybox disk enclosures

Klaus Kusche <klaus.kusche@xxxxxxxxxxxxxxx> · Sat, 1 Sep 2018 14:40:09 +0200

On 30/08/2018 22:37, Alan Stern wrote:
On Wed, 29 Aug 2018, Klaus Kusche wrote:

Hello,

On 24/08/2018 19:28, Alan Stern wrote:
On Fri, 24 Aug 2018, Klaus Kusche wrote:
On 24/08/2018 17:39, Alan Stern wrote:
On Fri, 24 Aug 2018, Klaus Kusche wrote:
On 24/08/2018 16:15, Alan Stern wrote:
On Fri, 24 Aug 2018, Klaus Kusche wrote:
I entered the following USB bug into kernel bugzilla yesterday:

https://bugzilla.kernel.org/show_bug.cgi?id=200917

"Since 4.18, all my external USB3-to-SATA Icybox disk enclosures with usb Id
357d:7788 (seems to be a very common controller chip: Sharkoon QuickPort XT)
fail with the following error when mounting an ext4 fs:
print_req_error: critical target error, dev sdd, sector 2048
Buffer I/O error on dev sdd1, logical block 0, lost sync page write
EXT4-fs (sdd1): I/O error while writing superblock
EXT4-fs (sdd1): mount failed

- They worked before 4.18.
- Reading is definitely ok, async writing seems to work, too.
- The problem occurs with several different disks (I only tested HGST drives).
- The same disks work in enclosures with other controllers."

It does sound like a bug in the enclosure.

It is that specific controller chip.
I tried 3 enclosures with that USB ID (all fail since 4.18),
and 5 enclosures with different USB IDs (all still work).

Please provide the output from "dmesg".

[ 3692.559336] sd 7:0:0:0: [sdd] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[ 3692.559595] sd 7:0:0:0: [sdd] Write Protect is off
[ 3692.559598] sd 7:0:0:0: [sdd] Mode Sense: 47 00 10 08
[ 3692.559881] sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 3692.575820]  sdd: sdd1
[ 3692.576941] sd 7:0:0:0: [sdd] Attached SCSI disk
[ 3725.164065] sd 7:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3725.164071] sd 7:0:0:0: [sdd] tag#0 Sense Key : Illegal Request [current]
[ 3725.164075] sd 7:0:0:0: [sdd] tag#0 Add. Sense: Invalid field in cdb
[ 3725.164080] sd 7:0:0:0: [sdd] tag#0 CDB: Write(10) 2a 08 00 00 08 00 00 00 08 00

This indicates the error occurred shortly after the drive was plugged
in.  A usbmon trace might be helpful.  Can you collect and send a trace
for bus 4, starting shortly before you plug in the drive and ending
after the error occurs?

	cat /sys/kernel/debug/usb/usbmon/4u >usb4.out

Found time to build a kernel with debugfs and usbmon and to test.
Result attached.
Does it contain the command causing the error?

The usbmon trace shows that all the preceding commands executed
correctly.  It must be the WRITE(10) which causes the problem.

Does the trace tell why the command is refused,
and if the disk drive or the controller caused the error?

The error does not happen when plugging the drive in.
It happens when rw-mounting an ext4 fs on the drive
(as far as I know, the affected sector 2048 is indeed the ext4 superblock).
Ro-mounting and reading the same ext4 fs in the same enclosure works fine,
and a vfat on such a drive can even be rw-mounted and successfully written.
Hence, obviously rw-mounting an ext4 fs emits some special write command
which fails with that controller since 4.18.
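
The claim that sector 2048 holds the superblock can be cross-checked with a little arithmetic. The following is only a sketch and not taken from the thread. It assumes that sdd1 starts at sector 2048 (the usual 1 MiB alignment, and consistent with the kernel reporting sector 2048 on sdd as logical block 0 on sdd1), that the ext4 block size is 4096 bytes, and that the primary superblock sits at byte offset 1024 of the partition as in the standard ext4 layout. All names are illustrative.

```python
SECTOR_SIZE = 512         # from dmesg: "512-byte logical blocks"
PARTITION_START = 2048    # assumption: first sector of sdd1 on sdd
FS_BLOCK_SIZE = 4096      # assumption: ext4 block size of this filesystem
SUPERBLOCK_OFFSET = 1024  # ext4 primary superblock, bytes from partition start


def superblock_fs_block():
    """Filesystem block (relative to the partition) containing the superblock."""
    return SUPERBLOCK_OFFSET // FS_BLOCK_SIZE


def fs_block_to_disk_sectors(block, partition_start=PARTITION_START):
    """Map one filesystem block to (first_sector, sector_count) on the whole disk."""
    sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE
    return partition_start + block * sectors_per_block, sectors_per_block


first_sector, sector_count = fs_block_to_disk_sectors(superblock_fs_block())
print(f"superblock write: sector {first_sector}, {sector_count} sectors")
```

Under these assumptions the superblock lies in filesystem block 0, which maps to an 8-sector write starting at disk sector 2048, matching both the print_req_error line and the "logical block 0" message.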

The command which actually failed was a perfectly standard WRITE(10),
although it has the FUA (Force Unit Access) bit turned on.  Perhaps
that caused the problem, or perhaps an earlier command sent the
controller chip into some sort of error state.
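
For reference, the CDB bytes from the dmesg output can be decoded by hand. The sketch below is an illustration, not part of the original exchange. It follows the standard WRITE(10) layout (opcode in byte 0, DPO and FUA flags in byte 1, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); the function name decode_write10 is made up for this example.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as a hex string such as '2a 08 ...'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    return {
        "opcode": cdb[0],
        "dpo": bool(cdb[1] & 0x10),   # Disable Page Out, bit 4 of byte 1
        "fua": bool(cdb[1] & 0x08),   # Force Unit Access, bit 3 of byte 1
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
        "control": cdb[9],
    }


# CDB copied from the dmesg line "tag#0 CDB: Write(10) ..."
fields = decode_write10("2a 08 00 00 08 00 00 00 08 00")
print(fields)
```

The second byte, 0x08, is the FUA flag on its own; the LBA decodes to 2048 and the transfer length to 8 blocks, so this is the same write that print_req_error reports for sector 2048.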

If it's that command: What added that command in 4.18?

The FUA bit could be turned on by a mount option, such as -o sync.  If
that's not the cause, I don't know what is.

The fs is mounted "async".

However, as I said in my initial error report,
this happens during mount while the *superblock* of the fs is written.
I think there are good reasons for writing the superblock "sync"
even when mounting the fs async:
The filesystem should be marked "in use" on the media *before*
making any other changes to the filesystem.

Was anything changed in the ext4 mount code in 4.18?

Is there any chance to circumvent or work around the failing command,
or do I have to replace my 3 enclosures with that controller?

Have you tried running e2fsck -f on the ext4 filesystem before mounting
it?

The problem occurs on all ext4 filesystems I tried,
even on newly formatted ones.
fsck says that the filesystems are clean,
and a forced fsck finds no problems.

--
Prof. Dr. Klaus Kusche
Private address: Rosenberg 41, 07546 Gera, Germany
+49 365 20413058 klaus.kusche@xxxxxxxxxxxxxxx https://www.computerix.info
Office address: DHGE Gera, Weg der Freundschaft 4, 07546 Gera, Germany
+49 365 4341 306 klaus.kusche@xxxxxxx https://www.dhge.de