On Wed, Jul 12, 2023 at 10:06:56AM +0200, Paolo Bonzini wrote:
On 7/11/23 22:21, Mike Christie wrote:
What was the issue you are seeing?
Was it something like this: you get the UA, we retry, and then on one of
the retries the sense is not set up correctly, so the SCSI error handler
runs? Then that fails and the device goes offline?
If you turn on scsi debugging you would see:
[ 335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has changed
[ 335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 20 00
[ 335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 335.451447] scsi host0: scsi_eh_0: waking up 0/2/2
[ 335.451453] scsi host0: Total of 2 commands on 1 devices require eh work
[ 335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting sense
Does this log come from internal discussions within Oracle?
I don't know the qemu scsi code well, but I scanned the code for my co-worker
and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in it.
How is locking done when it is a bus-level UA but there are multiple
devices on the bus?
No locking should be necessary; the code is single-threaded. However,
what can happen is that two consecutive calls to
virtio_scsi_handle_cmd_req_prepare use the unit attention ReqOps, and
then the second virtio_scsi_handle_cmd_req_submit finds no unit
attention (see the loop in virtio_scsi_handle_cmd_vq). That can
definitely explain the log above.
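
To make the sequence above concrete, here is a minimal Python sketch of a
two-phase (prepare, then submit) request loop. It is not QEMU code: the
Bus and Request classes and the prepare/submit/handle_queue functions are
hypothetical stand-ins for the virtio-scsi prepare/submit steps, and the
model assumes a single bus-level unit attention that the first submit
consumes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bus:
    # Hypothetical model: one bus-level unit attention shared by all devices.
    pending_ua: Optional[str] = None


@dataclass
class Request:
    tag: int
    uses_ua_ops: bool = False


def prepare(bus: Bus, req: Request) -> None:
    # Phase 1: pick the "unit attention" handler if a UA is pending right now.
    req.uses_ua_ops = bus.pending_ua is not None


def submit(bus: Bus, req: Request) -> Tuple[str, Optional[str]]:
    # Phase 2: run whichever handler was chosen in phase 1.
    if not req.uses_ua_ops:
        return ("GOOD", None)
    # The first submit consumes the UA; a later one finds nothing to report.
    sense, bus.pending_ua = bus.pending_ua, None
    return ("CHECK_CONDITION", sense)


def handle_queue(bus: Bus, reqs: List[Request]) -> List[Tuple[str, Optional[str]]]:
    # All requests are prepared before any of them is submitted.
    for req in reqs:
        prepare(bus, req)
    return [submit(bus, req) for req in reqs]


bus = Bus(pending_ua="REPORTED LUNS DATA HAS CHANGED")
results = handle_queue(bus, [Request(tag=1), Request(tag=2)])
for req_result in results:
    print(req_result)
```

In this model both requests choose the unit attention handler during
prepare, but only the first one still finds a unit attention at submit
time. The second completes with a check condition and no sense data,
which is the kind of result that would send the guest into error
handling to request sense.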
Yes, this seems to be the case!
Thank you both for the help!
Following Paolo's advice, I'm preparing a series for QEMU to solve the
problem!
Stefano