On 11/14/24 11:51, "Kai Mäkisara (Kolumbus)" wrote:>
On 7. Nov 2024, at 18.05, John Meneghini <jmeneghi@xxxxxxxxxx> wrote:
On 11/3/24 13:32, "Kai Mäkisara (Kolumbus)" wrote:
On 31. Oct 2024, at 3.00, John Meneghini <jmeneghi@xxxxxxxxxx> wrote:
Be sure to clear was_reset when ever we clear pos_unknown. If we don't
then the code in st_chk_result() will turn on pos_unknown again.
STp->pos_unknown |= STp->device->was_reset;
This results in confusion as future commands fail unexpectedly.
This brings in my mind (again) the question: is the hack using was_reset set
by scsi_error to detect device reset necessary any more? It was introduced
as a temporary method somewhere between 1.3.20 and 1.3.30 (in 1995)
when the POR (power on/reset) UAs (Unit Attention) were not passed to st.
The worst problem with this hack is clearing was_reset. St should not clear
something set by error handling (layering violation).
OK. I wasn't aware of the history here... but I agree we want to get rid of the layering violation.
I have read code and done experiments. I think I now know how the POR UAs were lost
in 1990s. And why that does not happen now.
In 1.3.30, scsi.c handled resets. When resetting, it set the device->expecting_cc_ua flag.
If UA was received when expecting_cc_ua was set, the command was retried if
retries were allowed. St allowed five retries for TEST UNIT READY and so the POR UA
was left in the midlevel. (If expecting_cc_ua was zero, UA:s were returned to the upper
level.)
Now, st does allow zero retries. In addition to this, the requests use either REQ_OP_DRV_IN
or REQ_OP_DRV_OUT and these requests are not retried. So, now POR UAs reach
st (as the experiments have indicated).
Yes, the only time I've see the POR UA not get passed to the st driver is with scsi_debug, when
the driver first loads.
[tape_tests]# modprobe -r scsi_debug
[tape_tests]# modprobe scsi_debug ptype=1 max_luns=1 dev_size_mb=10000
[Fri Nov 15 09:28:51 2024] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[Fri Nov 15 09:28:51 2024] scsi host8: scsi_debug: version 0191 [20210520]
dev_size_mb=10000, opts=0x0, submit_queues=1, statistics=0
[Fri Nov 15 09:28:51 2024] scsi 8:0:0:0: Sequential-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[Fri Nov 15 09:28:51 2024] scsi 8:0:0:0: Power-on or device reset occurred
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sg driver is clearing the UA here.
[Fri Nov 15 09:28:51 2024] st 8:0:0:0: Attached scsi tape st1
[Fri Nov 15 09:28:51 2024] st 8:0:0:0: st1: try direct i/o: yes (alignment 4 B)
[Fri Nov 15 09:28:51 2024] st 8:0:0:0: Attached scsi generic sg3 type 1
[tape_tests]# mt -f /dev/nst1 status
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] check_tape: 1064: pos_unknown 0 was_reset 0 ready 0
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] Error: 402, cmd: 5 0 0 0 0 0
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] Sense Key : Illegal Request [current]
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] Add. Sense: Invalid command operation code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There should be a UA right here.
I have no idea whey the scsi_debug driver doesn't support READ BLOCK LIMITS.
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] st_chk_result: 421: pos_unknown 0 was_reset 0 ready 0, result 1026
[Fri Nov 15 09:28:55 2024] st 8:0:0:0: [st1] Can't read block limits.
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x0 (default).
Soft error count since last status=0
General status bits on (1010000):
ONLINE IM_REP_EN
Another problem with using scsi_debug is:
[tape_tests]# sg_reset --target /dev/sg3
This does nothing. I have to use the /dev/st device to reset. This is not true if you have a real tape device.
[tape_tests]# sg_reset --target /dev/nst1
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: [st1] check_tape: 1064: pos_unknown 0 was_reset 1 ready 0
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: Power-on or device reset occurred
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: [st1] Error: 2, cmd: 0 0 0 0 0 0
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: [st1] Sense Key : Unit Attention [current]
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: [st1] Add. Sense: Scsi bus reset occurred
[Fri Nov 15 09:37:18 2024] st 8:0:0:0: [st1] st_chk_result: 421: pos_unknown 1 was_reset 1 ready 0, result 2
^^^^^^^^^^^^^^^^^^
Now we've set pos_unkown.
Your earlier patch added code to st to set pos_unknown when a POR UA
was found. So, is this now alone enough to catch the resets?
As long at the tape device actually sends a UA, no I don't think we need the was_reset code anymore.
We should remove the device->was_reset code from st.c. >
Using device->was_reset has the advantage that it does not depend on the
drives sending UA. However, I think that a probability of a working drive
not sending UA after reset is minimal.
Agreed. There are some idiosyncrasies in the scsi_debug tape emulation. But I don't want those problems to make their way into
the st driver. I don't think the st driver should be paying attention to was_reset, or anything else in the mid layer.
I now did some experiments with scsi_debug and in those experiments
reset initiated using sg_reset did result in st getting POR UA.
But this was just a simple and somewhat artificial test.
Yes, I've noticed that not all of the different emulators out there like MHVTL and scsi_debug
will reliably send a POR UA following sg_reset. Especially scsi_debug. The tape emulation in
scsi_debug is inadequate when it comes to resets AFAIAC.
I'll see if I can develop some further tests for MHVLT, but scsi_debug isn't worth the
trouble (for me) and I've told our QE group here at Red Hat to stop testing the st driver
with scsi_debug. Doing this has led to too many false positives and passing tests. That's
why I finally purchased a tape drive and started testing this myself.
I have a more positive view about scsi_debug. Its tape support is minimal, but it can be
useful if you consider what it does support and what it doesn't support.
A real drive is better, of course.
Please try my latest version of the tape_reset_debug.sh script.
https://github.com/johnmeneghini/tape_tests/blob/master/tape_reset_debug.sh
This test uses the scsi_debug driver with ptype=1 to test these patches using "sg_reset --target /dev/nst$i".
There is also the tape_reset_debug_sg.sh script. That script runs the same tests using "sg_reset --target /dev/sg$i"
They each give me different false negative and false positive results... but that may because of something I'm doing wrong.
If you want to remove the device->was_reset from the st driver I can help by running
the patch through my tests with the tape device.
I will send patches tomorrow. The patches will also include a fix to another problem:
some parameters should be restored after reset in case of some operations like rewind:
the partition, density and block size should be restored to the values set before reset.
OK, I will test those as soon as they show up.
Thanks for your testing (so far and in future).
And Thanks for your help fixing these bugs...
/John
Kai