Re: [PATCH 2/3] scsi: st: clear was_reset when CHKRES_NEW_SESSION

"Kai Mäkisara (Kolumbus)" <kai.makisara@xxxxxxxxxxx> · Thu, 14 Nov 2024 18:51:21 +0200

> On 7. Nov 2024, at 18.05, John Meneghini <jmeneghi@xxxxxxxxxx> wrote:
> 
> On 11/3/24 13:32, "Kai Mäkisara (Kolumbus)" wrote:
>>> On 31. Oct 2024, at 3.00, John Meneghini <jmeneghi@xxxxxxxxxx> wrote:
>>> 
>>> Be sure to clear was_reset when ever we clear pos_unknown. If we don't
>>> then the code in st_chk_result() will turn on pos_unknown again.
>>> 
>>>        STp->pos_unknown |= STp->device->was_reset;
>>> 
>>> This results in confusion as future commands fail unexpectedly.
>> This brings in my mind (again) the question: is the hack using was_reset set
>> by scsi_error to detect device reset necessary any more? It was introduced
>> as a temporary method somewhere between 1.3.20 and 1.3.30 (in 1995)
>> when the POR (power on/reset) UAs (Unit Attention) were not passed to st.
>> The worst problem with this hack is clearing was_reset. St should not clear
>> something set by error handling (layering violation).
> 
> OK. I wasn't aware of the history here... but I agree we want to get rid of the layering violation.
> 
I have read code and done experiments. I think I now know how the POR UAs were lost
in 1990s. And why that does not happen now.

In 1.3.30, scsi.c handled resets. When resetting, it set the device->expecting_cc_ua flag.
If UA was received when expecting_cc_ua was set, the command was retried if
retries were allowed. St allowed five retries for TEST UNIT READY and so the POR UA
was left in the midlevel. (If expecting_cc_ua was zero, UA:s were returned to the upper
level.)

Now, st does allow zero retries. In addition to this, the requests use either REQ_OP_DRV_IN
or REQ_OP_DRV_OUT and these requests are not retried. So, now POR UAs reach
st (as the experiments have indicated).

>> Your earlier patch added code to st to set pos_unknown when a POR UA
>> was found. So, is this now alone enough to catch the resets?
> 
> As long at the tape device actually sends a UA, no I don't think we need the was_reset code anymore.
> We should remove the device->was_reset code from st.c.
> 
Using device->was_reset has the advantage that it does not depend on the
drives sending UA. However, I think that a probability of a working drive
not sending UA after reset is minimal.

>> I now did some experiments with scsi_debug and in those experiments
>> reset initiated using sg_reset did result in st getting POR UA.
>> But this was just a simple and somewhat artificial test.
> 
> Yes, I've noticed that not all of the different emulators out there like MHVTL and scsi_debug
> will reliably send a POR UA following sg_reset. Especially scsi_debug. The tape emulation in
> scsi_debug is inadequate when it comes to resets AFAIAC.
> 
> I'll see if I can develop some further tests for MHVLT, but scsi_debug isn't worth the
> trouble (for me) and I've told our QE group here at Red Hat to stop testing the st driver
> with scsi_debug.  Doing this has led to too many false positives and passing tests. That's
> why I finally purchased a tape drive and started testing this myself.

I have a more positive view about scsi_debug. Its tape support is minimal, but it can be
useful if you consider what it does support and what it doesn't support.

A real drive is better, of course.

> If you want to remove the device->was_reset from the st driver I can help by running
> the patch through my tests with the tape device.

I will send patches tomorrow. The patches will also include a fix to another problem:
some parameters should be restored after reset in case of some operations like rewind:
the partition, density and block size should be restored to the values set before reset.

Thanks for your testing (so far and in future).

Kai