On 21. Dec 2024, at 0.14, John Meneghini <jmeneghi@xxxxxxxxxx> wrote: > ... > Yes, I signed-off on a patch that introduced a regression. Of course, I was unaware of this at the time. > > Laurence and I worked on commit 9604eea5bd3a (scsi: st: Add third party poweron reset handling) and tested it extensively with a scsi gateway using third party resets. The patch fixed the problem. The customer tested it. We told the customer they needed to reposition/rewind the tape with mt rewind following any third party poweron/reset event - which apparently happens not infrequently in their environment. This worked for the customer so we thought it was good. The st driver had never correctly handled a POR UA before so we didn't think the fact that MTIOGET returned EIO following a third party reset was a problem. > > However, the part that we were not aware of was: tape drives set a poweron/reset UA when the machine reboots. So we started getting complaints from different customers about the fact that MTIOCGET suddenly fails with EIO after upgrade. The work around was simple (issue an 'mt rewind') but customers did not like this, and when more than one or two customers started calling and complaining about this, we reverted the patch to avoid generating more calls. This is when I opened Bug 219419 - Can not query tape status - MTIOCGET fails with EIO. > > https://bugzilla.kernel.org/show_bug.cgi?id=219419 > > We have fixed that problem now and the fix, including commit 9604eea5bd3a, is being disseminated in rhel-8 and rhel-9. > > I see now that stinit and dd, and possibly other IOCTLS, are unexpectedly failing too. I'm not sure if we can call these regressions or not. From what I can see the st driver never really handled POR UA's correctly. It only worked with sg_reset (first party reset)... but I agree that customers probably will not like this. It is a regression because it breaks user space. First, I want to emphasise that your patch was correctly fixing the problem it was supposed to fix. And I acked it. The problems after boot were far-fetched and not easy to see. This happens. My mild complaint is that there were no reports on linux-scsi about the problems after boot. Someone might have seen these problems earlier if there were those reports. (With hindsight, Rafal Rocha's patch dealt with this problem.) But now we know where the problem is. Let'f concentrate on fixing it. ... >> The test uses modprobe to load scsi_debug (and this loads also st). After that >> the tools mentioned above were tried: > > If you want to share the details of exactly what your tests are doing (privately if you'd like) I will try to reproduce your results. Obviously, one problem here is that our tape tests here at RH (and everywhere) are inadequate - at least w.r.t. resets. I'm working to improve this. I was experimenting with different things using (slightly enhanced) csi_debug and scripts using snippets from your testing scripts. When I set scsi_level to 6 in scsi_debug, I noticed that after modprobe, stinit and also dd failed. This lead me to think that this was because now st caught the initial POR UA. The distribution I used for testing also had an udev rule to run stinit when a drive was detected. The current stinit does return 0 also when setting options fails, but I could see that the options did not get set. You should see these things with a real tape drive, too. ... > I really don't want to revert commit 9604eea5bd3ae1. It actually fixes a real problem where the tape drive behind a gateway device crashes and resets itself. Then, because the st driver ignores the POR/UA, it writes a file mark at the BOT. This destroys the customer's backup. It is a serious problem and a day-one bug. I agree that the patch is useful. That is why it is better to fix the negative consequencies. >> Modifying st to accept what stinit uses even is pos_unknown is true fixes the stinit problem, >> but dd still fails. > > OK, shouldn't dd fail if the pos_unknown is true? Basically, anytime the tape device reports that it has been reset the application NEEDS to reposition the tape. And, for that matter, the application should also check and set any options that might be wanted. The application can't just ignore these POR Unit Attentions. Yes. But I don't think pos_unknown should be set for the initial POR UA. I think it is reasonable to assume that the user does not trust knowing the tape location right after the device is added. > >> Not setting pos_unknown after the initial POR UA fixes both problems. > > That's fine with me... I just don't understand how you can distinguish "the initial POR UA" from any POR UA. If you have a patch that can figure out how to do this... that's great... let's try. I sent the patch ([PATCH] scsi: st: Regression fix: Don't set pos_unknown just after device recognition) to linux-scsi on Dec 16.