Re: Who do we point to?


 



James Bottomley wrote:
On Thu, 2008-08-21 at 16:14 +0400, Vladislav Bolkhovitin wrote:
MOANING MODE ON

Testing SCST and target drivers, I often have to deal with various failures and with how initiators recover from them. Unfortunately, my observations on Linux aren't very encouraging. See, for instance, the thread at http://marc.info/?l=linux-scsi&m=119557128825721&w=2. Receiving TASK ABORTED status from the target isn't really a failure; it's rather corner-case behavior, but it leads to immediate file system errors on the initiator, and after a remount the ext3 journal replay doesn't completely repair the damage; only a manual e2fsck helps. Even mounting with barrier=1 doesn't improve anything. The target can't be blamed for the failure, because it stayed online, its cache was fully healthy, and no commands were lost. Hence, apparently, the journaling code in ext3 isn't as reliable in the face of storage corner cases as is thought. I haven't tried that test since I reported it, but recently I've seen similar ext3 failures on 2.6.26 in other tests, so I guess the problem(s) are still there.

A software SCSI target like SCST is well suited to testing things like that, because it makes it easy to simulate any possible corner case and storage failure. Unfortunately, I don't work at the file system level and can't participate in all that great testing and fixing effort. I can only help with setup and with simulating failures.

MOANING MODE OFF

Well, since I can see you're just so anxious to stop moaning and get
coding, let me help you.

Firstly, from a standards point of view, TASK_ABORTED means that the
target is telling us this particular command was killed by another
initiator (seeing this also requires the TAS bit to be set in the
control mode page, so you can easily fix your current problem by
unsetting it).  This makes TASK_ABORTED an incredibly rare status
condition (hence the problems below).
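
To make the TAS remark concrete, here is a minimal sketch in C of testing and clearing that bit in a control mode page buffer. It assumes the SPC-3 layout as I recall it (page code 0x0A, TAS in bit 6 of byte 5); the function names and constants are hypothetical, and actually fetching the page with MODE SENSE and writing it back with MODE SELECT is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (SPC-3 control mode page): byte 0 bits 5..0 hold the
 * page code 0x0A, and bit 6 of byte 5 is TAS. Check against the spec. */
#define CONTROL_MODE_PAGE_CODE 0x0A
#define TAS_BYTE_OFFSET        5
#define TAS_BIT_MASK           0x40

/* True if the buffer looks like a control mode page long enough to hold TAS. */
bool is_control_mode_page(const uint8_t *page, size_t len)
{
    return page != NULL && len > TAS_BYTE_OFFSET &&
           (page[0] & 0x3F) == CONTROL_MODE_PAGE_CODE;
}

/* True if TAS is set, i.e. the target reports TASK ABORTED status. */
bool tas_is_set(const uint8_t *page, size_t len)
{
    return is_control_mode_page(page, len) &&
           (page[TAS_BYTE_OFFSET] & TAS_BIT_MASK) != 0;
}

/* Clears TAS in place; returns false if this is not a control mode page. */
bool clear_tas(uint8_t *page, size_t len)
{
    if (!is_control_mode_page(page, len))
        return false;
    page[TAS_BYTE_OFFSET] &= (uint8_t)~TAS_BIT_MASK;
    return true;
}
```

The modified buffer would still have to be sent back to the device for the change to take effect.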

The way the kernel currently handles it is to return SUCCESS (around
line 1411 in scsi_error.c).  This return actually propagates an I/O
error all the way up the stack.  If the filesystem is the consumer, then
how it handles the error depends on what you have the errors= switch set
to.  If you've got it set to a safety condition like remount-ro or
panic, then the fs should be recoverable on reboot (or unmount recheck).
If you have it set to something unsafe like continue, then yes, you're
asking for trouble and fs corruption ... but it's hardly the OS's fault,
you told it you didn't want to operate safely.

Yes, we already agreed in the referenced thread that two separate and completely unrelated problems were discovered here:

1. Handling of TASK_ABORTED status is different from handling "Commands
cleared by another initiator" Unit Attention.

2. The file system layer doesn't handle something well after receiving an I/O error. I use the default mount and format options, so "errors" was "remount-ro", but recovery on reboot wasn't sufficient.

We in the SCSI layer can fix (1), but only FS people can fix (2).

So, given what TASK_ABORTED means, it looks to me like the handling should
go through the maybe_retry path.  I'd say that's about a three line
patch ... and since you have the test bed, you can even try it out.
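
As a rough illustration of the difference between the two behaviours, here is a self-contained user-space sketch in C. It is not the kernel code: the enum names, struct cmd_state and decide_disposition() are hypothetical, the status values are the SAM status bytes as I recall them, and the retry-counter logic is only an assumption about what a maybe_retry path would do.

```c
#include <stdint.h>

/* SAM status byte values (assumed from SAM-3; verify before relying on them). */
enum sam_status {
    SAM_GOOD         = 0x00,
    SAM_BUSY         = 0x08,
    SAM_TASK_ABORTED = 0x40
};

/* Hypothetical dispositions, loosely modelled on the error handler's. */
enum disposition { DISP_SUCCESS, DISP_NEEDS_RETRY, DISP_FAILED };

/* Hypothetical per-command retry bookkeeping. */
struct cmd_state {
    int retries;  /* retries already attempted */
    int allowed;  /* maximum retries permitted */
};

/*
 * retry_task_aborted == 0: TASK ABORTED completes the command, so the
 *                          error propagates up the stack.
 * retry_task_aborted != 0: TASK ABORTED goes through maybe_retry instead.
 */
enum disposition decide_disposition(uint8_t status, struct cmd_state *cmd,
                                    int retry_task_aborted)
{
    switch (status) {
    case SAM_GOOD:
        return DISP_SUCCESS;
    case SAM_BUSY:
        goto maybe_retry;
    case SAM_TASK_ABORTED:
        if (retry_task_aborted)
            goto maybe_retry;
        return DISP_SUCCESS;
    default:
        return DISP_FAILED;
    }

maybe_retry:
    /* Retry while the budget lasts; afterwards complete the command and
     * let the upper layers see the error. */
    if (++cmd->retries <= cmd->allowed)
        return DISP_NEEDS_RETRY;
    return DISP_SUCCESS;
}
```

With allowed set to 2, a command that keeps coming back TASK ABORTED would be retried twice under the second behaviour before the error is finally passed up.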

OK, I'll prepare it.

James



