On Fri, May 10, 2013 at 5:53 PM, Martin K. Petersen
<martin.petersen@xxxxxxxxxx> wrote:
>>>>>> "Baruch" == Baruch Even <baruch@xxxxxxxxx> writes:
>
> Baruch> Actually reducing the timeouts is probably not a good approach
> Baruch> since it will cause the host to take a more radical approach
> Baruch> without waiting sufficiently for a potential recovery.
>
> Reducing the eh timeout is a requirement in many clustered setups. We've
> been shipping a predecessor to this patch in our kernels for a long
> time.
>
> Baruch> In addition the more radical error handlings such as host reset
> Baruch> will destroy other paths for completely unrelated devices/links,
> Baruch> from my experience a host reset is usually not required and the
> Baruch> Linux kernel currently reaches to this big hammer too fast.
>
> I'm also working on a patch to add some heuristics to avoid the HBA and
> bus resets if I/O is completing successfully on other attached
> targets. But that's an orthogonal issue.

Why? In my experience (again, SAS-based, inside a storage device) the
reduced eh timeout is more likely to escalate problems than to resolve
the issue.

I actually find that the higher level should have a small timeout of its
own to do its own recovery work, which normally entails going to other
copies of the data where available while letting the device keep trying
to complete the IO if possible. I'm not sure how applicable this is to
the kernel itself, but I do feel it could be relevant.

Baruch
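
To make the "small timeout at the higher level" idea concrete, here is a rough user-space sketch. It is not kernel code and not part of any patch in this thread. The names read_with_fallback, primary, replicas and soft_timeout are all made up for illustration, and the replica reads stand in for whatever "other copies of the data" means in a given stack. The point is only that the slow read is never aborted: the caller stops waiting for it and tries elsewhere.

```python
import concurrent.futures as cf
import time


def read_with_fallback(primary, replicas, soft_timeout):
    """Hypothetical upper-layer read with its own short timeout.

    primary      -- callable that performs the read on the preferred device
    replicas     -- callables that read the same data from other copies
    soft_timeout -- seconds to wait on the primary before trying replicas

    Returns a (source, data) tuple. The primary read is left running when
    the soft timeout expires, so the device still gets its chance to
    complete the I/O.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        pending = pool.submit(primary)
        try:
            return "primary", pending.result(timeout=soft_timeout)
        except cf.TimeoutError:
            pass  # primary is still in flight; do not abort it
        except Exception:
            pass  # primary failed outright; fall through to the replicas

        for index, replica in enumerate(replicas):
            try:
                return "replica%d" % index, replica()
            except Exception:
                continue  # this copy is unavailable, try the next one

        # No other copy worked: wait for the primary without a deadline
        # (re-raises the primary's own error if it failed).
        return "primary", pending.result()
    finally:
        pool.shutdown(wait=False)  # do not block on the in-flight read


if __name__ == "__main__":
    def slow_primary():
        time.sleep(0.3)  # simulate a device that is struggling
        return b"data from primary"

    def healthy_replica():
        return b"data from replica"

    print(read_with_fallback(slow_primary, [healthy_replica], 0.05))
```

In this sketch the upper layer's deadline only decides when to look for another copy. What the lower layer does with the stuck command, and when, is a separate decision.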