I've had problems like this happen to me on 3par too. What kernel version
are you using? It almost always happened when the SAN got an RSCN (usually
when another server was rebooted). I found that, at least in kernel 2.6.11.7,
if I changed the line
bio->bi_rw |= (1 << BIO_RW_FAILFAST); to
bio->bi_rw |= (0 << BIO_RW_FAILFAST);
in drivers/md/dm-mpath.c
the problem went away. In the newest kernels, after the big change to the
qla drivers (2.6.12-rc? and beyond, I believe), I no longer need to make
the above change, but I now sometimes get aborts (these aborts apparently
come from the qlogic card). The aborts recover, but I have been unable to
determine why I am getting them.
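
For anyone following along, here is a minimal standalone sketch of why that one-character change matters. It is not kernel code: struct fake_bio is a hypothetical stand-in for struct bio, and the bit position 3 for BIO_RW_FAILFAST is an assumption for illustration (check include/linux/bio.h in your tree). The point is only that 0 shifted left by any amount is still 0, so OR-ing it into the flags is a no-op and the fail-fast bit never gets set.

```c
#include <assert.h>

/* Assumption: bit position of the fail-fast flag; 3 is illustrative only. */
#define BIO_RW_FAILFAST 3

/* Hypothetical stand-in for struct bio; only the flags word matters here. */
struct fake_bio {
    unsigned long bi_rw;
};

/* Mirrors the original line: sets the fail-fast bit on the request. */
static void map_original(struct fake_bio *bio)
{
    bio->bi_rw |= (1UL << BIO_RW_FAILFAST);
}

/* Mirrors the modified line: (0 << n) is 0, so the OR changes nothing. */
static void map_modified(struct fake_bio *bio)
{
    bio->bi_rw |= (0UL << BIO_RW_FAILFAST);
}

/* Returns non-zero if the fail-fast bit is set. */
static int is_failfast(const struct fake_bio *bio)
{
    return (bio->bi_rw & (1UL << BIO_RW_FAILFAST)) != 0;
}

/* Small self-check that exercises both variants on fresh flag words. */
static void check_demo(void)
{
    struct fake_bio a = { 0 }, b = { 0 };

    map_original(&a);
    map_modified(&b);

    assert(is_failfast(&a));   /* original: bit is set */
    assert(!is_failfast(&b));  /* modified: bit stays clear */
    assert(b.bi_rw == 0);      /* modified: flags untouched */
}
```

So the edit effectively turns off fail-fast for I/O mapped through multipath. Presumably that lets the lower layers retry through a short disruption (such as the one around an RSCN) instead of failing the path right away, which would fit what was observed, but that part is a guess and not something confirmed here.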
Andy
We're running 2.6.9-11.ELsmp, from Red Hat ES 4.1. I don't have the full
list of Red Hat patches on hand, so I can't say for sure what is in it. Nor
can I actually modify our kernel without losing support for the box. If
this is fixed with a kernel upgrade, we can open a support ticket with
Red Hat and scream/yell until they apply the patch.
However, I'd like to know what the exact issue is. I'm not exactly great
at diagnosing issues with the Linux kernel right now. How were you
monitoring what events the SAN was sending up through the card? I could
use this to at least verify what is happening if/when we lose another
mount. None of our servers were being rebooted when this happened though.
-Alan