Re: [PATCH 7/8] qla2xxx: Stall mid-layer error handlers while rport is blocked.

Mike Christie wrote:
James Smart wrote:
Given this is the 3rd instance of this (qla2xxx, lpfc, mpt fusion),
we should either:

- Fix the error handler. (but we all know this is a lot of work,
    of which none of us have the time to do, nor expect it to
    be complete in time for our next distro delivery).

I understand the bugs in the eh. I have worked around them in iscsi and
tried to fix them in scsi-ml :) (still working on the queuecommand
SCSI_ML_HOST/DEVICE_BUSY fix), but along with the problems in the eh
where we could get the device offlined there could really be times when
the device needs to be offlined and reonlined, right?

True...

For iscsi we do
not really worry about either, in our userspace daemon we have code
where if the device was offlined and the daemon has corrected the
problem (or in qla4xxx case has been notified that the problem has been
corrected), then we online the devices.

Ok - but that's not really the intent around offlining.  Offlining implies
that recovery steps were taken, but it didn't result in a functional device,
thus retries are likely to fail as well - which implies that device media
is corrupt and could use some user interaction to clean up (filesystem check
and the like). So - it's not always the best idea to simply online after
resolving the link state for the device.

That said, there are scenarios in which we lose connectivity altogether
to the device, but never end up taking it offline (e.g. we fail the i/o
outright in queuecommand without going through the error handler). I would
assume the device media is as screwed up as when offline was justified.
Perhaps it's because we expect the upper layers to be preserving the data
that was contained in the retry (consider block cache data) and to
eventually attempt to resync this data when connectivity to the device is
restored. If we have this scenario, it implies that simply onlining once
the link state is OK is all right.

Since FC has added a netlink
interface, could we add something like fc_rport state changed event
support to some daemon? The daemon could online the device when the
rport state is back up if needed.
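
The daemon idea above might be sketched roughly as follows. This is only an
illustration under stated assumptions, not code from any existing tool. It
reads state files instead of receiving a netlink or kobject event. It assumes
the rport state can be read from a file such as
/sys/class/fc_remote_ports/rport-*/port_state (reporting "Online" when the
port is up), and that the SCSI device state file (for example
/sys/class/scsi_device/*/device/state) accepts "running" to bring an offline
device back. The function name online_if_rport_up is hypothetical.

```python
from pathlib import Path


def online_if_rport_up(rport_state_file, device_state_file):
    """Online an offlined SCSI device only if its rport reports Online.

    Both paths are assumptions standing in for the sysfs attributes
    named in the lead-in; pass the real files for a given host.
    Returns True if the device state was rewritten, False otherwise.
    """
    port_state = Path(rport_state_file).read_text().strip()
    dev_state = Path(device_state_file).read_text().strip()

    # Act only on the exact case discussed: link restored, device
    # still marked offline by the error handler.
    if port_state == "Online" and dev_state == "offline":
        Path(device_state_file).write_text("running\n")
        return True
    return False
```

A real daemon would call something like this when it learns of an rport state
change. As noted earlier in the thread, onlining unconditionally is not always
appropriate, so a policy check before the write would likely be needed.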

Well - this is similar to what we talked about in the storage bof at
OLS.  We decided to add kobject calls for block and unblock to the
block device. These are the events you could key off of. I'll complete
the patch for these events.

I was also thinking that the iscsi code has some common features and
maybe iscsi and fc could share something in some sort of blktool daemon.

Sounds useful - we just have to make sure it's keeping the system sane.

Or do you think the userspace daemon is more of a hack in userspace? I
cannot tell when I am hacking around something in the kernel or doing
something nifty in userspace anymore :)

As pointed out above, unconditionally onlining a device is not really
the right solution, although it is what users unconditionally do after
losing connectivity and encountering the offline state. Does anyone else
have thoughts here? Users really don't understand the offline state
or the need for manual onlining once they believe they have restored
connectivity to the device.

-- james