On 2018-06-28 09:38 AM, Martin Wilck wrote:
[I've added Hannes, Ben and Douglas to the recipient list to fill in
knowledge from the past that I may lack.]

tl;dr summary: We've got 3 issues:

 1) Why does multipath, in reinstate_paths(), try to reinstate paths
    which are known to be down?
 2) rescan-scsi-bus.sh can call "multipath" even if the "-m" switch is
    not used (that looks like a bug to me).
 3) In Jiaojianbing's environment, dead paths that have been removed on
    the target and were already marked "offline" may appear as
    "running" after rescan-scsi-bus.sh invocation.

Furthermore, 4) perhaps rescan-scsi-bus.sh should replace suboptimal
"multipath" calls with multipathd cli commands (or better even, we
multipath-tools people should eventually finish the "delegate to
multipathd" work).

On Thu, 2018-06-28 at 06:35 +0000, Jiaojianbing wrote:
> > > Dear Christophe,
> > > when dm-105 is in one state of below, paths of dm-105 will change
> > > to active if we run command of multipath.
> >
> > Could you be more specific please? What multipath command did you
> > run? Which version of multipath-tools are you running?
>
> command is "multipath", which can run in shell as below:
> #multipath

... and if I understand correctly, originally the problem occurred
while running rescan-scsi-bus.sh. Please also state the version of
sg3_utils you are using.

> And the version: multipath-tools v0.4.9 (05/33, 2016)

Well, that's ancient. But latest multipath-tools still has the same
code.

> > > I check code of multipath, it sends message "reinstate_path
> > > pathname" to kernel in routine reinstate_paths when status of
> > > pathgroup = "PGSTATE_ENABLED/PGSTATE_UNDEF" and path's state =
> > > "PSTATE_FAILED". why command of multipath do above action to all
> > > dm devices? actually, parts of these paths are already offline or
> > > failed which can't be recovered. Maybe we can check these
> > > devices's status by sending io to these sd device at first.
> > > according to return of io, multipath send reinstate to running
> > > devices and do nothing to failed devices?
> >
> > I see this code in reinstate_paths():
> >
> >     vector_foreach_slot (pgp->paths, pp, j) {
> >         if (pp->state != PATH_UP &&
> >             (pgp->status == PGSTATE_DISABLED ||
> >              pgp->status == PGSTATE_ACTIVE))
> >             continue;
> >
> >         if (pp->dmstate == PSTATE_FAILED) {
> >             if (dm_reinstate_path(mpp->alias, pp->dev_t))
> >                 condlog(0, "%s: error reinstating", pp->dev);
> >         }
> >     }
> >
> > The reinstate command is only sent for paths which are either in
> > PATH_UP state, or belong to a PGSTATE_ENABLED path group. I admit
> > I'm unsure why we try to reinstate paths that we know are down.
> > This is 13-year-old code.
> >
> > Interestingly, the state of your paths changes from "faulty
> > offline" to "ready running". So it appears that these paths are
> > actually _not_ down. Just the reinstate seems to have failed on
> > them. multipathd -v3 logs and possibly kernel logs would be helpful
> > to understand what was going on in that situation.
>
> Sorry, maybe my two multipath status sample confused you. They are
> just sample. Actually, I run command "rescan-scsi-bus" to clear all
> mapped scsi devices by iscsid in host when all of LUNS in remote
> IPSAN are removed. In process of running rescan-scsi-bus, if command
> "multipath" is running, the status of dm's path will change from
> failed to active in some moment as below. If IO is sent to dm-105,
> the process who sends io will be in D state.
>
> # multipath -ll
> 36d0d04b100b8cba665a187f0000000f9 dm-105 HUAWEI ,XSG1
> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=active
>   `- 18:0:0:101 sdku 67:288 active faulty running

The strange part here is that the device is considered "running". This
is the state of the kernel device. If the LUNs are actually _removed_
as you say, the device should be gone, or at least marked "offline".
Apparently the SCSI bus SCAN via iSCSI still showed the LUN in a
workable state. For multipath this translates to PATH_UP.
Thus even if the above code didn't have the
(pgp->status == PGSTATE_DISABLED || pgp->status == PGSTATE_ACTIVE)
clause, the reinstate would have been attempted by multipath.

This looks like a low-level problem in your SCSI or iSCSI layer to me.
This looks like the actual problem to me. multipath aside, if the path
appears to be "running", any Linux process could try to send IO down
to it and be stuck, as you say.

> I want to know whether command "multipath" is reasonable in
> reinstate_paths(). And maybe we should not call "multipath" in
> process of running rescan-scsi-bus?

Normally rescan-scsi-bus.sh should call "multipath" only if the
"-m|--multipath" switch was used. I quickly scanned through the code
and didn't find a call to "multipath" (with no options) which wasn't
guarded by the [ -n "$mp_enable" ] condition. (FTR: there is a call to
"multipath -f" from main->flushmpaths if "-f|--flush" is set).

Again, please double-check your version of sg3_utils, and perhaps run
"bash -x rescan-scsi-bus.sh" to figure out the call chain which runs
the "multipath" command.

Thanks,
Martin
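
A minimal sketch of the kind of pre-check discussed above, looking at
the kernel's own view of each sd device before anything is done with
it. It reads the SCSI device state from sysfs
(/sys/block/<dev>/device/state), which is where values such as
"running" and "offline" come from. The function names scsi_dev_state
and report_non_running, and the optional sysfs-root argument, are
hypothetical and are not part of rescan-scsi-bus.sh or multipath-tools.
Note the limitation raised in this thread: a device can be reported as
"running" even though the LUN behind it is gone, so this check can only
rule paths out, not prove that they are usable.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from rescan-scsi-bus.sh or multipath-tools.

# Print the kernel's SCSI device state for a block device name such as
# "sdku", or "missing" if there is no readable sysfs state file for it.
# $2 is an optional sysfs root (default: /sys) so the function can be
# tried against a scratch directory.
scsi_dev_state() {
    local dev=$1 root=${2:-/sys}
    local f="$root/block/$dev/device/state"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo missing
    fi
}

# List every sd device that the kernel does not consider "running",
# one "<dev> <state>" pair per line.
report_non_running() {
    local root=${1:-/sys} d dev state
    for d in "$root"/block/sd*; do
        [ -e "$d" ] || continue     # glob matched nothing
        dev=${d##*/}
        state=$(scsi_dev_state "$dev" "$root")
        [ "$state" = "running" ] || echo "$dev $state"
    done
}
```

Calling "scsi_dev_state sdku" prints the state of that one device;
"report_non_running" with no argument walks /sys/block and lists the
devices a caller might want to skip.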
Hi,
My upstream version of rescan-scsi-bus.sh is attached. The last change
was the --ignore-rev option from Gris Ge <fge@xxxxxxxxxx>. He has sent
several cleanups in the last year, usually via Hannes' github site for
sg3_utils.

My ChangeLog entry to that script (since sg3_utils 1.42) is:

  - rescan-scsi-bus.sh: harden code
    - fixes from Suse; bump version
    - bump version to 20180615
    - add to install list in Makefile, hope it does not
      clash with other package providing it
    - add --ignore-rev to ignore revision change

If there are no further changes it will be like that in sg3_utils-1.43
revision 780.

Doug Gilbert
Attachment:
rescan-scsi-bus.sh
Description: application/shellscript
-- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel