Brian De Wolf wrote:
> Hello again,
>
> I've been testing multipath-tools' rdac capability with a qla2xxx HBA and an IBM
> DS4800 some more and I've hit another stumbling block. When I test unplugging
> one of the HBA ports and plugging it back in with multipath running, it seems to
> cause bad things to happen. Here is what the syslog looks like (note: sdb is a
> path, sdd is initially unused, and sde is the second path):
>
> Jul 19 14:30:35 jimbo kernel: qla2xxx 0000:02:01.1: LOOP DOWN detected (2).
> Jul 19 14:30:41 jimbo kernel: rport-4:0-0: blocked FC remote port time out: removing target and saving binding
> Jul 19 14:30:41 jimbo kernel: sd 4:0:0:0: [sde] Synchronizing SCSI cache
> Jul 19 14:30:41 jimbo kernel: sd 4:0:0:0: [sde] Result: hostbyte=0x01 driverbyte=0x00
> Jul 19 14:30:48 jimbo multipathd: sde: rdac checker reports path is down
> Jul 19 14:30:48 jimbo multipathd: checker failed path 8:64 in map test
> Jul 19 14:30:48 jimbo kernel: scsi 4:0:0:0: rejecting I/O to dead device
> Jul 19 14:30:48 jimbo kernel: device-mapper: multipath: Failing path 8:64.
> Jul 19 14:30:48 jimbo multipathd: test: remaining active paths: 1
> Jul 19 14:30:48 jimbo multipathd: test: switch to path group #2
> Jul 19 14:30:52 jimbo kernel: qla2xxx 0000:02:01.1: LIP reset occured (f700).
> Jul 19 14:30:52 jimbo kernel: qla2xxx 0000:02:01.1: LIP occured (f700).
> Jul 19 14:30:52 jimbo kernel: qla2xxx 0000:02:01.1: LIP reset occured (f7f7).
> Jul 19 14:30:53 jimbo kernel: scsi 4:0:0:0: rejecting I/O to dead device
> Jul 19 14:30:53 jimbo multipathd: sde: rdac checker reports path is down
> Jul 19 14:30:53 jimbo kernel: qla2xxx 0000:02:01.1: LOOP UP detected (4 Gbps).
> Jul 19 14:30:53 jimbo kernel: scsi 4:0:0:0: Direct-Access IBM 1815 FAStT 0914 PQ: 0 ANSI: 3
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] 6291456 512-byte hardware sectors (3221 MB)
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Write Protect is off
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Mode Sense: 77 00 10 08
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] 6291456 512-byte hardware sectors (3221 MB)
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Write Protect is off
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Mode Sense: 77 00 10 08
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
> Jul 19 14:30:53 jimbo kernel: sdd: sdd1
> Jul 19 14:30:53 jimbo kernel: sd 4:0:0:0: [sdd] Attached SCSI disk
> Jul 19 14:30:53 jimbo kernel: scsi 4:0:0:0: Direct-Access IBM 1815 FAStT 0914 PQ: 0 ANSI: 3
> Jul 19 14:30:53 jimbo kernel: kobject_add failed for 4:0:0:0 with -EEXIST, don't try to register things with the same name in the same directory.
> Jul 19 14:30:53 jimbo kernel:
> Jul 19 14:30:53 jimbo kernel: Call Trace:
> Jul 19 14:30:53 jimbo kernel: [<ffffffff802e1d9b>] kobject_shadow_add+0x187/0x191
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8033a495>] device_add+0xa1/0x59d
> Jul 19 14:30:53 jimbo kernel: [<ffffffff803638e8>] scsi_sysfs_add_sdev+0x2e/0x24a
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80361f18>] scsi_probe_and_add_lun+0x6ff/0x80f
> Jul 19 14:30:53 jimbo kernel: [<ffffffff803612c8>] scsi_alloc_sdev+0x195/0x1ea
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80362580>] __scsi_scan_target+0x3e9/0x549
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80416d83>] thread_return+0x0/0xe2
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80362777>] scsi_scan_target+0x97/0xbc
> Jul 19 14:30:53 jimbo kernel: [<ffffffff88003668>] :scsi_transport_fc:fc_scsi_scan_rport+0x59/0x79
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8800360f>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x79
> Jul 19 14:30:53 jimbo kernel: [<ffffffff802379c4>] run_workqueue+0x84/0x105
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80237a45>] worker_thread+0x0/0xf4
> Jul 19 14:30:53 jimbo kernel: [<ffffffff80237b2f>] worker_thread+0xea/0xf4
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8023addd>] autoremove_wake_function+0x0/0x2e
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8023addd>] autoremove_wake_function+0x0/0x2e
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8023a888>] kthread+0x3d/0x63
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8020a338>] child_rip+0xa/0x12
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8023a84b>] kthread+0x0/0x63
> Jul 19 14:30:53 jimbo kernel: [<ffffffff8020a32e>] child_rip+0x0/0x12
> Jul 19 14:30:53 jimbo kernel:
> Jul 19 14:30:53 jimbo kernel: error 1
> Jul 19 14:30:53 jimbo kernel: scsi 4:0:0:0: Unexpected response from lun 0 while scanning, scan aborted
> Jul 19 14:30:53 jimbo scsi.agent[8613]: disk at /devices/pci0000:00/0000:00:02.0/0000:02:01.1/host4/rport-4:0-0/target4:0:0/4:0:0:0
> Jul 19 14:30:53 jimbo multipathd: sdd: add path (uevent)
> Jul 19 14:30:53 jimbo kernel: scsi 4:0:0:0: rejecting I/O to dead device
> Jul 19 14:30:53 jimbo multipathd: sde: checker msg is "rdac checker reports path is down"
> Jul 19 14:30:53 jimbo kernel: device-mapper: multipath rdac: using RDAC command with timeout 15000
> Jul 19 14:30:53 jimbo kernel: device-mapper: table: 254:6: multipath: error getting device
> Jul 19 14:30:53 jimbo kernel: device-mapper: ioctl: error adding target to table
> Jul 19 14:30:53 jimbo multipathd: test: failed in domap for addition of new path sdd
> Jul 19 14:30:53 jimbo multipathd: test: uev_add_path sleep
> ...
>
> From here, the last 5 lines get repeated until I 'kill -9' the multipathd
> process. I'm not too well versed in kernel internals (though playing with
> multipathing is bringing me up to speed pretty quickly), but I'm wondering if
> multipathd is causing the call trace by not letting /dev/sde disappear so that
> the HBA's scsi device can grab that name again. I noticed this via lsof:
>
> multipath 8390 root  5r BLK 8,64 22254 /dev/sde (deleted)
> multipath 8390 root  6r BLK 8,16  1100 /dev/sdb
> multipath 8390 root 10r BLK 8,48 23647 /dev/sdd
>
> When multipathd is running, unplugging and plugging in one of the ports causes
> it to grab the next sd* device name. As this is repeated, the number of deleted
> block devices multipathd holds on to grows, along with the number of unhappy
> rdac checkers. As I said before, it takes a 'kill -9' to stop multipathd, and
> subsequent replugs choose sd* names that were previously used but were held
> onto as (deleted) by multipathd.
>
> However, this behavior is not seen when multipathd is not running. When the
> port is unplugged, the /dev/sd* device disappears, and when it is plugged back
> in, it takes the same name it had before (I assume it's just taking the lowest
> name, and its old name has been freed) cleanly, with no call traces or anything.
>
> Any ideas on how to correct this behavior?
>

Hmm.
multipathd really should react to the 'remove' events for sdX.
Checking ...
Looks as if it does. And it is even supposed to stop the path checker.

Care to run multipathd with full debugging (i.e. -v 4) and post the output?
My guess is that somehow the path checker is not stopped and the fd
is kept open, so that the device is not released properly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@xxxxxxx                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
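
A note on checking the "fd is kept open" guess from userspace: on Linux, each entry in /proc/<pid>/fd is a symlink to the file that descriptor refers to, and the kernel appends " (deleted)" to the link target once the file has been unlinked. This is the same information the lsof listing in the quoted message shows. The following is only a minimal sketch of that check, not anything taken from multipath-tools; the function name `deleted_open_files` and its `proc_root` parameter are made up for illustration.

```python
import os

# Suffix the Linux kernel appends to a /proc/<pid>/fd link target
# once the underlying file has been unlinked.
DELETED_SUFFIX = " (deleted)"


def deleted_open_files(pid, proc_root="/proc"):
    """Return {fd: path} for descriptors of `pid` that point at unlinked files.

    `proc_root` is a hypothetical parameter so the function can be pointed
    at a fake directory tree; on a real system leave it at "/proc".
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    held = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.endswith(DELETED_SUFFIX):
            held[int(name)] = target[: -len(DELETED_SUFFIX)]
    return held
```

Called as root with the multipathd PID (8390 in the lsof output above), this would be expected to list /dev/sde while the stale descriptor is still held, and the list should grow with each unplug/replug cycle if the checker's fd is never closed. A file whose real name happens to end in " (deleted)" would show up as a false positive.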