Hi Ben, On Wed, 2018-11-21 at 12:20 +0100, Hannes Reinecke wrote: > On 11/21/18 11:50 AM, Martin Wilck wrote: > > Hi Ben, > > > > I recently encountered a strange and disturbing phenomenon with the > > latest upstream code. > > > > The system in question had two iSCSI connections on separate > > networks. > > On one of them, I tested iSCSI disconnect and re-connects, and > > broken > > storage configurations (different LUNs diffenent sizes and with the > > same WWID). The other one was idle. > > > > During one test, I observed that multipathd's internal > > representation > > of the map representing the *idle* connection had vanished. The map > > was > > still present on the device-mapper level, but multipathd seemed to > > have > > "forgotten" it completely. The respective paths were tracked as > > orphans. > > > > Here is a log excerpt. multipathd is working on path additions for > > the > > map I was testing with (149455...), when a DM event is received. > > Right > > after ev_add_path() releases the lock, the idle map > > (36001405...) is removed (not flushed, as the messages suggest). > > > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path > > (uevent) > > # uev_add_path->pathinfo() > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev > > property SCSI_IDENT_LUN_T10 whitelisted > > # ev_add_path()->adopt_paths() > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership > > set to 149455400000000006d706174683100000000000000000000 > > ... > > # ev_add_path->setup_map() > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: > > 149455400000000006d706174683100000000000000000000: failback = > > "manual" (setting: multipath internal) > > # ev_add_path->domap() > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: > > 149455400000000006d706174683100000000000000000000: load table [0 > > 2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ... > > # ev_add_path->setup_multipath()->update-multipath_strings > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: > > 149455400000000006d706174683100000000000000000000: disassemble map > > [0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ... > > # (!!!!) devmap event for LIO map > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: > > 36001405d08f748dd4c5481f9ec22b888: devmap event #20 > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg [8:96]: > > path added to devmap > > 149455400000000006d706174683100000000000000000000 > > # ev_add_path() returns, vecs->lock is released > > # uev_add_path() goes on to next path, lock currently not held > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path > > (uevent) > > # (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888) > > # this looks like remove_map() or remove_map_and_stop_waiter() > > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan > > > path, map flushed > > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs > > > prioritizer refcount 2 > > > Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan > > > path, map flushed > > > > To me this looks like the map was removed by the DM event handler. > > More detail can be found on > > https://bugzilla.opensuse.org/show_bug.cgi?id=1116272. > > I've also tried to reproduce the problem, with no success so far > > (actually I haven't observed any dmevents in those reproduction > > attempts). > > > > After encountering this problem, I've scrutinized your polling > > dmevent > > handler code over and over last week, but I couldn't spot any > > possible > > problems :-) > > > > The concern that I have is with the design of the polling dmevent > > API, > > in particular the fact that the default action is EVENT_REMOVE. If, > > for > > whatever reason, one map is missing from the return value of the > > DM_DEVICE_LIST ioctl, multipath removes the map immediately, and > > there's nothing short of a "reconfigure" or "add map" CLI command > > that > > would reinstate the map. IOW, we remove the map not on a kernel > > event > > saying "this map has been removed", but on receiving a list where > > this > > element happens to be missing. We handle this differently for > > paths, > > where we wait for a "remove" uevent before we really delete the > > path > > from our data structures. (Note that the messages file I saved from > > the > > event above shows no sign of such an uevent ever beeing sent - as I > > said, the dm map was still present after the above occured). > > > > What do you think about this? Could you maybe inspect those logs I > > took, to make sure I didn't get on a totally wrong track here? > > > There's an easy solution to it: drop the dmevent handling code in > multipath-tools completely. > It's original design was to track _external_ map reloads, as > originally > the maps had been setup by 'multipath', and 'multipahtd' was just > tracking state changes. As such multipathd needed to track those > changes, and it had been using dmevents for this. > However, with the advent of udev and updates to the multipathd > daemon > external map reloads rarely happens, and even if they happen we > still > would be notified by uevents. > So we might as well drop the dmevent code and rely on udev > completely. > Which is what we do nowadays anyway. I like this idea, in particular for map removal. Ben, what do you think? Martin -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel