On Thu, 2019-03-21 at 17:31 -0500, Benjamin Marzinski wrote: > > ideally, we would be able to determine whether or not udev was able > to > get all the necessary information. It would be nice to be notified if > scsi_id failed or udev timed out. The latter (udev timeout) won't work without significant changes to udev/systemd. Currently udevd simply kills hanging workers with SIGKILL, and doesn't bother to send an "incomplete uevent" message to monitor listeners. What we *might* do in multipathd is listen for kernel uevents _and_ udev uevents, and figuring out possible problems by relating and comparing them. It would be a non-trivial task; we'd have to deal with the possibility that kernel events might get lost, or might have happened before multipathd was started. During startup, this wouldn't help us. Rather, it would enable us to react on events during runtime which we currently miss. I see this as a feature enhancement - possible, but really hard to get right, and not directly related to the problems we are currently dealing with. If we look at udev information for a block device (either during device probing or uevent processing), and the WWID is not set but some other properties (e.g. ID_PATH) are, we can be pretty sure that scsi_id or sg_inq have failed during processing of the last uevent for that device. If, in this case, we used our fallback action to retrieve the WWID, we'd be able to determine if it was a very short-lived problem or something more serious. > > Whatever we do, we should stop trying to "fix" the path WWID in > > disassemble_map(). That's *so* against the separation of concerns > > principle. In getuid(), we might check if a path with missing WWID > > is > > already part of an existing multipath map, and then set the path > > WWID > > from the map WWID as sort-of a last emergency fallback. But that, > > too, > > should only be done during startup (assuming that a previous > > multipath > > or multipathd instance had set up the map correctly, and that udev > > information had been "lost" since then), and only after retrying as > > described above. > > We don't want to remove paths from multipath devices because > multipathd > started up when the path was missing udev information. The udev > properties are trickier, but if we simply have a null WWID, it makes > sense to allow it as a last resort if the device otherwise appears to > have the same paths as it previously did. I agree, but I don't agree with filling in pp->wwid from mpp->wwid. > users can always run > > # multipath -f > > to remove the device. If it looks like some of the paths are supposed > to > change on the device, we should quite possibly not include paths with > a > null WWID, because we don't know what has changed. But we can do > this > someplace else than in disassemble_map(). How would you determine that something is "supposed to change"? > > > Note that since by-property blacklisting was introduced in 2013, > > significant progress has been made in other areas. We have > > blacklisting > > by transport now, "find_multipaths", the "failed_wwids" logic that > > avoids repeated attempts at setting up maps for busy devices, and > > the > > INIT_MISSING_UDEV logic to deal with incomplete initialization. The > > udev rules have been improved as well. So, doing away with > > "required > > udev properties" may not be so dangerous, after all. > > > > Thoughts? > > Another option would be to do some extra work in reconfigure. If we > held on to the old path, and cleaned up everything but the old udev > device and file descriptor, we could be sure that the kernel wouldn't > reuse that device major:minor while we were reconfiguring. If we got > some paths without their udev information, we would have the old udev > information to check against the new config, to see if the device > should > be removed. Again, this works best if we could determine if we were > missing udev information. Although in this case we could probably > just > use any path that became blacklisted because of not having the > necessary > property information. IOW, we shouldn't blacklist these paths, which is what I was trying to say. Your idea to hold the references until reconfigure() is finished sounds clever to me. The idea of blacklisting-by-missing-properties is to determine if ID_SERIAL for a given WWID is _reliable_. It should be used if we get a non-zero ID_SERIAL and at the same time none of the required properties, e.g. ID_WWN. IMO, if this is the case, we can be certain that scsi_id did *not* fail - after all, it was able to obtain ID_SERIAL. OTOH, if neither ID_SERIAL nor ID_WWN is set, failure to access the device is likely. Thus the solution here is simple: We should apply "blacklisting by missing property" *only* if ID_SERIAL is set, but ID_WWN is not. Martin -- Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel