On Wed, 2011-08-17 at 13:50 +0200, Martin Svec wrote:
> Hi Nicholas,
>
> > Mmmm, I think the right solution here would be ignoring the extra '-'
> > characters at the point that the vpd_unit_serial attribute is set
> > via configfs.. However, this would still obviously cause an issue
> > of the NAA WWN changing..
>
> I think the following points should be solved:
>
> (a) How many existing production setups can be affected in the same
> way as my lab cluster? My setup is quite special because I run LIO on
> top of active/passive DRBD, generate my own serials to maintain LUN
> identities across DRBD nodes, access the configfs plane directly using
> my own library instead of rtsadmin/lio-utils, etc. I can easily change
> the serial number generator because we don't use LIO in production yet,
> but that does not solve the problem for others.
>

Yes, stripping out the '-' from the vpd_unit_serial value in the
userspace code setting this value, with the configfs parsing code as an
extra safeguard, makes the most sense..
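For reference, here's a rough userspace sketch of the effect we're
after (this is not the actual target_core code; the function name,
buffer size and example serials below are made up for illustration):
whether the '-' is stripped when vpd_unit_serial is set or simply
skipped during the conversion into the designator, serials that differ
only in '-' placement should end up producing identical NAA bytes.

/*
 * Illustrative sketch only, not the in-kernel conversion code: pack the
 * hex characters of a vpd_unit_serial string into a vendor-specific
 * identifier buffer, skipping '-' and any other non-hex characters.
 */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void pack_hex_serial(const char *serial, unsigned char *out,
			    size_t outlen)
{
	size_t nib = 0;

	memset(out, 0, outlen);

	for (; *serial && nib < outlen * 2; serial++) {
		int val;

		if (!isxdigit((unsigned char)*serial))
			continue;	/* skip '-' and other separators */

		val = isdigit((unsigned char)*serial) ?
			*serial - '0' :
			tolower((unsigned char)*serial) - 'a' + 10;

		if (nib % 2 == 0)
			out[nib / 2] = val << 4;	/* high nibble first */
		else
			out[nib / 2] |= val;
		nib++;
	}
}

int main(void)
{
	/* Made-up example UUID, with and without '-' separators */
	const char *dashed   = "f81d4fae-7dec-11d0-a765-00a0c91e6bf6";
	const char *undashed = "f81d4fae7dec11d0a76500a0c91e6bf6";
	unsigned char a[12], b[12];

	pack_hex_serial(dashed, a, sizeof(a));
	pack_hex_serial(undashed, b, sizeof(b));

	/* Both forms of the serial map to identical designator bytes. */
	printf("identical: %s\n", memcmp(a, b, sizeof(a)) ? "no" : "yes");
	return 0;
}

With the skip-non-hex rule in place, both forms of the same UUID map to
the same bytes, so the resulting NAA WWN stays stable.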
> (b) Are there any restrictions for the vpd_unit_serial format in the
> T10 specifications? Now, afaik configfs allows me to set an arbitrary
> string...
>

This is defined as a VENDOR SPECIFIC IDENTIFIER for the NAA IEEE
Registered Extended designator format, so the answer to that would be
no..

> (c) If there are no restrictions for the serial number format, NAA
> should probably be generated using a hash function (e.g. SHA) instead
> of hex2bin. The present implementation can easily produce identical
> NAAs for two different serial numbers, which is really bad.
>

Well, we currently expect userspace to be in charge of generating a
UUID that is unique for vpd_unit_serial. hex2bin is simply doing the
conversion for the NAA designator here, and should not be in charge of
making sure what's set in vpd_unit_serial is really unique.

> (d) IMHO this issue should be solved during this mainline release,
> because the growing number of LIO target users will make future fixes
> harder.
>
> > How severe is the breakage with VMware here when the NAA WWN changes..?
> > Does this require a logout -> relogin from the perspective of the ESX
> > client..? Or does this cause issues with on-disk metadata for VMFS
> > that references existing NAA WWNs..?
>
> Well, first of all, I'm not a VMware expert. Based upon my tests and
> research over the last two days, this is a serious headache for VMware
> ESX(i). ESX >= 3.5 uses the NAA identifier as a guaranteed-unique
> signature of a physical volume and saves a copy of the NAA to the VMFS
> header. When establishing a storage session, the on-disk signatures of
> VMFS extents are compared with the actual NAAs presented by the storage
> to avoid data corruption, maintain multiple paths to a single volume,
> etc.
>
> In practice, when I changed the NAA of an active VMFS volume with
> running VMs, it resulted in an unrecoverable error (see
> kb.vmware.com/kb/1003416):
>
> "ALERT: NMP: vmk_NmpVerifyPathUID: The physical media represented by
> device naa.600140535a4c2c4daa90dd591dc453dd (path vmhba34:C0:T0:L8)
> has changed. If this is a data LUN, this is a critical error."
>

Ugh, so this is actually written into the VMFS header.

So to verify this again, this only happens when you try to upgrade while
the VMFS is active and mounted, correct..?

> I didn't test an NAA change of an inactive, unmounted VMFS volume, but
> I expect that VMware will treat such a volume as a storage snapshot and
> a resignature will be needed. See kb.vmware.com/kb/1011387 or the
> http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/
> blog post.
>

Ok, expecting folks to have to unmount a VMFS volume before upgrading to
a v3.1 (e.g. a new major release) kernel on the target is not completely
out of the question. Would it be possible for you to verify whether this
is really the case?

> In all cases, nontrivial effort is probably necessary to make it work
> again. It seems to me that the easiest solution (and the only solution
> without downtime) is to migrate all VMs to another shared storage using
> Storage vMotion, destroy the VMFS volume, change the NAA, recreate the
> VMFS and migrate the VMs back. (But if somebody else knows an easy way
> to restore an active VMFS volume after an NAA change, please tell me :-))
>

Ok, I'll likely have to end up reverting this to the old logic to avoid
this altogether for -rc3, but I would really like to know the severity
of this first for the 'inactive unmounted VMFS volume' case.

I think that forcing existing users to have to do this is not completely
out of the question when upgrading from out-of-tree -> mainline code,
but we need to ensure that VMFS is intelligent enough to recover from
this in the first place.

Thank you for your input here,

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html