Hi Nicholas,
> Mmmm, I think the right solution here would be ignoring the extra '-'
> characters at the point that the vpd_unit_serial attribute is set via
> configfs.. However, this would obviously still cause an issue with the
> NAA WWN changing..
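For reference, dropping the dashes at set time could be as simple as
something like the sketch below (a rough userspace illustration of the
idea only, not the actual configfs store handler; the function name is
made up):

#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copy 'in' to 'out', silently dropping any '-' characters, so that
 * "535a4c2c-4daa-..." and "535a4c2c4daa..." store the same serial. */
static void copy_serial_without_dashes(const char *in, char *out,
                                       size_t out_len)
{
        size_t j = 0;

        for (; *in && j + 1 < out_len; in++) {
                if (*in == '-')
                        continue;
                out[j++] = *in;
        }
        out[j] = '\0';
}

int main(void)
{
        char buf[64];

        copy_serial_without_dashes("535a4c2c-4daa-90dd-591d-c453dd000001",
                                   buf, sizeof(buf));
        printf("%s\n", buf);    /* 535a4c2c4daa90dd591dc453dd000001 */
        return 0;
}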
I think the following points should be addressed:
(a) How many existing production setups could be affected in the same
way as my lab cluster? My setup is quite special: I run LIO on top of
active/passive DRBD, generate my own serial numbers to maintain LUN
identities across DRBD nodes, and access configfs directly through my
own library instead of rtsadmin/lio-utils, etc. I can easily change the
serial number generator because we don't use LIO in production yet,
but that does not solve the problem for others.
(b) Do the T10 specifications impose any restrictions on the
vpd_unit_serial format? At the moment, AFAIK, configfs allows me to set
an arbitrary string...
(c) If there are no restrictions on the serial number format, the NAA
should probably be generated from the serial with a hash function (e.g.
SHA) instead of hex2bin. The present implementation can easily produce
identical NAAs for two different serial numbers, which is really bad
(see the sketch after this list).
(d) IMHO this issue should be solved during this mainline release,
because the growing number of LIO target users will make future fixes
harder.
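To make point (c) concrete, here is a rough userspace illustration of
the collision (my own sketch, not the in-kernel code; the prefix is
illustrative, the serials are made up, and the packing is simplified to
what I believe the kernel effectively does): only the hex digits of the
serial survive into the 25 vendor-specific nibbles of the NAA-6
identifier, so two serials that share their leading hex digits, or that
differ only in non-hex characters, end up with the same NAA. Hashing
the whole serial (e.g. with SHA-1) before packing the nibbles would
avoid that.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Pack the hex digits of 'serial' into the 25 vendor-specific nibbles
 * of an NAA-6-like hex string, skipping anything that is not a hex
 * digit and zero-padding short serials. */
static void naa_from_serial(const char *serial, char *naa, size_t naa_len)
{
        size_t i;

        snprintf(naa, naa_len, "6001405");   /* NAA type 6 + an IEEE OUI */
        i = strlen(naa);
        for (; *serial && i + 1 < naa_len; serial++) {
                if (!isxdigit((unsigned char)*serial))
                        continue;            /* '-' and friends vanish here */
                naa[i++] = (char)tolower((unsigned char)*serial);
        }
        while (i + 1 < naa_len)
                naa[i++] = '0';              /* pad short serials with zeros */
        naa[i] = '\0';
}

int main(void)
{
        char a[33], b[33];                   /* 32 hex digits + NUL */

        naa_from_serial("535a4c2c-4daa-90dd-591d-c453dd000001", a, sizeof(a));
        naa_from_serial("535a4c2c4daa90dd591dc453dd-000002", b, sizeof(b));
        printf("%s\n%s\n", a, b);            /* prints the same NAA twice */
        return 0;
}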
> How severe is the breakage with VMWare here when the NAA WWN changes..?
> Does this require a logout -> relogin from the perspective of the ESX
> client..? Or does this cause issues with on-disk metadata for VMFS that
> references existing NAA WWNs..?
Well, first of all, I'm not a VMware expert. Based on my tests and
research over the last two days, this is a serious headache for VMware
ESX(i). ESX >= 3.5 uses the NAA identifier as a guaranteed-unique
signature of a physical volume and saves a copy of the NAA in the VMFS
header. When establishing a storage session, the on-disk signatures of
VMFS extents are compared with the actual NAAs presented by the
storage, in order to avoid data corruption, maintain multiple paths to
a single volume, etc.
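Conceptually (this is only my own sketch of the behaviour described in
the VMware KBs, not their actual code or on-disk layout), the check
boils down to comparing the NAA recorded in the VMFS header against the
NAA the device currently reports:

#include <stdio.h>
#include <string.h>

struct vmfs_extent {
        char recorded_naa[33];  /* NAA copied into the VMFS header at creation */
};

/* Returns 0 if the extent still sits on the same physical media,
 * a critical error otherwise. */
static int verify_extent(const struct vmfs_extent *ext,
                         const char *reported_naa)
{
        if (strcmp(ext->recorded_naa, reported_naa) != 0) {
                fprintf(stderr, "physical media for naa.%s has changed\n",
                        reported_naa);
                return -1;
        }
        return 0;
}

int main(void)
{
        struct vmfs_extent ext =
                { .recorded_naa = "600140535a4c2c4daa90dd591dc453dd" };

        /* Same NAA -> OK; after the NAA changes, the extent is refused. */
        return verify_extent(&ext, "600140535a4c2c4daa90dd591dc453dd");
}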
In practice, when I changed the NAA of an active VMFS volume with
running VMs, the result was an unrecoverable error (see
kb.vmware.com/kb/1003416):
"ALERT: NMP: vmk_NmpVerifyPathUID: The physical media represented by
device naa.600140535a4c2c4daa90dd591dc453dd (path vmhba34:C0:T0:L8)
has changed. If this is a data LUN, this is a critical error."
I didn't test an NAA change on an inactive, unmounted VMFS volume, but
I expect that VMware will treat such a volume as a storage snapshot and
that it will need to be resignatured. See kb.vmware.com/kb/1011387 or
the blog post at
http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/.
In all cases, nontrivial effort will probably be necessary to make it
work again. It seems to me that the easiest solution (and the only one
without downtime) is to migrate all VMs to another shared storage using
Storage vMotion, destroy the VMFS volume, change the NAA, recreate the
VMFS and migrate the VMs back. (But if somebody else knows an easy way
to restore an active VMFS volume after an NAA change, please tell me :-))
Martin