Hi Nicholas,
> Mmmm, I think the right solution here would be ignoring the extra '-'
> characters at the point that the vpd_unit_serial attribute is set via
> configfs.. However, this would obviously still cause an issue with the
> NAA WWN changing..
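For reference, dropping the dashes at set time could be as simple as
something like the sketch below (a rough userspace illustration of the
idea only, not the actual configfs store handler; the function name is
made up):

#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copy 'in' to 'out', silently dropping any '-' characters, so that
 * "535a4c2c-4daa-..." and "535a4c2c4daa..." store the same serial. */
static void copy_serial_without_dashes(const char *in, char *out,
                                       size_t out_len)
{
        size_t j = 0;

        for (; *in && j + 1 < out_len; in++) {
                if (*in == '-')
                        continue;
                out[j++] = *in;
        }
        out[j] = '\0';
}

int main(void)
{
        char buf[64];

        copy_serial_without_dashes("535a4c2c-4daa-90dd-591d-c453dd000001",
                                   buf, sizeof(buf));
        printf("%s\n", buf);    /* 535a4c2c4daa90dd591dc453dd000001 */
        return 0;
}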
I think the following points should be addressed:
(a) How many existing production setups could be affected in the same
way as my lab cluster? My setup is quite special: I run LIO on top of
active/passive DRBD, generate my own serial numbers to maintain LUN
identities across DRBD nodes, and access configfs directly through my
own library instead of rtsadmin/lio-utils, etc. I can easily change the
serial number generator because we don't use LIO in production yet,
but that does not solve the problem for others.
(b) Do the T10 specifications impose any restrictions on the
vpd_unit_serial format? At the moment, AFAIK, configfs allows me to set
an arbitrary string...
(c) If there are no restrictions on the serial number format, the NAA
should probably be generated from the serial with a hash function (e.g.
SHA) instead of hex2bin. The present implementation can easily produce
identical NAAs for two different serial numbers, which is really bad
(see the sketch after this list).
(d) IMHO this issue should be solved during this mainline release,
because the growing number of LIO target users will make future fixes
harder.
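To make point (c) concrete, here is a rough userspace illustration of
the collision (my own sketch, not the in-kernel code; the prefix is
illustrative, the serials are made up, and the packing is simplified to
what I believe the kernel effectively does): only the hex digits of the
serial survive into the 25 vendor-specific nibbles of the NAA-6
identifier, so two serials that share their leading hex digits, or that
differ only in non-hex characters, end up with the same NAA. Hashing
the whole serial (e.g. with SHA-1) before packing the nibbles would
avoid that.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Pack the hex digits of 'serial' into the 25 vendor-specific nibbles
 * of an NAA-6-like hex string, skipping anything that is not a hex
 * digit and zero-padding short serials. */
static void naa_from_serial(const char *serial, char *naa, size_t naa_len)
{
        size_t i;

        snprintf(naa, naa_len, "6001405");   /* NAA type 6 + an IEEE OUI */
        i = strlen(naa);
        for (; *serial && i + 1 < naa_len; serial++) {
                if (!isxdigit((unsigned char)*serial))
                        continue;            /* '-' and friends vanish here */
                naa[i++] = (char)tolower((unsigned char)*serial);
        }
        while (i + 1 < naa_len)
                naa[i++] = '0';              /* pad short serials with zeros */
        naa[i] = '\0';
}

int main(void)
{
        char a[33], b[33];                   /* 32 hex digits + NUL */

        naa_from_serial("535a4c2c-4daa-90dd-591d-c453dd000001", a, sizeof(a));
        naa_from_serial("535a4c2c4daa90dd591dc453dd-000002", b, sizeof(b));
        printf("%s\n%s\n", a, b);            /* prints the same NAA twice */
        return 0;
}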
> How severe is the breakage with VMWare here when the NAA WWN changes..?
> Does this require a logout -> relogin from the perspective of the ESX
> client..? Or does this cause issues with on-disk metadata for VMFS that
> references existing NAA WWNs..?
Well, first of all, I'm not a VMware expert. Based on my tests and
research over the last two days, this is a serious headache for VMware
ESX(i). ESX >= 3.5 uses the NAA identifier as a guaranteed-unique
signature of a physical volume and saves a copy of the NAA in the VMFS
header. When establishing a storage session, the on-disk signatures of
VMFS extents are compared with the actual NAAs presented by the
storage, in order to avoid data corruption, maintain multiple paths to
a single volume, etc.
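Conceptually (this is only my own sketch of the behaviour described in
the VMware KBs, not their actual code or on-disk layout), the check
boils down to comparing the NAA recorded in the VMFS header against the
NAA the device currently reports:

#include <stdio.h>
#include <string.h>

struct vmfs_extent {
        char recorded_naa[33];  /* NAA copied into the VMFS header at creation */
};

/* Returns 0 if the extent still sits on the same physical media,
 * a critical error otherwise. */
static int verify_extent(const struct vmfs_extent *ext,
                         const char *reported_naa)
{
        if (strcmp(ext->recorded_naa, reported_naa) != 0) {
                fprintf(stderr, "physical media for naa.%s has changed\n",
                        reported_naa);
                return -1;
        }
        return 0;
}

int main(void)
{
        struct vmfs_extent ext =
                { .recorded_naa = "600140535a4c2c4daa90dd591dc453dd" };

        /* Same NAA -> OK; after the NAA changes, the extent is refused. */
        return verify_extent(&ext, "600140535a4c2c4daa90dd591dc453dd");
}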
In practice, when I changed the NAA of an active VMFS volume with
running VMs, the result was an unrecoverable error (see
kb.vmware.com/kb/1003416):
"ALERT: NMP: vmk_NmpVerifyPathUID: The physical media represented by
device naa.600140535a4c2c4daa90dd591dc453dd (path vmhba34:C0:T0:L8)
has changed. If this is a data LUN, this is a critical error."
I didn't test an NAA change on an inactive, unmounted VMFS volume, but
I expect that VMware will treat such a volume as a storage snapshot and
that it will need to be resignatured. See kb.vmware.com/kb/1011387 or
the blog post at
http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/.
In all cases, nontrivial effort will probably be necessary to make it
work again. It seems to me that the easiest solution (and the only one
without downtime) is to migrate all VMs to another shared storage using
Storage vMotion, destroy the VMFS volume, change the NAA, recreate the
VMFS and migrate the VMs back. (But if somebody else knows an easy way
to restore an active VMFS volume after an NAA change, please tell me :-))
Martin