On 02/22/2012 07:22 AM, Dan Bretherton wrote:
> I would really appreciate a quick Yes/No answer to the most important
> question - is it safe to create, modify and delete files in a volume
> during a fix-layout operation after an expansion?

My first reaction is to say yes, that should generally be safe. Fix-layout
changes some xattrs on directories that control where new files are placed,
but even if a file ends up on the "wrong" brick, the algorithm to find it
anyway is pretty robust. It is possible to see layout anomalies if a client
looks up a directory at the exact moment that its layout xattrs are being
updated, but that should be a very rare and transient case.

OTOH, the log entries below do seem to indicate that there's something going
on that I don't understand. I'll dig a bit, and let you know if I find
anything to change my mind wrt the safety of restoring write access.

>
> The users are champing at the bit waiting for me to let them have write
> access, but fix-layout is likely to take several days based on previous
> experience.
>
> -Dan
>
> On 02/22/2012 02:52 AM, Dan Bretherton wrote:
>> Dear All-
>> There are a lot of the following type of errors in my client and NFS
>> logs following a recent volume expansion.
>>
>> [2012-02-16 22:59:42.504907] I [dht-layout.c:682:dht_layout_dir_mismatch]
>> 0-atmos-dht: subvol: atmos-replicate-0; inode layout - 0 - 0; disk
>> layout - 920350134 - 1227133511
>> [2012-02-16 22:59:42.534399] I [dht-common.c:524:dht_revalidate_cbk]
>> 0-atmos-dht: mismatching layouts for /users/rle/TRACKTEMP/TRACKS
>> [2012-02-16 22:59:42.534521] I [dht-layout.c:682:dht_layout_dir_mismatch]
>> 0-atmos-dht: subvol: atmos-replicate-1; inode layout - 0 - 0; disk
>> layout - 1227133512 - 1533916889
>>
>> I have expanded the volume successfully many times in the past. I can
>> think of several possible reasons why this one might have gone wrong,
>> but without expert advice I am just guessing.
>>
>> 1) I did precautionary ext4 filesystem checks on all the bricks and
>> found errors on some of them, mostly things like this:
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 104386076, i_blocks is 3317792, should be 3317800. Fix? yes
>>
>> 2) I always use hostname.domain for new GlusterFS servers when doing
>> "gluster peer probe HOSTNAME" (e.g. gluster peer probe
>> bdan14.nerc-essc.ac.uk). I normally use hostname.domain (e.g.
>> bdan14.nerc-essc.ac.uk) when creating volumes or adding bricks as well,
>> but for the last brick I added I just used the hostname (bdan14). I can
>> do "ping bdan14" from all the servers and clients, and the only access
>> to the volume from outside my subnetwork is via NFS.
>>
>> 3) I found some old GlusterFS client processes still running, probably
>> left over from previous occasions when the volume was auto-mounted. I
>> have seen this before and I don't know why it happens, but normally I
>> just kill unwanted glusterfs processes without affecting the mount.
>>
>> 4) I recently started using more than one server to export the volume
>> via NFS in order to spread the load. In other words, two NFS clients
>> may mount the same volume exported from two different servers. I don't
>> remember reading anywhere that this is not allowed, but as this is a
>> recent change I thought it would be worth checking.
>>
>> 5) I normally let people carry on using a volume while a fix-layout
>> process is going on in the background. I don't remember reading that
>> this is not allowed but I thought it worth checking. I don't do
>> migrate-data after fix-layout because it doesn't work on my cluster.
>> Normally the fix-layout completes without error and no "mismatching
>> layout" errors are observed. However the volume is now so large that
>> fix-layout usually takes several days to complete, and that means that
>> a lot more files are created and modified during fix-layout than
>> before. Could the continued use of the volume during the lengthy
>> fix-layout be causing the layout errors?
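On that last point: as I said above, concurrent use during fix-layout should
generally be safe in itself. One thing you could do in the meantime is check
what fix-layout has actually written for one of the directories in those
messages. As root on each server, something like the following (the brick
path here is just a placeholder for wherever that brick's data actually
lives) dumps the layout xattr that fix-layout rewrites:

  getfattr -n trusted.glusterfs.dht -e hex \
      /path/to/brick/users/rle/TRACKTEMP/TRACKS

Each brick's copy of the directory stores its own hash range in that xattr.
If the ranges from all the bricks for that directory together cover the whole
32-bit hash space with no gaps or overlaps, the on-disk layout is sane and
the messages are most likely just clients revalidating stale cached layouts;
gaps or all-zero ranges would be more worrying.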
>>
>> I have run fix-layout 3 times now and the second attempt crashed. All I
>> can think of doing is to try again now that several back-end filesystems
>> have been repaired. Could any of the above factors have caused the
>> layout errors, and can anyone suggest a better way to remove them? All
>> comments and suggestions would be much appreciated.
>>
>> Regards
>> Dan.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
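PS: if you do kick off another fix-layout run once the back-end filesystems
are repaired, you should be able to watch its progress from the command line
with something like

  gluster volume rebalance atmos status

(assuming the volume really is called "atmos", as the "0-atmos-dht" log
prefixes suggest). That at least makes it obvious whether a multi-day run is
still making progress or has quietly died like the second one did.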