3.3.2 bug? "gfid different on <vol>"

jbd at podomatic.com (Justin Dossey) · Thu, 17 Oct 2013 10:03:39 -0700

Hello,

I'm currently running a data migration, transferring about 450,000
directories to my GlusterFS cluster.  Right now, I have one
distributed-replicated volume with six bricks.

To migrate my files, I am running a simple shell script that does something
like this:

mkdir <new directory>
rsync -a --inplace <old directory> <new directory>
mv <old directory> <old directory>.old
ln -s <new directory> <old directory>

The old directory structure is on a non-GlusterFS NFS mount.  It is too
flat (all those directories are in one subdir) and the new structure is
hashed with a maximum of 256 subdirs per directory.
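
A layout like this can be sketched roughly as below. The hash function is
not stated here, so MD5 of the directory name, the three-level depth, and
the helper name hashed_path are all assumptions made for illustration. One
byte (two hex digits) per level gives at most 256 subdirs per directory.

```shell
#!/bin/sh
# Hypothetical helper: map a directory name to a hashed path.
# Assumptions (not from the original script): MD5 of the name is the hash,
# and three levels are used, each named after one byte (two hex digits).
hashed_path() {
    root=$1
    name=$2
    # md5sum prints "<32 hex digits>  -"; keep only the digest.
    digest=$(printf '%s' "$name" | md5sum | cut -d' ' -f1)
    l1=$(printf '%s' "$digest" | cut -c1-2)
    l2=$(printf '%s' "$digest" | cut -c3-4)
    l3=$(printf '%s' "$digest" | cut -c5-6)
    printf '%s/%s/%s/%s/%s\n' "$root" "$l1" "$l2" "$l3" "$name"
}
```

Calling hashed_path /cache/ts somename prints the target directory for
"somename" under /cache/ts.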

To speed the migration along, I'm running it in parallel (with GNU
Parallel) at a level of 16 (so I move 16 people simultaneously).  As I am
writing a lot of files to GlusterFS, I am using the native (fuse) client.

After a few minutes, I get a funky issue: one of my file operations
becomes totally unresponsive.  Looking at the GlusterFS log on the client
shows me messages like this:

[2013-10-17 16:30:32.334799] W [dht-common.c:416:dht_lookup_dir_cbk] 0-UDS8-dht: /cache/ts/5a/5f/05/elchicote: gfid different on UDS8-replicate-0
[2013-10-17 16:30:32.335463] W [dht-common.c:416:dht_lookup_dir_cbk] 0-UDS8-dht: /cache/ts/5a/5f/05/elchicote: gfid different on UDS8-replicate-1

The hung call in this case is a mkdir of /cache/ts/5a/5f/05/elchicote.

Attempting to list /cache/ts/5a/5f/05/elchicote on the client causes ls to
hang.
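
One way to confirm the mismatch would be to read the trusted.gfid xattr of
the directory directly on each brick and compare the values. The sketch
below assumes getfattr (from the attr package) is available on the storage
servers and is run as root. The brick root /export/brick1 and the function
names gfid_of and check_gfids are hypothetical, not taken from this setup.

```shell
#!/bin/sh
# Print the hex gfid stored on a brick-side path (run on the storage server).
# getfattr prints a line like "trusted.gfid=0x..."; keep only the value.
gfid_of() {
    getfattr -n trusted.gfid -e hex "$1" 2>/dev/null |
        sed -n 's/^trusted\.gfid=//p'
}

# Read lines of the form "<brick path> <gfid hex>" on stdin and report
# whether all bricks agree. The input format is an assumption of this sketch.
check_gfids() {
    distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -le 1 ]; then
        echo "gfid consistent"
    else
        echo "gfid MISMATCH ($distinct distinct values)"
        return 1
    fi
}

# Example, with a hypothetical brick root:
#   gfid_of /export/brick1/cache/ts/5a/5f/05/elchicote
```

Collecting one "<brick path> <gfid hex>" line per brick and piping them into
check_gfids shows at a glance whether the bricks disagree.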

I haven't done any gfid munging here, so I am very concerned that this is
a bug in GlusterFS 3.3.2.

-- 
Justin Dossey
CTO, PodOmatic