Can you get a process state dump of the brick process hosting the
'0-vol_home-client-2' subvolume? That should give some clues about what
happened to the missing rename call. (A sketch of one way to capture such a
dump is appended below, after the quoted mail.)

Avati

On Wed, Jun 13, 2012 at 7:02 AM, Jeff White <jaw171 at pitt.edu> wrote:

> I recently upgraded my dev cluster to 3.3. To do this I copied the data
> out of the old volume onto a bare disk, wiped out everything about
> Gluster, installed the 3.3 packages, created a new volume (I wanted to
> change my brick layout), then copied the data back into the new volume.
> Previously everything worked fine, but now my users are complaining of
> random errors when compiling software.
>
> I enabled debug logging for the clients and I see this:
>
> x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:02.783526] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:02.783584] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:45.726083] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-0: returning as transport is already disconnected OR there are no frames (0 || 0)
> [2012-06-12 17:12:45.726154] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-3: returning as transport is already disconnected OR there are no frames (0 || 0)
> [2012-06-12 17:12:45.726171] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-1: returning as transport is already disconnected OR there are no frames (0 || 0)
> *[2012-06-12 17:15:35.888437] E [rpc-clnt.c:208:call_bail] 0-vol_home-client-2: bailing out frame type(GlusterFS 3.1) op(RENAME(8)) xid = 0x2015421x sent = 2012-06-12 16:45:26.237621. timeout = 1800*
> [2012-06-12 17:15:35.888507] W [client3_1-fops.c:2385:client3_1_rename_cbk] 0-vol_home-client-2: remote operation failed: Transport endpoint is not connected
> [2012-06-12 17:15:35.888529] W [dht-rename.c:478:dht_rename_cbk] 0-vol_home-dht: /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp: rename on vol_home-client-2 failed (Transport endpoint is not connected)
> [2012-06-12 17:15:35.889803] W [fuse-bridge.c:1516:fuse_rename_cbk] 0-glusterfs-fuse: 2776710: /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp -> /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class => -1 (Transport endpoint is not connected)
> [2012-06-12 17:15:35.890002] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_new+0xb) [0x36e3613d6b] (-->/usr/lib64/libglusterfs.so.0(get_new_dict_full+0x27) [0x36e3613c67] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890167] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890258] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890311] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890363] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> ...and so on, more of the same.
>
> If I enable debug logging on the bricks I see thousands of these lines
> every minute and I'm forced to disable the logging:
>
> [2012-06-12 15:32:45.760598] D [io-threads.c:268:iot_schedule] 0-vol_home-io-threads: LOOKUP scheduled as fast fop
>
> Here's my config:
>
> # gluster volume info
> Volume Name: vol_home
> Type: Distribute
> Volume ID: 07ec60be-ec0c-4579-a675-069bb34c12ab
> Status: Started
> Number of Bricks: 4
> Transport-type: tcp
> Bricks:
> Brick1: storage0-dev.cssd.pitt.edu:/brick/0
> Brick2: storage1-dev.cssd.pitt.edu:/brick/2
> Brick3: storage0-dev.cssd.pitt.edu:/brick/1
> Brick4: storage1-dev.cssd.pitt.edu:/brick/3
> Options Reconfigured:
> diagnostics.brick-log-level: INFO
> diagnostics.client-log-level: INFO
> features.limit-usage: /home/cssd/jaw171:50GB,/cssd:200GB,/cssd/jaw171:75GB
> nfs.rpc-auth-allow: 10.54.50.*,127.*
> auth.allow: 10.54.50.*,127.*
> performance.io-cache: off
> cluster.min-free-disk: 5
> performance.cache-size: 128000000
> features.quota: on
> nfs.disable: on
>
> # rpm -qa | grep gluster
> glusterfs-fuse-3.3.0-1.el6.x86_64
> glusterfs-server-3.3.0-1.el6.x86_64
> glusterfs-3.3.0-1.el6.x86_64
>
> Name resolution is fine everywhere, every node can ping every other node
> by name, no firewalls are running anywhere, and there are no disk errors
> on the storage nodes.
>
> Did the way I copied the data out of one volume and back into another
> cause this (some xattr problem)? What else could be causing this problem?
> I'm looking to go to production with GlusterFS on a 242-node (soon to
> grow) HPC cluster at the end of this month.
>
> Also, one of my co-workers improved upon an existing remote quota viewer
> written in Python. I'll post the code soon for those interested.
>
> --
> Jeff White - Linux/Unix Systems Engineer
> University of Pittsburgh - CSSD
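
A minimal sketch of one way to take the statedump Avati is asking for, run on
the storage node that hosts the vol_home-client-2 brick. The commands are the
stock 3.3 CLI; the dump location and file name pattern are assumptions, since
they are governed by the server.statedump-path option and commonly default to
/tmp or /var/run/gluster depending on the build. <brick-pid> is a placeholder
for the PID reported by volume status.

# gluster volume status vol_home
# gluster volume statedump vol_home
# kill -USR1 <brick-pid>
# ls -lt /tmp/*.dump* /var/run/gluster/*.dump* 2>/dev/null

"volume status" lists the PID of each brick process; "volume statedump" asks
glusterd to dump state for every brick of the volume, while sending SIGUSR1 to
a single glusterfsd dumps only that brick. The resulting file includes the
brick's pending call frames, which should show whether the RENAME that bailed
out after the 1800-second timeout is still stuck there.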
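
On the xattr question at the end of Jeff's mail: as long as the data went back
in through a client mount, the new volume assigns its own extended attributes,
but they are easy to double-check on a brick. A hedged example, run as root on
a storage node; the path is a placeholder for one of the failing files under
the brick root shown in the volume info above:

# getfattr -d -m . -e hex /brick/2/<path/to/failing/file>

Every file on a brick should carry a trusted.gfid attribute and every
directory a trusted.glusterfs.dht layout entry; missing or inconsistent values
there would suggest a copy/xattr problem, whereas the call_bail and "Transport
endpoint is not connected" messages above read more like the client losing its
connection to that one brick.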