Also, in the logfiles on the clients, it looks like I get these types of messages whenever I try to access a file that is no longer accessible:

2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068570: /hourlogs/myDir0/1243432800.log => -1 (5)
2009-06-11 07:58:24 E [fuse-bridge.c:436:fuse_entry_cbk] glusterfs-fuse: 22068579: /hourlogs/myDir1/1243400400.log => -1 (116)
2009-06-11 07:58:24 E [unify.c:850:unify_open] unify: /hourlogs/myDir1/1243400400.log: entry_count is 3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir1/1243400400.log: found on afr3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir1/1243400400.log: found on afr2
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir1/1243400400.log: found on afr-ns
2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068580: /hourlogs/myDir1/1243400400.log => -1 (5)
2009-06-11 07:58:24 E [fuse-bridge.c:436:fuse_entry_cbk] glusterfs-fuse: 22068583: /hourlogs/myDir2/1243411200.log => -1 (116)
2009-06-11 07:58:24 E [unify.c:850:unify_open] unify: /hourlogs/myDir2/1243411200.log: entry_count is 3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir2/1243411200.log: found on afr1
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir2/1243411200.log: found on afr3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir2/1243411200.log: found on afr-ns
2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068584: /hourlogs/myDir2/1243411200.log => -1 (5)
2009-06-11 07:58:24 E [fuse-bridge.c:436:fuse_entry_cbk] glusterfs-fuse: 22068599: /hourlogs/myDir3/1243472400.log => -1 (116)
2009-06-11 07:58:24 E [unify.c:850:unify_open] unify: /hourlogs/myDir3/1243472400.log: entry_count is 3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir3/1243472400.log: found on afr1
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir3/1243472400.log: found on afr3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir3/1243472400.log: found on afr-ns
2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068600: /hourlogs/myDir3/1243472400.log => -1 (5)
2009-06-11 07:58:24 E [fuse-bridge.c:436:fuse_entry_cbk] glusterfs-fuse: 22068603: /hourlogs/myDir4/1243404000.log => -1 (116)
2009-06-11 07:58:24 E [unify.c:850:unify_open] unify: /hourlogs/myDir4/1243404000.log: entry_count is 3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir4/1243404000.log: found on afr1
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir4/1243404000.log: found on afr-ns
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir4/1243404000.log: found on afr3
2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068604: /hourlogs/myDir5/1243404000.log => -1 (5)
2009-06-11 07:58:24 E [fuse-bridge.c:436:fuse_entry_cbk] glusterfs-fuse: 22068619: /hourlogs/myDir5/1243447200.log => -1 (116)
2009-06-11 07:58:24 E [unify.c:850:unify_open] unify: /hourlogs/myDir5/1243447200.log: entry_count is 4
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir5/1243447200.log: found on afr1
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir5/1243447200.log: found on afr3
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir5/1243447200.log: found on afr2
2009-06-11 07:58:24 E [unify.c:853:unify_open] unify: /hourlogs/myDir5/1243447200.log: found on afr-ns
2009-06-11 07:58:24 E [fuse-bridge.c:675:fuse_fd_cbk] glusterfs-fuse: 22068620: /hourlogs/myDir5/1243447200.log => -1 (5)

On Jun 11, 2009, at 10:33 AM, Elbert Lai wrote:

> elbert@host1:~$ dpkg -l|grep glusterfs
> ii  glusterfs-client   1.3.8-0pre2   GlusterFS fuse client
> ii  glusterfs-server   1.3.8-0pre2   GlusterFS fuse server
> ii  libglusterfs0      1.3.8-0pre2   GlusterFS libraries and translator modules
>
> I have 2 hosts set up to use AFR with the package versions listed above. I have been experiencing an issue where a file that is copied to glusterfs is readable/writable for a while; then, at some point in time, it ceases to be. Trying to access it only produces the error message "cannot open `filename' for reading: Input/output error".
>
> Files enter glusterfs either via the "cp" command from a client or via "rsync". In the case of cp, the clients are all local and copy across a very fast connection. In the case of rsync, the single client is itself a gluster client. We are testing out a later version of gluster there, and it rsyncs across a VPN.
>
> elbert@host2:~$ dpkg -l|grep glusterfs
> ii  glusterfs-client      2.0.1-1   clustered file-system
> ii  glusterfs-server      2.0.1-1   clustered file-system
> ii  libglusterfs0         2.0.1-1   GlusterFS libraries and translator modules
> ii  libglusterfsclient0   2.0.1-1   GlusterFS client library
>
> =========
> What causes files to become inaccessible? I read that fstat() had a bug in version 1.3.x whereas stat() did not, and that it was being worked on. Could this be related?
>
> When a file becomes inaccessible, I have been manually removing the file from the mount point and then copying it back in via scp. After that, the file becomes accessible again. Below I've pasted a sample of what I'm seeing.
>
>> elbert@tool3.:hourlogs$ cd myDir
>> ls 1244682000.log
>> elbert@tool3.:myDir$ ls 1244682000.log
>> 1244682000.log
>> elbert@tool3.:myDir$ stat 1244682000.log
>>   File: `1244682000.log'
>>   Size: 40265114   Blocks: 78744   IO Block: 4096   regular file
>> Device: 15h/21d   Inode: 42205749   Links: 1
>> Access: (0755/-rwxr-xr-x)   Uid: ( 1003/ elbert)   Gid: ( 6000/ ops)
>> Access: 2009-06-11 02:25:10.000000000 +0000
>> Modify: 2009-06-11 02:26:02.000000000 +0000
>> Change: 2009-06-11 02:26:02.000000000 +0000
>> elbert@tool3.:myDir$ tail 1244682000.log
>> tail: cannot open `1244682000.log' for reading: Input/output error
>
> At this point, I am able to rm the file. Then, if I scp it back in, I am able to successfully tail it.
>
> So: I have observed cases where the files had a Size of 0 but were otherwise in the same state. I'm not totally certain, but it looks like if a file gets into this state from rsync, either it is deposited in this state immediately (before I try to read it), or else it enters this state quickly. Generally speaking, file sizes range from several MB up to 150 MB.
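For what it's worth, the numbers in parentheses at the end of the fuse-bridge lines above are ordinary errno values, so they can be decoded directly. A minimal sketch (Python, nothing GlusterFS-specific):

    import errno
    import os

    # The "=> -1 (N)" suffixes in the client log above are errno values.
    for code in (5, 116):
        print("%d %s %s" % (code, errno.errorcode.get(code, "?"), os.strerror(code)))

    # On Linux this prints:
    #   5 EIO Input/output error
    #   116 ESTALE Stale file handle   (older libcs print "Stale NFS file handle")

So the lookups are failing with ESTALE and the subsequent opens with EIO, which matches the "Input/output error" that tail reports above. The unify lines in between look like the real clue: with "entry_count is 3" (or 4) and the file "found on" several afr subvolumes plus afr-ns, unify appears to have located copies of the same path on more than one AFR pair, and as far as I understand the unify translator, that is exactly the situation in which unify_open gives up with EIO.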
>
> Here's my server config:
> # Gluster Server configuration /etc/glusterfs/glusterfs-server.vol
> # Configured for AFR & Unify features
>
> volume brick
>   type storage/posix
>   option directory /var/gluster/data/
> end-volume
>
> volume brick-ns
>   type storage/posix
>   option directory /var/gluster/ns/
> end-volume
>
> volume server
>   type protocol/server
>   option transport-type tcp/server
>   subvolumes brick brick-ns
>   option auth.ip.brick.allow 165.193.245.*,10.11.*
>   option auth.ip.brick-ns.allow 165.193.245.*,10.11.*
> end-volume
>
> Here's my client config:
> # Gluster Client configuration /etc/glusterfs/glusterfs-client.vol
> # Configured for AFR & Unify features
>
> volume brick1
>   type protocol/client
>   option transport-type tcp/client   # for TCP/IP transport
>   option remote-host 10.11.16.68     # IP address of the remote brick
>   option remote-subvolume brick      # name of the remote volume
> end-volume
>
> volume brick2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.71
>   option remote-subvolume brick
> end-volume
>
> volume brick3
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.69
>   option remote-subvolume brick
> end-volume
>
> volume brick4
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.70
>   option remote-subvolume brick
> end-volume
>
> volume brick5
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.119
>   option remote-subvolume brick
> end-volume
>
> volume brick6
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.120
>   option remote-subvolume brick
> end-volume
>
> volume brick-ns1
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.68
>   option remote-subvolume brick-ns   # Note the different remote volume name.
> end-volume
>
> volume brick-ns2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.11.16.71
>   option remote-subvolume brick-ns   # Note the different remote volume name.
> end-volume
>
> volume afr1
>   type cluster/afr
>   subvolumes brick1 brick2
> end-volume
>
> volume afr2
>   type cluster/afr
>   subvolumes brick3 brick4
> end-volume
>
> volume afr3
>   type cluster/afr
>   subvolumes brick5 brick6
> end-volume
>
> volume afr-ns
>   type cluster/afr
>   subvolumes brick-ns1 brick-ns2
> end-volume
>
> volume unify
>   type cluster/unify
>   subvolumes afr1 afr2 afr3
>   option namespace afr-ns
>
>   # use the ALU scheduler
>   option scheduler alu
>
>   # This option would make brick5 read-only, so that no new files are created on it.
>   ##option alu.read-only-subvolumes brick5##
>
>   # Don't create files on a volume with less than 10% free disk space
>   option alu.limits.min-free-disk 10%
>
>   # Don't create files on a volume with more than 10000 files open
>   option alu.limits.max-open-files 10000
>
>   # When deciding where to place a file, first look at the disk-usage, then at
>   # read-usage, write-usage, open files, and finally the disk-speed-usage.
>   option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
>
>   # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
>   option alu.disk-usage.entry-threshold 2GB
>
>   # Don't stop writing to the least-used volume until the discrepancy is 1988MB
>   option alu.disk-usage.exit-threshold 60MB
>
>   # Kick in if the discrepancy in open files is 1024
>   option alu.open-files-usage.entry-threshold 1024
>
>   # Don't stop until 992 files have been written to the least-used volume
>   option alu.open-files-usage.exit-threshold 32
>
>   # Kick in when the read-usage discrepancy is 20%
>   option alu.read-usage.entry-threshold 20%
>
>   # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
>   option alu.read-usage.exit-threshold 4%
>
>   # Kick in when the write-usage discrepancy is 20%
>   option alu.write-usage.entry-threshold 20%
>
>   # Don't stop until the discrepancy has been reduced to 16%
>   option alu.write-usage.exit-threshold 4%
>
>   # Refresh the statistics used for decision-making every 10 seconds
>   option alu.stat-refresh.interval 10sec
>
>   # Refresh the statistics used for decision-making after creating 10 files
>   # option alu.stat-refresh.num-file-create 10
> end-volume
>
> # writebehind improves write performance a lot
> volume writebehind
>   type performance/write-behind
>   option aggregate-size 131072   # in bytes
>   subvolumes unify
> end-volume
>
> Has anyone seen this issue before? Any suggestions?
>
> Thanks,
> -elb-
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users
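One follow-up idea: since the client log reports the same path on several afr subvolumes, it may be worth checking the backend bricks directly to see how many of them really hold a copy of an affected file. Below is only a rough sketch, not a tested tool: it reuses the brick IPs and the /var/gluster/data export directory from the volfiles above, and it assumes passwordless ssh from wherever you run it.

    import subprocess

    # Backend data bricks, taken from the client volfile above
    # (brick1..brick6 map to these hosts, all exporting /var/gluster/data).
    BRICKS = [
        ("10.11.16.68",  "/var/gluster/data"),
        ("10.11.16.71",  "/var/gluster/data"),
        ("10.11.16.69",  "/var/gluster/data"),
        ("10.11.16.70",  "/var/gluster/data"),
        ("10.11.16.119", "/var/gluster/data"),
        ("10.11.16.120", "/var/gluster/data"),
    ]

    def find_copies(relpath):
        """Return the (host, brickdir) pairs whose backend holds relpath."""
        hits = []
        for host, brickdir in BRICKS:
            full = "%s/%s" % (brickdir, relpath.lstrip("/"))
            # "test -e" exits 0 if the path exists on that brick
            if subprocess.call(["ssh", host, "test", "-e", full]) == 0:
                hits.append((host, brickdir))
        return hits

    if __name__ == "__main__":
        path = "hourlogs/myDir1/1243400400.log"   # one of the paths from the log
        copies = find_copies(path)
        print("%s found on %d brick(s):" % (path, len(copies)))
        for host, brickdir in copies:
            print("  %s:%s" % (host, brickdir))

With this layout a healthy file should turn up on exactly two data bricks (one AFR pair), plus its namespace entry under /var/gluster/ns on 10.11.16.68 and 10.11.16.71. A file reported on four or six data bricks would line up with the "found on afr1/afr2/afr3" messages in the client log.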