elbert@host1:~$ dpkg -l | grep glusterfs
ii  glusterfs-client   1.3.8-0pre2   GlusterFS fuse client
ii  glusterfs-server   1.3.8-0pre2   GlusterFS fuse server
ii  libglusterfs0      1.3.8-0pre2   GlusterFS libraries and translator modules

I have two hosts set up to use AFR with the package versions listed
above. I have been experiencing an issue where a file that is copied to
GlusterFS is readable/writable for a while, then at some point in time
ceases to be. Trying to access it only produces the error message
"cannot open `filename' for reading: Input/output error".

Files enter GlusterFS either via "cp" from a client or via "rsync". In
the cp case, the clients are all local and copy across a very fast
connection. In the rsync case, the one sending host is itself a
GlusterFS client; we are testing out a later version of GlusterFS on
it, and it rsyncs across a VPN:

elbert@host2:~$ dpkg -l | grep glusterfs
ii  glusterfs-client       2.0.1-1   clustered file-system
ii  glusterfs-server       2.0.1-1   clustered file-system
ii  libglusterfs0          2.0.1-1   GlusterFS libraries and translator modules
ii  libglusterfsclient0    2.0.1-1   GlusterFS client library

=========

What causes files to become inaccessible? I read that fstat() had a bug
in version 1.3.x whereas stat() did not, and that it was being worked
on. Could this be related?

When a file becomes inaccessible, I have been manually removing it from
the mount point and then copying it back in via scp, after which it
becomes accessible again. Below I've pasted a sample of what I'm seeing:

> elbert@tool3.:hourlogs$ cd myDir
> elbert@tool3.:myDir$ ls
> 1244682000.log
> elbert@tool3.:myDir$ ls 1244682000.log
> 1244682000.log
> elbert@tool3.:myDir$ stat 1244682000.log
>   File: `1244682000.log'
>   Size: 40265114   Blocks: 78744   IO Block: 4096   regular file
> Device: 15h/21d    Inode: 42205749    Links: 1
> Access: (0755/-rwxr-xr-x)  Uid: ( 1003/  elbert)   Gid: ( 6000/   ops)
> Access: 2009-06-11 02:25:10.000000000 +0000
> Modify: 2009-06-11 02:26:02.000000000 +0000
> Change: 2009-06-11 02:26:02.000000000 +0000
> elbert@tool3.:myDir$ tail 1244682000.log
> tail: cannot open `1244682000.log' for reading: Input/output error

At this point I am able to rm the file, and if I scp it back in I am
able to successfully tail it. I have also observed cases where the file
had a Size of 0 but was otherwise in the same state. I'm not totally
certain, but it looks like when a file gets into this state via rsync,
it is either deposited in this state immediately (before I try to read
it) or it enters this state very quickly. Speaking generally, file
sizes tend to be several MB up to 150 MB.
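For reference, the manual recovery I described boils down to something
like this (the mount point /mnt/glusterfs and the source of the good
copy are just examples, not my exact paths):

    # on a gluster client; /mnt/glusterfs is an assumed mount point
    cd /mnt/glusterfs/hourlogs/myDir
    rm 1244682000.log          # remove the unreadable copy

    # scp a good copy back in from wherever one still lives
    # (host and path here are illustrative)
    scp backuphost:/backup/hourlogs/myDir/1244682000.log .

    tail 1244682000.log        # now succeeds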
Here's my server config:

# Gluster Server configuration /etc/glusterfs/glusterfs-server.vol
# Configured for AFR & Unify features

volume brick
  type storage/posix
  option directory /var/gluster/data/
end-volume

volume brick-ns
  type storage/posix
  option directory /var/gluster/ns/
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick brick-ns
  option auth.ip.brick.allow 165.193.245.*,10.11.*
  option auth.ip.brick-ns.allow 165.193.245.*,10.11.*
end-volume

Here's my client config:

# Gluster Client configuration /etc/glusterfs/glusterfs-client.vol
# Configured for AFR & Unify features

volume brick1
  type protocol/client
  option transport-type tcp/client   # for TCP/IP transport
  option remote-host 10.11.16.68     # IP address of the remote brick
  option remote-subvolume brick      # name of the remote volume
end-volume

volume brick2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.71
  option remote-subvolume brick
end-volume

volume brick3
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.69
  option remote-subvolume brick
end-volume

volume brick4
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.70
  option remote-subvolume brick
end-volume

volume brick5
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.119
  option remote-subvolume brick
end-volume

volume brick6
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.120
  option remote-subvolume brick
end-volume

volume brick-ns1
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.68
  option remote-subvolume brick-ns   # Note the different remote volume name.
end-volume

volume brick-ns2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.11.16.71
  option remote-subvolume brick-ns   # Note the different remote volume name.
end-volume

volume afr1
  type cluster/afr
  subvolumes brick1 brick2
end-volume

volume afr2
  type cluster/afr
  subvolumes brick3 brick4
end-volume

volume afr3
  type cluster/afr
  subvolumes brick5 brick6
end-volume

volume afr-ns
  type cluster/afr
  subvolumes brick-ns1 brick-ns2
end-volume

volume unify
  type cluster/unify
  subvolumes afr1 afr2 afr3
  option namespace afr-ns

  # use the ALU scheduler
  option scheduler alu

  # This option would make brick5 read-only, so that no new files get
  # created on it.
  ## option alu.read-only-subvolumes brick5

  # Don't create files on a volume with less than 10% free disk space
  option alu.limits.min-free-disk 10%

  # Don't create files on a volume with more than 10000 open files
  option alu.limits.max-open-files 10000

  # When deciding where to place a file, first look at disk-usage, then at
  # read-usage, write-usage, open-files-usage, and finally disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage

  # Kick in if the discrepancy in disk usage between volumes exceeds 2GB
  option alu.disk-usage.entry-threshold 2GB

  # Don't stop writing to the least-used volume until the discrepancy
  # is down to 1988MB (2GB - 60MB)
  option alu.disk-usage.exit-threshold 60MB

  # Kick in if the discrepancy in open files reaches 1024
  option alu.open-files-usage.entry-threshold 1024

  # Don't stop until 992 files have been written to the least-used volume
  option alu.open-files-usage.exit-threshold 32

  # Kick in when the read-usage discrepancy reaches 20%
  option alu.read-usage.entry-threshold 20%

  # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
  option alu.read-usage.exit-threshold 4%

  # Kick in when the write-usage discrepancy reaches 20%
  option alu.write-usage.entry-threshold 20%

  # Don't stop until the discrepancy has been reduced to 16%
  option alu.write-usage.exit-threshold 4%

  # Refresh the statistics used for decision-making every 10 seconds
  option alu.stat-refresh.interval 10sec

  # Refresh the statistics used for decision-making after creating 10 files
  # option alu.stat-refresh.num-file-create 10
end-volume

# write-behind improves write performance a lot
volume writebehind
  type performance/write-behind
  option aggregate-size 131072   # in bytes
  subvolumes unify
end-volume
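In case it matters, the clients mount this volfile in the usual way,
roughly like so (the mount point is just an example):

    # mount the top-level volume on a client
    glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs

    # sanity-check the mount
    df -h /mnt/glusterfs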
Has anyone seen this issue before? Any suggestions?

Thanks,

-elb-