sorry, forgot that one.  the command used was:

    rsync -av --stats --progress --delete

i haven't tried setting a bwlimit yet, and i'd prefer not to if i can
avoid it.  i've got roughly 450GB of data to sync over, and the faster
i can do it, the better.  i will try it anyway, just to see whether it
makes any difference (rough sketch in the p.s. at the very bottom of
this mail).

the network is all copper gig, with both interfaces trunked and vlan'd
on both client and server.

a couple of other things that just came to mind: i didn't see this
exact behavior during the initial rsync.  i have three directories i'm
trying to sync; when they run concurrently, i see the problem, but when
run one at a time, each sync seems to complete without incident.  the
only difference in the command for that initial run was that i omitted
the --delete flag.

~Shawn

On Tue, 2007-04-03 at 11:07 +0530, Krishna Srinivas wrote:
> Hi Shawn,
>
> Can you give us the exact rsync command you used?
>
> Thanks
> Krishna
>
> On 4/3/07, Shawn Northart <shawn@xxxxxxxxxxxxxxxxxx> wrote:
> > I'm noticing a problem with our test setup with regard to
> > (reasonably) heavy read/write usage.
> > the problem we're having is that during an rsync of content, the
> > sync bails because the mount is lost, with the following errors:
> >
> > <snip>
> > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trailers" failed:
> > Transport endpoint is not connected (107)
> > rsync: recv_generator: mkdir
> > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed: Transport
> > endpoint is not connected (107)
> > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed:
> > Transport endpoint is not connected (107)
> > rsync: recv_generator: mkdir
> > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux" failed:
> > Transport endpoint is not connected (107)
> > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux"
> > failed: Transport endpoint is not connected (107)
> > rsync: recv_generator: mkdir
> > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images" failed:
> > Transport endpoint is not connected (107)
> > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images"
> > failed: Transport endpoint is not connected (107)
> > rsync: recv_generator: mkdir
> > "/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers" failed:
> > Transport endpoint is not connected (107)
> > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers"
> > failed: Transport endpoint is not connected (107)
> > </snip>
> >
> > normal logging shows nothing on either the client or server side,
> > but logging in DEBUG mode shows the following at the end of the
> > client log, right as it breaks:
> >
> > <snip>
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:213/gf_print_trace()]
> > debug-backtrace:Got signal (11), printing backtrace
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(gf_print_trace+0x1f) [0x2a9556030f]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/lib64/tls/libc.so.6 [0x35b992e2b0]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/lib64/tls/libpthread.so.0(__pthread_mutex_destroy+0)
> > [0x35ba807ab0]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/cluster/afr.so [0x2a958b840c]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b06c2]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b3196]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(epoll_iteration+0xf8) [0x2a955616f8]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:[glusterfs] [0x4031b7]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:/lib64/tls/libc.so.6(__libc_start_main+0xdb)
> > [0x35b991c3fb]
> > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > debug-backtrace:[glusterfs] [0x402bba]
> > </snip>
> >
> > the server log shows the following at the time it breaks:
> >
> > <snip>
> > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > [Apr 02 15:30:09]
> > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > header failed
> > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > protocol/server:cleaned up xl_private of 0x510470
> > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > tcp/server:destroying transport object for 192.168.0.96:1012 (fd=8)
> > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > [Apr 02 15:30:09]
> > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > header failed
> > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > protocol/server:cleaned up xl_private of 0x510160
> > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > tcp/server:destroying transport object for 192.168.0.96:1013 (fd=7)
> > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > [Apr 02 15:30:09]
> > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > header failed
> > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > protocol/server:cleaned up xl_private of 0x502300
> > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > tcp/server:destroying transport object for 192.168.0.96:1014 (fd=4)
> > </snip>
> >
> > we're using 4 bricks in this setup and, for the moment, just one
> > client (we'd like to scale to somewhere between 20-30 clients and
> > 4-8 server bricks).  the same behavior is observed with or without
> > any combination of the performance translators, and with or without
> > file replication.  the alu, random, and round-robin schedulers were
> > all used in our testing.
> > the systems in question are running CentOS 4.4.  these logs are
> > from our 64-bit systems, but we have seen exactly the same thing on
> > the 32-bit ones as well.
> > glusterfs looks like it could be a good fit for some of the
> > high-traffic domains we host, but unless we can resolve this issue,
> > we'll have to continue using NFS.
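
just to be concrete about the performance translators mentioned above:
in those runs we layered extra volumes over the "bricks" unify volume
at the end of the client spec quoted below, roughly along these lines.
this is a trimmed-down sketch from memory rather than the exact spec we
used, and the volume names here are placeholders:

    # placeholder names; the real spec also carried tuning options
    volume wb
      type performance/write-behind
      subvolumes bricks
    end-volume

    volume ra
      type performance/read-ahead
      subvolumes wb
    end-volume

as noted, we see the same disconnects with and without these in place.
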
> >
> >
> > our current server-side (brick) config consists of the following:
> >
> > ##-- begin server config
> > volume vol1
> >   type storage/posix
> >   option directory /vol/vol1/gfs
> > end-volume
> >
> > volume vol2
> >   type storage/posix
> >   option directory /vol/vol2/gfs
> > end-volume
> >
> > volume vol3
> >   type storage/posix
> >   option directory /vol/vol3/gfs
> > end-volume
> >
> > volume brick1
> >   type performance/io-threads
> >   option thread-count 8
> >   subvolumes vol1
> > end-volume
> >
> > volume brick2
> >   type performance/io-threads
> >   option thread-count 8
> >   subvolumes vol2
> > end-volume
> >
> > volume brick3
> >   type performance/io-threads
> >   option thread-count 8
> >   subvolumes vol3
> > end-volume
> >
> > volume server
> >   type protocol/server
> >   option transport-type tcp/server
> >   option bind-address 10.88.188.91
> >   subvolumes brick1 brick2 brick3
> >   option auth.ip.brick1.allow 192.168.0.*
> >   option auth.ip.brick2.allow 192.168.0.*
> >   option auth.ip.brick3.allow 192.168.0.*
> > end-volume
> > ##-- end server config
> >
> >
> > our client config is as follows:
> >
> > ##-- begin client config
> > volume test00.1
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.91
> >   option remote-subvolume brick1
> > end-volume
> >
> > volume test00.2
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.91
> >   option remote-subvolume brick2
> > end-volume
> >
> > volume test00.3
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.91
> >   option remote-subvolume brick3
> > end-volume
> >
> > volume test01.1
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.92
> >   option remote-subvolume brick1
> > end-volume
> >
> > volume test01.2
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.92
> >   option remote-subvolume brick2
> > end-volume
> >
> > volume test01.3
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.92
> >   option remote-subvolume brick3
> > end-volume
> >
> > volume test02.1
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.93
> >   option remote-subvolume brick1
> > end-volume
> >
> > volume test02.2
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.93
> >   option remote-subvolume brick2
> > end-volume
> >
> > volume test02.3
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.93
> >   option remote-subvolume brick3
> > end-volume
> >
> > volume test03.1
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.94
> >   option remote-subvolume brick1
> > end-volume
> >
> > volume test03.2
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.94
> >   option remote-subvolume brick2
> > end-volume
> >
> > volume test03.3
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.0.94
> >   option remote-subvolume brick3
> > end-volume
> >
> > volume afr0
> >   type cluster/afr
> >   subvolumes test00.1 test01.2 test02.3
> >   option replicate *.html:3,*.db:1,*:3
> > end-volume
> >
> > volume afr1
> >   type cluster/afr
> >   subvolumes test01.1 test02.2 test03.3
> >   option replicate *.html:3,*.db:1,*:3
> > end-volume
> >
> > volume afr2
> >   type cluster/afr
> >   subvolumes test02.1 test03.2 test00.3
> >   option replicate *.html:3,*.db:1,*:3
> > end-volume
> >
> > volume afr3
> >   type cluster/afr
> >   subvolumes test03.1 test00.2 test01.3
> >   option replicate *.html:3,*.db:1,*:3
> > end-volume
> >
> > volume bricks
> >   type cluster/unify
> >   subvolumes afr0 afr1 afr2 afr3
> >   option readdir-force-success on
> >
> >   option scheduler alu
> >   option alu.limits.min-free-disk 60GB
> >   option alu.limits.max-open-files 10000
> >
> >   option alu.order disk-usage:read-usage:open-files-usage:write-usage:disk-speed-usage
> >
> >   option alu.disk-usage.entry-threshold 2GB
> >   option alu.disk-usage.exit-threshold 60MB
> >   option alu.open-files-usage.entry-threshold 1024
> >   option alu.open-files-usage.exit-threshold 32
> >   option alu.stat-refresh.interval 10sec
> >
> >   option alu.read-usage.entry-threshold 20%
> >   option alu.read-usage.exit-threshold 4%
> >   option alu.write-usage.entry-threshold 20%
> >   option alu.write-usage.exit-threshold 4%
> > end-volume
> > ##-- end client config
> >
> >
> > ~Shawn
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
>
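
p.s. for the record, the bwlimit test mentioned at the top will look
roughly like this -- the source paths and the 20000 KB/s cap are just
placeholders/first guesses, not our real layout:

    # run the three trees one at a time instead of concurrently,
    # and cap rsync's bandwidth (--bwlimit is in KBytes per second);
    # "siteN" and /src stand in for our real directories
    for dir in site1 site2 site3; do
        rsync -av --stats --progress --delete --bwlimit=20000 \
            /src/$dir/ /vol/vol0/sites/$dir/
    done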