Sascha, a few points -

1. do you really want 4 copies of the NS with AFR? I personally think that is overkill; 2 should be sufficient.

2. as you rightly mentioned, it might be the self-heal which is slowing things down. Do you have directories with a LOT of files at the immediate level? The self-heal is being heavily reworked to be more memory- and CPU-efficient and will be completed very soon. If you do have a LOT of files in a directory (not subdirs), then it would help to recreate the NS offline and slip it in with the upgraded glusterfs.

one half-efficient way, on each server:

mkdir /partial-ns-tmp
(cd /data/export/dir ; find . -type d) | (cd /partial-ns-tmp ; xargs mkdir -p)
(cd /data/export/dir ; find . -type f) | (cd /partial-ns-tmp ; xargs touch)

now tar up /partial-ns-tmp on each server and extract the archives over each other on the namespace server. I assume you do not have special fifo and device files; if you do, recreate them just like the directories :) (a slightly more robust sketch of these commands appears at the end of this mail.)

the updated self-heal should handle such cases much better (assuming your problem is LOTS of files in the same dir and/or LOTS of such dirs).

avati

2008/1/8, Sascha Ottolski <ottolski@xxxxxx>:
>
> Hi list,
>
> after some rather depressing, unsuccessful attempts, I'm wondering if someone
> has a hint about what we could do to accomplish the above task on a production
> system: every time we have tried it so far, we had to roll back, since the
> load on the servers and the clients climbed so high that our application
> became unusable.
>
> Our understanding is that the introduction of the namespace is killing us,
> but we did not find a way to get around the problem.
>
> The setup: 4 servers, each with two bricks and a namespace; the bricks are
> on separate raid arrays. The clients do an AFR so that servers 1 and 2
> mirror each other, as do 3 and 4. After that, the four resulting AFRs are
> unified (see config below). The setup is working so far, but not very
> stable (i.e. we see memory leaks on the client side). The upgraded version
> has the four namespaces AFR-ed as well. We have about 20 clients connected
> that only write, and only rarely, and 7 clients that only read, but
> massively (that is, apache webservers serving the images). All machines are
> connected through GB Ethernet.
>
> Maybe the source of the problem is what we store on the cluster: that's
> about 12 million images, adding up to a size of ~300 GB, in a very, very
> nested directory structure. So, lots of relatively small files. And we are
> about to add another 15 million files of even smaller size; they consume
> only 50 GB in total, most files being only 1 or 2 KB in size.
>
> Now, if we start the new gluster with a new, empty namespace, it only takes
> minutes for the load on the servers to reach around 1.5, and on the reading
> clients to jump as high as 200(!). Obviously, no more images get delivered
> to connected browsers. You can imagine that we did not even remotely think
> of adding the load of forcibly rebuilding the namespace on top, so all the
> load seems to be coming from self-heal.
>
> In an earlier attempt with 1.3.2, this picture didn't change much even
> after a forced rebuild of the namespace (which took about 24(!) hours).
> Also, using only one namespace brick and no AFR did help (but it became
> clear that the server with the namespace was much more loaded than the
> others).
>
> So far, we have not found a proper way to simulate the problems on a test
> system, which makes it even harder to find a solution :-(
>
> One idea that comes to mind is: could we somehow prepare the namespace
> bricks on the old-version cluster, to reduce the need for the self-healing
> mechanism after the upgrade?
>
> Thanks for reading this much. I hope I've drawn the picture thoroughly;
> please let me know if anything is missing.
>
>
> Cheers, Sascha
>
>
> server config:
>
> volume fsbrick1
>   type storage/posix
>   option directory /data1
> end-volume
>
> volume fsbrick2
>   type storage/posix
>   option directory /data2
> end-volume
>
> volume nsfsbrick1
>   type storage/posix
>   option directory /data-ns1
> end-volume
>
> volume brick1
>   type performance/io-threads
>   option thread-count 8
>   option queue-limit 1024
>   subvolumes fsbrick1
> end-volume
>
> volume brick2
>   type performance/io-threads
>   option thread-count 8
>   option queue-limit 1024
>   subvolumes fsbrick2
> end-volume
>
> ### Add network serving capability to above bricks.
> volume server
>   type protocol/server
>   option transport-type tcp/server   # For TCP/IP transport
>   option listen-port 6996            # Default is 6996
>   option client-volume-filename /etc/glusterfs/glusterfs-client.vol
>   subvolumes brick1 brick2 nsfsbrick1
>   option auth.ip.brick1.allow *      # Allow access to "brick" volume
>   option auth.ip.brick2.allow *      # Allow access to "brick" volume
>   option auth.ip.nsfsbrick1.allow *  # Allow access to "brick" volume
> end-volume
>
> -----------------------------------------------------------------------
>
> client config:
>
> volume fsc1
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume brick1
> end-volume
>
> volume fsc1r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume brick2
> end-volume
>
> volume fsc2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume brick1
> end-volume
>
> volume fsc2r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume brick2
> end-volume
>
> volume fsc3
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume brick1
> end-volume
>
> volume fsc3r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume brick2
> end-volume
>
> volume fsc4
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume brick1
> end-volume
>
> volume fsc4r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume brick2
> end-volume
>
> volume afr1
>   type cluster/afr
>   subvolumes fsc1 fsc2r
> end-volume
>
> volume afr2
>   type cluster/afr
>   subvolumes fsc2 fsc1r
> end-volume
>
> volume afr3
>   type cluster/afr
>   subvolumes fsc3 fsc4r
> end-volume
>
> volume afr4
>   type cluster/afr
>   subvolumes fsc4 fsc3r
> end-volume
>
> volume ns1
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns3
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns4
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume afrns
>   type cluster/afr
>   subvolumes ns1 ns2 ns3 ns4
> end-volume
>
> volume bricks
>   type cluster/unify
>   subvolumes afr1 afr2 afr3 afr4
>   option namespace afrns
>   option scheduler alu
>   option alu.limits.min-free-disk 5%
>   option alu.limits.max-open-files 10000
>   option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
>   option alu.disk-usage.entry-threshold 2GB
>   option alu.disk-usage.exit-threshold 60MB
>   option alu.open-files-usage.entry-threshold 1024
>   option alu.open-files-usage.exit-threshold 32
> end-volume
>
> volume readahead
>   type performance/read-ahead
>   option page-size 256KB
>   option page-count 2
>   subvolumes bricks
> end-volume
>
> volume write-behind
>   type performance/write-behind
>   option aggregate-size 1MB
>   subvolumes readahead
> end-volume
>
> volume io-cache
>   type performance/io-cache
>   option page-size 128KB
>   option cache-size 64MB
>   subvolumes write-behind
> end-volume
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.
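
A slightly more robust sketch of the namespace pre-creation commands referenced above. It assumes the data bricks live under /data1 and /data2 and the namespace brick under /data-ns1, as in the quoted config; the -print0/-0 flags are only there to guard against filenames with spaces, and the archive/loop names are illustrative:

# Run on each of the four servers (adjust /data1 /data2 to the real export paths):
mkdir -p /partial-ns-tmp
for src in /data1 /data2; do
    # recreate the directory tree as empty directories
    (cd "$src" && find . -type d -print0) | (cd /partial-ns-tmp && xargs -0 mkdir -p)
    # recreate every regular file as a zero-byte placeholder
    (cd "$src" && find . -type f -print0) | (cd /partial-ns-tmp && xargs -0 touch)
done
tar czf /tmp/partial-ns-$(hostname).tar.gz -C /partial-ns-tmp .

# Then collect the archives from all servers on the machine(s) holding the
# namespace brick (/data-ns1 in the quoted config) and extract them over
# each other:
for archive in /tmp/partial-ns-*.tar.gz; do
    tar xzf "$archive" -C /data-ns1
done

As noted above, any fifos or device files would have to be recreated separately (e.g. with mkfifo / mknod), since find -type f only picks up regular files.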