Sascha, a few points -

1. do you really want 4 copies of the NS with AFR? I personally think that is overkill; 2 should be sufficient.

2. as you rightly mentioned, it might be the self-heal which is slowing things down. Do you have directories with a LOT of files at the immediate level? The self-heal is being heavily reworked to be more memory- and CPU-efficient and will be completed very soon. If you do have a LOT of files in a directory (not subdirs), then it would help to recreate the NS offline and slip it in with the upgraded glusterfs.

one half-efficient way, on each server:

mkdir /partial-ns-tmp
(cd /data/export/dir ; find . -type d) | (cd /partial-ns-tmp ; xargs mkdir -p)
(cd /data/export/dir ; find . -type f) | (cd /partial-ns-tmp ; xargs touch)

now tar up /partial-ns-tmp on each server and extract the archives over each other on the namespace server. I assume you do not have special fifo and device files; if you do, recreate them just like the directories :) (a slightly more robust sketch of these commands appears at the end of this mail.)

the updated self-heal should handle such cases much better (assuming your problem is LOTS of files in the same dir and/or LOTS of such dirs).

avati

2008/1/8, Sascha Ottolski <ottolski@xxxxxx>:
>
> Hi list,
>
> after some rather depressing, unsuccessful attempts, I'm wondering if someone
> has a hint about what we could do to accomplish the above task on a production
> system: every time we have tried it so far, we had to roll back, since the
> load on the servers and the clients climbed so high that our application
> became unusable.
>
> Our understanding is that the introduction of the namespace is killing us,
> but we did not find a way to get around the problem.
>
> The setup: 4 servers, each with two bricks and a namespace; the bricks are
> on separate raid arrays. The clients do an AFR so that servers 1 and 2
> mirror each other, as do 3 and 4. After that, the four resulting AFRs are
> unified (see config below). The setup is working so far, but not very
> stable (i.e. we see memory leaks on the client side). The upgraded version
> has the four namespaces AFR-ed as well. We have about 20 clients connected
> that only write, and only rarely, and 7 clients that only read, but
> massively (that is, apache webservers serving the images). All machines are
> connected through GB Ethernet.
>
> Maybe the source of the problem is what we store on the cluster: that's
> about 12 million images, adding up to a size of ~300 GB, in a very, very
> nested directory structure. So, lots of relatively small files. And we are
> about to add another 15 million files of even smaller size; they consume
> only 50 GB in total, most files being only 1 or 2 KB in size.
>
> Now, if we start the new gluster with a new, empty namespace, it only takes
> minutes for the load on the servers to reach around 1.5, and on the reading
> clients to jump as high as 200(!). Obviously, no more images get delivered
> to connected browsers. You can imagine that we did not even remotely think
> of adding the load of forcibly rebuilding the namespace on top, so all the
> load seems to be coming from self-heal.
>
> In an earlier attempt with 1.3.2, this picture didn't change much even
> after a forced rebuild of the namespace (which took about 24(!) hours).
> Also, using only one namespace brick and no AFR did help (but it became
> clear that the server with the namespace was much more loaded than the
> others).
>
> So far, we have not found a proper way to simulate the problems on a test
> system, which makes it even harder to find a solution :-(
>
> One idea that comes to mind is: could we somehow prepare the namespace
> bricks on the old-version cluster, to reduce the need for the self-healing
> mechanism after the upgrade?
>
> Thanks for reading this much. I hope I've drawn the picture thoroughly;
> please let me know if anything is missing.
>
>
> Cheers, Sascha
>
>
> server config:
>
> volume fsbrick1
>   type storage/posix
>   option directory /data1
> end-volume
>
> volume fsbrick2
>   type storage/posix
>   option directory /data2
> end-volume
>
> volume nsfsbrick1
>   type storage/posix
>   option directory /data-ns1
> end-volume
>
> volume brick1
>   type performance/io-threads
>   option thread-count 8
>   option queue-limit 1024
>   subvolumes fsbrick1
> end-volume
>
> volume brick2
>   type performance/io-threads
>   option thread-count 8
>   option queue-limit 1024
>   subvolumes fsbrick2
> end-volume
>
> ### Add network serving capability to above bricks.
> volume server
>   type protocol/server
>   option transport-type tcp/server   # For TCP/IP transport
>   option listen-port 6996            # Default is 6996
>   option client-volume-filename /etc/glusterfs/glusterfs-client.vol
>   subvolumes brick1 brick2 nsfsbrick1
>   option auth.ip.brick1.allow *      # Allow access to "brick" volume
>   option auth.ip.brick2.allow *      # Allow access to "brick" volume
>   option auth.ip.nsfsbrick1.allow *  # Allow access to "brick" volume
> end-volume
>
> -----------------------------------------------------------------------
>
> client config:
>
> volume fsc1
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume brick1
> end-volume
>
> volume fsc1r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume brick2
> end-volume
>
> volume fsc2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume brick1
> end-volume
>
> volume fsc2r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume brick2
> end-volume
>
> volume fsc3
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume brick1
> end-volume
>
> volume fsc3r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume brick2
> end-volume
>
> volume fsc4
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume brick1
> end-volume
>
> volume fsc4r
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume brick2
> end-volume
>
> volume afr1
>   type cluster/afr
>   subvolumes fsc1 fsc2r
> end-volume
>
> volume afr2
>   type cluster/afr
>   subvolumes fsc2 fsc1r
> end-volume
>
> volume afr3
>   type cluster/afr
>   subvolumes fsc3 fsc4r
> end-volume
>
> volume afr4
>   type cluster/afr
>   subvolumes fsc4 fsc3r
> end-volume
>
> volume ns1
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.95
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns2
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.96
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns3
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.97
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume ns4
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host 10.10.1.98
>   option remote-subvolume nsfsbrick1
> end-volume
>
> volume afrns
>   type cluster/afr
>   subvolumes ns1 ns2 ns3 ns4
> end-volume
>
> volume bricks
>   type cluster/unify
>   subvolumes afr1 afr2 afr3 afr4
>   option namespace afrns
>   option scheduler alu
>   option alu.limits.min-free-disk 5%
>   option alu.limits.max-open-files 10000
>   option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
>   option alu.disk-usage.entry-threshold 2GB
>   option alu.disk-usage.exit-threshold 60MB
>   option alu.open-files-usage.entry-threshold 1024
>   option alu.open-files-usage.exit-threshold 32
> end-volume
>
> volume readahead
>   type performance/read-ahead
>   option page-size 256KB
>   option page-count 2
>   subvolumes bricks
> end-volume
>
> volume write-behind
>   type performance/write-behind
>   option aggregate-size 1MB
>   subvolumes readahead
> end-volume
>
> volume io-cache
>   type performance/io-cache
>   option page-size 128KB
>   option cache-size 64MB
>   subvolumes write-behind
> end-volume
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.
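
A slightly more robust sketch of the namespace pre-creation commands referenced above. It assumes the data bricks live under /data1 and /data2 and the namespace brick under /data-ns1, as in the quoted config; the -print0/-0 flags are only there to guard against filenames with spaces, and the archive/loop names are illustrative:

# Run on each of the four servers (adjust /data1 /data2 to the real export paths):
mkdir -p /partial-ns-tmp
for src in /data1 /data2; do
    # recreate the directory tree as empty directories
    (cd "$src" && find . -type d -print0) | (cd /partial-ns-tmp && xargs -0 mkdir -p)
    # recreate every regular file as a zero-byte placeholder
    (cd "$src" && find . -type f -print0) | (cd /partial-ns-tmp && xargs -0 touch)
done
tar czf /tmp/partial-ns-$(hostname).tar.gz -C /partial-ns-tmp .

# Then collect the archives from all servers on the machine(s) holding the
# namespace brick (/data-ns1 in the quoted config) and extract them over
# each other:
for archive in /tmp/partial-ns-*.tar.gz; do
    tar xzf "$archive" -C /data-ns1
done

As noted above, any fifos or device files would have to be recreated separately (e.g. with mkfifo / mknod), since find -type f only picks up regular files.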