Re: Optimizing write performance to a few large files in a small cluster

Carlos:

The RAM disks are 1 GB each, which is far more than I need in my case. It's the very small writes/reads that seem to be really costly for performance with gluster. The storage bricks have 64 GB of RAM each. Unfortunately, I haven't figured out how to export the RAM disks and still be able to export the gluster volumes via NFS, so the RAM disks currently reside on another server. Using an SSD for scratch wouldn't let me sleep well at night.. :P
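In case it helps anyone following along, a rough sketch of what the scratch-server setup looks like (paths, size and export options below are just examples, and the separate host is needed presumably because gluster's built-in NFS server and the kernel NFS server don't get along on the same machine):

  # On the separate scratch server:
  mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk        # 1 GB RAM disk
  echo '/mnt/ramdisk *(rw,no_root_squash,async)' >> /etc/exports
  exportfs -ra                                        # export it via kernel NFS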

Yes, rhs stands for Red Hat Storage. Quoted from the Administration Guide regarding rhs-high-throughput:

The profile performs the following:
* Increases read ahead to 64MB
* Changes I/O scheduler to deadline
* Disables power-saving mode
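For anyone not running Red Hat Storage, my understanding is that the profile boils down to roughly the following per-brick settings (a sketch only; the device name sdb is a placeholder and the exact mechanism tuned uses may differ):

  echo 65536 > /sys/block/sdb/queue/read_ahead_kb     # 64 MB read-ahead
  echo deadline > /sys/block/sdb/queue/scheduler      # deadline I/O scheduler
  cpupower frequency-set -g performance               # disable CPU power saving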

Thanks for the tip regarding the buffers. I tried it and ran some timings with real-world applications (tools for transient analysis), but didn't see any noticeable difference in my case. Actually, I don't believe the storage solution is the bottleneck in this setup at the moment. Which is great news, especially since I was thinking about throwing it out a window a few weeks back. :D

Regards,
Robin

On 11 Mar 2014, at 11:17 am, Carlos Capriotti <capriotti.carlos@xxxxxxxxx> wrote:

Robin, would you mind elaborating a bit more on a few details? This sounds VERY promising.

I know I could spend hours on the internet, googling my way, but that way I'd be losing about 50% of the objective info.

About the RAM disks: how big are those, and more to the point, how much memory do you have, so we can work out a ratio here? I feel that this RAM disk is a VERY impactful possibility here (positive), and since I am gathering all the hints and hacks I can for a VMware-based environment, THAT one would most likely be a winner! I would like to test it next week. Maybe using an SSD instead? (Not that I have one to spare, of course.)

Now, regarding 

'tuned-adm profile; tuned-adm profile rhs-high-throughput' on all storage bricks

I cannot remember seeing those options anywhere. Where did you find them? Does "rhs" stand for Red Hat Storage?


One last comment that might help: when mounting NFS, tweak the buffers and async operations with this:

-o rw,async,vers=3,rsize=65536,wsize=65536
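For example, a full client-side mount would look something like this (server name, volume and mount point are placeholders):

  mount -t nfs -o rw,async,vers=3,rsize=65536,wsize=65536 gluster1:/myvolume /mnt/gluster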

KR,

Carlos




On Tue, Mar 11, 2014 at 10:37 AM, Robin Jonsson <Robin.Jonsson@xxxxxxx> wrote:
Alexander:

I have also experienced the stalls you are describing. This was in a 2-brick setup running replicated volumes, used by a 20-node HPC cluster.

In my case this was solved by: 

* Replace FUSE with NFS
  * This is by far the biggest booster
* RAM disks for the scratch directories (not connected to gluster at all)
  * If you're not sure where these directories are, run 'gluster volume top <volume> write list-cnt 10'
* 'tuned-adm profile; tuned-adm profile rhs-high-throughput' on all storage bricks
* The following volume options (see the sketch after this list):
  * cluster.nufa: enable
  * performance.quick-read: on
  * performance.open-behind: on
* Mount option on clients: noatime
  * Use only where access time isn't needed.
  * Major booster for small file writes in my case, even with the FUSE client.
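A sketch of how the above is applied (the volume name "scratch", server names and mount paths are just placeholders):

  # Volume options, run on any brick:
  gluster volume set scratch cluster.nufa enable
  gluster volume set scratch performance.quick-read on
  gluster volume set scratch performance.open-behind on

  # Client-side NFS mount with noatime:
  mount -t nfs -o vers=3,noatime brick1:/scratch /mnt/scratch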

Hope this helps, 

Regards,
Robin


On 10 Mar 2014, at 19:06, Alexander Valys <avalys@xxxxxxxxxx> wrote:

A quick performance question.

I have a small cluster of 4 machines, 64 cores in total.  I am running a scientific simulation on them, which writes at between 0.1 and 10 MB/s (total) to roughly 64 HDF5 files.  Each HDF5 file is written by only one process.  The writes are not continuous, but consist of writing roughly 1 MB of data to each file every few seconds.    

Writing to HDF5 involves a lot of reading the file metadata and random seeking within the file, since we are actually writing to about 30 datasets inside each file. I am hosting the output on a distributed gluster volume (one brick local to each machine) to provide a unified namespace for the (very rare) case when each process needs to read the others' files.

I am seeing somewhat lower performance than I expected, i.e. a factor of approximately 4 less throughput than each node writing locally to the bare drives.  I expected the write-behind cache to buffer each write, but it seems that the writes are being quickly flushed across the network regardless of what write-behind cache size I use (32 MB currently), and the simulation stalls while waiting for the I/O operation to finish.  Anyone have any suggestions as to what to look at?  I am using gluster 3.4.2 on ubuntu 12.04.  I have flush-behind turned on, and have mounted the volume with direct-io-mode=disable, and have the cache size set to 256M.  
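For reference, the settings described above correspond roughly to the following (the volume and host names here are placeholders, not my actual ones):

  gluster volume set simdata performance.write-behind on
  gluster volume set simdata performance.write-behind-window-size 32MB
  gluster volume set simdata performance.flush-behind on
  gluster volume set simdata performance.cache-size 256MB

  # FUSE mount with direct I/O disabled:
  mount -t glusterfs -o direct-io-mode=disable node1:/simdata /mnt/simdata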

The nodes are connected via a dedicated gigabit ethernet network, carrying only gluster traffic (no simulation traffic).

(sorry if this message comes through twice, I sent it yesterday but was not subscribed)



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
