Re: Impact of fancy striping


A general rule of thumb for separate journal devices is to use 1 SSD for every 4 OSDs.  Since SSDs have no seek penalty, 4 partitions are fine.  Going much above the 1:4 ratio can saturate the SSD.
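As a quick illustration of that rule of thumb, here is a minimal Python sketch. The function name and the ratio parameter are made up for this example, and the 1:4 figure is only the rule of thumb above, not a hard limit.

```python
import math


def journal_ssds_needed(osd_count: int, osds_per_ssd: int = 4) -> int:
    """Number of journal SSDs suggested by the 1-SSD-per-4-OSDs rule of thumb."""
    if osd_count < 0 or osds_per_ssd < 1:
        raise ValueError("osd_count must be >= 0 and osds_per_ssd >= 1")
    # Round up: a partially used SSD is still a whole device.
    return math.ceil(osd_count / osds_per_ssd)


# A node with 9 OSDs, as in this thread.
print(journal_ssds_needed(9))  # 3
```

So a 9-OSD node would want 3 journal SSDs under this rule, rather than one device carrying all 9 journals.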

On your SAS journal device, by creating 9 partitions, you're forcing head seeks for every journal write (assuming all 9 OSDs are writing).  Try using the SAS device with a single partition holding all 9 journals.  That gives you a chance to get sequential IO.  For an anecdote of this effect, check out http://thedailywtf.com/Articles/The-Certified-DBA.aspx.

Even then, I suspect you'll saturate the RAID0'ed SAS devices, as they generally have lower sequential throughput than SSDs.



I assume that you're aware that by using RAID0 for the journals, a single SAS disk failure will take down all 9 OSDs.

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx

Central Desktop. Work together in ways you never thought possible.

On 11/29/13 05:58, nicolasc wrote:
Hi James,

Unfortunately, SSDs are out of budget. Currently there are 2 SAS disks in RAID0 on each node, split into 9 partitions: one for each OSD journal on the node. I benchmarked the RAID0 volumes at around 500MB/s for sustained sequential writes, so that's not bad. Maybe access latency is also an issue?
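For a rough sense of scale, the sketch below divides the measured 500MB/s among the 9 journals. It assumes all 9 OSDs write at once and share the volume evenly, and it ignores seek overhead between partitions, so treat the result as an optimistic upper bound rather than a measurement. The function name is made up for this example.

```python
def per_journal_bandwidth(total_mb_s: float, journal_count: int) -> float:
    """Even share of a journal volume's sequential bandwidth, in MB/s."""
    if journal_count < 1:
        raise ValueError("journal_count must be >= 1")
    return total_mb_s / journal_count


# 500 MB/s benchmarked on the RAID0 volume, 9 OSD journals per node.
share = per_journal_bandwidth(500, 9)
print(f"{share:.1f} MB/s per journal")  # 55.6 MB/s per journal
```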

This journal problem is a bit of wizardry to me; I even had weird intermittent issues with OSDs not starting because the journal was not found, so please do not hesitate to suggest a better journal setup.

I will try to look into this issue of device cache flush. Do you have a tracker link for the bug?

Last question (for everyone): which of the journal config and the striping config has, in your opinion, more influence on my "performance decreases with small blocks" problem?

Best regards,

Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)


On 11/29/2013 02:06 PM, James Pearce wrote:
Did you try moving the journals to separate SSDs?

It was recently discovered that, due to a kernel bug/design, journal writes are translated into device cache flush commands. Thinking about that, I also wonder whether implementing the workaround would improve performance when the journal and OSD are on the same physical drive, since the system is presumably hitting spindle latency on every write at the moment?

On 2013-11-29 12:46, nicolasc wrote:
Hi everyone,

I am currently testing a use-case with large rbd images (several TB),
each containing an XFS filesystem, which I mount on local clients. I
have been testing the write throughput to a single file in the XFS
mount, using "dd oflag=direct", for various block sizes.
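A block-size sweep of this kind can be sketched as below. This only builds and prints the dd command lines, it does not run them. The target path, the use of /dev/zero as input, the 1 GiB total per run, and treating kB/MB/GB as binary units are all assumptions for illustration; only "dd oflag=direct" and the three block sizes come from this thread.

```python
import shlex

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def dd_commands(target, total_bytes, block_sizes):
    """Build one direct-I/O dd command per block size, each writing total_bytes."""
    commands = []
    for bs in block_sizes:
        if bs <= 0 or total_bytes % bs:
            raise ValueError("each block size must evenly divide total_bytes")
        commands.append([
            "dd",
            "if=/dev/zero",          # assumed input source
            f"of={target}",          # hypothetical file on the XFS mount
            f"bs={bs}",              # block size in bytes
            f"count={total_bytes // bs}",
            "oflag=direct",          # bypass the page cache, as in the tests above
        ])
    return commands


for argv in dd_commands("/mnt/rbd-xfs/ddtest", GIB, [4 * KIB, 4 * MIB, GIB]):
    print(shlex.join(argv))
```

Writing the same total amount at every block size keeps the runs comparable, so differences in MB/s come from the block size alone.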

With a default config, the "XFS writes with dd" show very good
performance for 1GB blocks, but it drops to average HDD performance
for 4MB blocks, and to only a few MB/s for 4kB blocks. Changing the
XFS block size did not help, so I tried fancy striping (the max block
size in XFS is 256kB anyway).

First, using 4kB rados objects to store the 4kB stripes was awful,
because rados does not like small objects. Then, I used fancy striping
to store several 4kB stripes into a single 4MB object, but it hardly
improved the performance with 4kB blocks, while drastically degrading
the performance for large blocks.
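To make the two layouts concrete, here is a minimal sketch of how a byte offset in the image maps to an object under striping, following the layout as I understand it from the Ceph documentation (stripe unit, stripe count, object size). The function names are made up, and stripe_count=4 is a hypothetical value, since the count actually used is not stated in this thread.

```python
KIB = 1024
MIB = 1024 * KIB


def locate(offset, stripe_unit, stripe_count, object_size):
    """Map an image byte offset to (object number, offset inside that object)."""
    if object_size % stripe_unit:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit
    # Blocks are dealt round-robin across stripe_count objects.
    stripe_no, stripe_pos = divmod(block_no, stripe_count)
    # Once every object in the set is full, move on to the next object set.
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def objects_touched(offset, length, stripe_unit, stripe_count, object_size):
    """Set of object numbers that a write of `length` bytes at `offset` lands on."""
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return {
        locate(block * stripe_unit, stripe_unit, stripe_count, object_size)[0]
        for block in range(first, last + 1)
    }


# Default layout: 4 MiB stripe unit, stripe count 1, 4 MiB objects.
print(len(objects_touched(0, 4 * MIB, 4 * MIB, 1, 4 * MIB)))  # 1
# Fancy striping: 4 KiB stripes into 4 MiB objects, hypothetical stripe count 4.
print(len(objects_touched(0, 4 * MIB, 4 * KIB, 4, 4 * MIB)))  # 4
```

With the fancy layout, one 4 MiB write is cut into 1024 stripe-unit pieces spread over 4 objects, which might be part of why large-block throughput dropped, although that is a guess rather than something measured here.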

Given my use-case, the block size of writes cannot exceed 4MB. I do
not know a lot of applications that write to disk in 1GB blocks.
Currently, on a 6-node, 54-OSD cluster, with journals on dedicated
SAS disks and a dedicated 10GbE uplink, I am getting performance
equivalent to a basic local disk.

So I am wondering: is it possible to get good performance with XFS
on rbd images, using a reasonable block size?

In case you think the answer is "yes", I would greatly appreciate it
if you could give me a clue about the striping magic involved.

Best regards,

Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
