Re: OSD Performance

Kevin Walker <kwalker@xxxxxxxxxxxxxxxxx> · Wed, 25 Feb 2015 07:55:34 +0400

Hi Christian

We are just looking at options at this stage. 

Using a hardware RAM disk for the journal is the same concept the SolidFire guys use, and they are also using XFS (at least they were the last time I crossed paths with a customer using SolidFire). From my experience with ZFS, a RAM-based log device is a far safer option than enterprise SLC SSDs for write log data. But I am guessing that, with the performance of an SSD being a lot higher than a spindle, the need for a separate journal is negated. Each OSD has its own journal, so if an OSD fails and its journal fails with it, is that not such a big problem as it would be with ZFS?

For the OSDs we are actually thinking of using low-cost Samsung 1TB DC SSDs, but based on what you are saying, even that level of performance will be unreachable due to the CPU overhead.

Does this improve with RDMA?
Is anyone on the list using alternative high core count, non-x86 architectures (Tilera/ThunderX)?
Would more threads help with this problem?

As mentioned at the beginning, we are looking at options. Spindles with an SSD tier might end up being the better option, hence my question about fragmentation, but the problem for us is power consumption. Having, say, 16 OSD nodes (24 spindles each) plus 3 monitor nodes, with 38 Xeons consuming 100W each, is a huge opex bill to factor against ROI.
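
To put a rough number on that, here is a minimal sketch of the arithmetic. It uses only the figures above (38 Xeons at 100W each) and assumes the CPUs run flat out all year. It covers CPU draw only; disks, fans, PSU losses and cooling are not included, and the function names are just for illustration.

```python
# CPU count and per-CPU wattage are the figures quoted above.
# Assumption: constant draw, 24 hours a day, 365 days a year.
XEON_COUNT = 38
WATTS_PER_XEON = 100
HOURS_PER_YEAR = 24 * 365


def cpu_power_watts(cpus=XEON_COUNT, watts_each=WATTS_PER_XEON):
    """Total continuous CPU draw in watts."""
    return cpus * watts_each


def annual_energy_kwh(watts, hours=HOURS_PER_YEAR):
    """Energy used in a year at a constant draw, in kWh."""
    return watts * hours / 1000


total_w = cpu_power_watts()
print(f"{total_w} W, {annual_energy_kwh(total_w):.0f} kWh/year")
# prints "3800 W, 33288 kWh/year"
```

Multiply the kWh figure by your own electricity tariff to get the CPU share of the opex.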

We are running VMware vSphere and testing vCloud with OnApp, so we expect to have to build a couple of nodes to provide FC targets, which adds further power consumption.

Kind regards

Kevin Walker
+968 9765 1742

On 25 Feb 2015, at 04:40, Christian Balzer <chibi@xxxxxxx> wrote:

On Wed, 25 Feb 2015 02:50:59 +0400 Kevin Walker wrote:

> Hi Mark
> 
> Thanks for the info, 22k is not bad, but still massively below what a
> pcie ssd can achieve. Care to expand on why the write IOPS are so low?

Aside from what Mark mentioned in his reply there's also latency to be
considered in the overall picture.

But my tests (and other people's, including Mark's recent PDF posted here)
clearly indicate where the problem with small (4k) write IOPS lies: CPU
utilization, mostly by Ceph code (but with significant OS time, too).

To quote myself:
I did some brief tests with a machine having 8 DC S3700 100GB for OSDs
(replica 1) under 0.80.6 and the right (make that wrong) type of load
(small, 4k I/Os) did melt all of the 8 3.5GHz cores in that box.
While never exceeding 15% utilization of the SSDs.

Even with further optimizations I predict the CPU(s) to remain the limiting
factor for small write IOPS. 
So with that in mind, a pure SSD storage node design will have to consider
that and spend money where it actually improves things.

> Was this with a separate RAM disk PCIe device or SLC SSD for the journal?
> 
> That fragmentation percentage looks good. We are considering using just
> SSDs for OSDs and RAM disk PCIe devices for the journals, so this would
> be ok.
For starters, you clearly have too much money.
You're not going to see a good return on investment, as per what I wrote
above. Even faster journals are pointless; having the journal on the
actual OSD SSDs is a non-issue performance-wise and makes things a lot
more straightforward. 
I could totally see a much more primitive (HDD OSDs, journal SSDs) but
more balanced and parallelized cluster outperform your design at the same
cost (but admittedly more space usage). 

Secondly, why would you even care one iota about file system fragmentation
when using SSDs for all your storage?

Regards,

Christian

> Kind regards
> 
> Kevin Walker
> +968 9765 1742
> 
>> On 25 Feb 2015, at 02:35, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> 
>> On 02/24/2015 04:21 PM, Kevin Walker wrote:
>> Hi All
>> 
>> Just recently joined the list and have been reading/learning about ceph
>> for the past few months. Overall it looks to be well suited to our
>> cloud platform but I have stumbled across a few worrying items that
>> hopefully you guys can clarify the status of.
>> 
>> Reading through various mailing list archives, it would seem an OSD
>> caps out at about 3k IOPS. Dieter Kasper from Fujitsu made an
>> interesting observation about the size of the OSD code (20k plus lines
>> at that time), is this being optimized further and has this IOPS limit
>> been improved in Giant?
> 
> In recent tests under fairly optimal conditions, I'm seeing performance
> topping out at about 4K object writes/s and 22K object reads/s against
> an OSD with a very fast PCIe SSD.  There are several reasons writes are
> slower than reads, but this is something we are working on improving in
> a variety of ways.
> 
> I believe others may have achieved even higher results.
> 
>> 
>> Is there a way to overcome the XFS fragmentation problems other users
>> have experienced?
> 
> Setting the newish filestore_xfs_extsize parameter to true appears to
> help in testing we did a couple months ago.  We filled up a cluster to
> near capacity (~70%) and then did 12 hours of random writes.  After the
> test completed, with filestore_xfs_extsize disabled we were seeing
> something like 13% fragmentation, while with it enabled we were seeing
> around 0.02% fragmentation.
> 
>> 
>> Kind regards
>> 
>> Kevin
>> 
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx       Global OnLine Japan/Fusion Communications
http://www.gol.com/