Re: OSD Performance

Hello Kevin,

On Wed, 25 Feb 2015 07:55:34 +0400 Kevin Walker wrote:

> Hi Christian
> 
> We are just looking at options at this stage. 
>
Never a bad thing to do.
 
> Using a hardware RAM disk for the journal is the same concept as the
> SolidFire guys, who are also using XFS (at least they were last time I
> crossed paths with a customer using SolidFire) 

Ah, SolidFire. 
Didn't know that; from looking at what they're doing I was expecting them
to use ZFS (or something totally self-made) and not XFS.
Journals in the Ceph sense and journals in the file system sense perform
similar functions, but it's still a bit of an apples-and-oranges
comparison.

As for SolidFire itself, if you look at their numbers (I consider the
overall compression ratio to be vastly optimistic) you'll see that they
aren't leveraging the full potential of their SSDs either, which is of
course going to be pretty much impossible for anything distributed.

> and from my experiences
> with ZFS, using a RAM based log device is a far safer option than
> enterprise slc ssd's for write log data. 

No argument here; from where I'm standing the only enterprise SSDs fully
worth the name are the Intel DC S3700s.

> But, I am guessing with the
> performance of an SSD being a lot higher than a spindle, the need for a
> separate journal is negated. Each OSD has a journal, so if it fails and
> journal fails with it, it is not such a big problem as it would be with
> ZFS?
> 
Precisely. The protection here comes from the replication, not from the
journal in and of itself.
This 1:1 OSD/journal setup also prevents you from losing multiple OSDs if
a single super-fast journal device fails.
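
For reference, the 1:1 layout is also the path of least resistance
configuration-wise. A minimal ceph.conf sketch (example size; the
by-partlabel path below is made up):

  [osd]
  # journal as a file on the OSD's own data device (the default layout)
  osd journal = /var/lib/ceph/osd/$cluster-$id/journal
  osd journal size = 10240        # MB, example value
  # a shared external journal device would look more like this and ties
  # the fate of every OSD using it together:
  # osd journal = /dev/disk/by-partlabel/journal-$id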

> For the OSD's we are actually thinking of using low cost Samsung 1TB DC
> SSD's, but based on what you are saying even that level of performance
> will be unreachable due to the cpu overhead. 
> 
Which exact Samsung model? 
If you throw enough CPU at few enough OSDs you might get closer to pushing
things over 50%. ^o^

> Does this improve with RDMA?

You (or the people developing this part) tell me.
Things will of course improve overall if this is done right, but it won't
solve the CPU contention, which isn't really related to data movement.
Lower latency from using native IB will be another bonus, but again it
won't make the Ceph code (and other bits) magically more efficient.

> Is anyone on the list using alternative high core count non x86
> architectures  (Tilera/ThunderX)? Would more threads help with this
> problem?
> 
As a gut feeling I'd say no, but somebody correct me if I'm wrong,
especially with the recent sharding improvements/additions.
My impression is that adding more cores hits a point of diminishing
returns fairly quickly, whereas faster cores do not.
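
For anyone who wants to poke at the sharding themselves, the knobs in
question live in ceph.conf and, as I recall, default to something like
this (a sketch, verify against your release before tuning):

  [osd]
  osd op num shards = 5              # shards of the op work queue
  osd op num threads per shard = 2   # worker threads per shard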

> As mentioned at the beginning, we are looking at options, spindles might
> end up being a better option, with an SSD tier, hence my question about
> fragmentation, but the problem for us is power consumption. Having say
> 16 OSD nodes (24 spindles each), plus 3 monitor nodes and 38 xeons
> consuming 100W each is a huge opex bill to factor against ROI. 
> 
See the recent discussions about SSD tiers here on the list.
Quite a few of the things that would make Ceph more attractive or suitable
for you are in the pipeline, but in some cases probably a year or more
away.
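
If you want to experiment with what is already there, an SSD cache tier
gets layered over a slow pool roughly like this (a sketch with made-up
pool names; the hot pool needs a CRUSH rule placing it on the SSD OSDs,
and all the sizing/eviction tuning is omitted):

  ceph osd tier add cold-pool hot-pool
  ceph osd tier cache-mode hot-pool writeback
  ceph osd tier set-overlay cold-pool hot-pool
  ceph osd pool set hot-pool hit_set_type bloom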

At 16 nodes you're probably OK going with servers that dense (one node
going down still kills 24 OSDs at once, and the resulting rebalancing data
storm won't be pretty).
You might find something with 12 HDDs and 2 SSDs easier to balance (CPU
power/RAM to OSDs) and not much more expensive.
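
To put the opex concern in numbers: 38 CPUs x 100W is 3.8kW, which over a
year (~8760 hours) comes to roughly 33,000kWh for the CPUs alone. At an
assumed 0.15 USD/kWh (substitute your local rate) that's in the region of
5,000 USD a year before drives, RAM, network and cooling even enter the
picture, so every socket you can avoid does matter.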

Dedicated monitor nodes are only needed if your other servers have no fast
local storage and/or are totally overloaded. I'd go for 1 or 2 dedicated
monitors (the primary and thus busiest monitor is picked based on the
lowest IP!) and use the money saved to beef up some of the other nodes,
for a total of 5 monitors (which keeps quorum with two of them down).

> We are running VMware vSphere and testing vCloud with OnApp, so are
> expecting we will have to build a couple of nodes to provide FC targets,
> which adds further power consumption. 
> 
Not to mention complication.
I have seen many people talking about iSCSI and other gateway heads for
Ceph (which of course aren't exactly efficient compared to native RBD),
but I can't recall a single "this is how you do it, guaranteed to work
100% in all use cases" solution or guide.

Another group in-house here runs XenServer; they would love to use a Ceph
cluster built by me as cheaper storage than NetApp or 3PAR, but since
XenServer only supports NFS or iSCSI, I don't see that happening any time
soon.

Christian
> 
> Kind regards
> 
> Kevin Walker
> +968 9765 1742
> 
> On 25 Feb 2015, at 04:40, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> On Wed, 25 Feb 2015 02:50:59 +0400 Kevin Walker wrote:
> 
> > Hi Mark
> > 
> > Thanks for the info, 22k is not bad, but still massively below what a
> > pcie ssd can achieve. Care to expand on why the write IOPS are so low?
> 
> Aside from what Mark mentioned in his reply there's also latency to be
> considered in the overall picture.
> 
> But my (and other people's) tests, including Mark's recent PDF posted
> here, clearly indicate where the problem with small write (4k) IOPS
> lies: CPU utilization, mostly by Ceph code (but with significant OS
> time, too).
> 
> To quote myself:
> I did some brief tests with a machine having 8 DC S3700 100GB SSDs as
> OSDs (replica 1) under 0.80.6, and the right (make that wrong) type of
> load (small, 4k I/Os) melted all 8 of the 3.5GHz cores in that box
> while never exceeding 15% utilization of the SSDs.
> 
> Even with further optimizations I predict the CPU(s) will remain the
> limiting factor for small write IOPS.
> So with that in mind, a pure SSD storage node design will have to take
> this into account and spend money where it actually improves things.
> 
> > Was this with a separate RAM disk pcie device or SLC SSD for the
> > journal?
> > 
> > That fragmentation percentage looks good. We are considering using just
> > SSD's for OSD's and RAM disk pcie devices for the Journals so this
> > would be ok.
> For starters, you clearly have too much money.
> You're not going to see a good return on investment, as per what I wrote
> above. Even faster journals are pointless; having the journal on the
> actual OSD SSDs is a non-issue performance-wise and makes things a lot
> more straightforward.
> I could totally see a much more primitive (HDD OSDs, journal SSDs) but
> more balanced and parallelized cluster outperform your design at the same
> cost (but admittedly more space usage). 
> 
> Secondly, why would you even care one iota about file system
> fragmentation when using SSDs for all your storage?
> 
> Regards,
> 
> Christian
> 
> > Kind regards
> > 
> > Kevin Walker
> > +968 9765 1742
> > 
> >> On 25 Feb 2015, at 02:35, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >> 
> >> On 02/24/2015 04:21 PM, Kevin Walker wrote:
> >> Hi All
> >> 
> >> Just recently joined the list and have been reading/learning about
> >> ceph for the past few months. Overall it looks to be well suited to
> >> our cloud platform but I have stumbled across a few worrying items
> >> that hopefully you guys can clarify the status of.
> >> 
> >> Reading through various mailing list archives, it would seem an OSD
> >> caps out at about 3k IOPS. Dieter Kasper from Fujitsu made an
> >> interesting observation about the size of the OSD code (20k+ lines
> >> at that time). Is this being optimized further, and has this IOPS
> >> limit been improved in Giant?
> > 
> > In recent tests under fairly optimal conditions, I'm seeing performance
> > topping out at about 4K object writes/s and 22K object reads/s against
> > an OSD with a very fast PCIe SSD.  There are several reasons writes are
> > slower than reads, but this is something we are working on improving in
> > a variety of ways.
> > 
> > I believe others may have achieved even higher results.
> > 
> >> 
> >> Is there a way to over come the XFS fragmentation problems other users
> >> have experienced?
> > 
> > Setting the newish filestore_xfs_extsize parameter to true appears to
> > help in testing we did a couple months ago.  We filled up a cluster to
> > near capacity (~70%) and then did 12 hours of random writes.  After the
> > test completed, with filestore_xfs_extsize disabled we were seeing
> > something like 13% fragmentation, while with it enabled we were seeing
> > around 0.02% fragmentation.
> > 
> >> 
> >> Kind regards
> >> 
> >> Kevin
> >> 
> >> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



