Hello,

On Thu, 3 Mar 2016 23:26:13 +0000 Adrian Saul wrote:

> > > Samsung EVO...
> > Which exact model, I presume this is not a DC one?
> >
> > If you had put your journals on those, you would already be pulling
> > your hairs out due to abysmal performance.
> >
> > Also with Evo ones, I'd be worried about endurance.
>
> No, I am using the P3700DCs for journals.

Yup, that's why I wrote "If you had...". ^o^

> The Samsungs are the 850 2TB (MZ-75E2T0BW). Chosen primarily on price.

These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5
years).
Unless you have a read-only cluster, you will wind up spending MORE on
replacing them (and/or losing data when 2 fail at the same time) than
going with something more sensible like Samsung's DC models or the
Intel DC ones (S3610s come to mind for "normal" use).
See also the current "List of SSDs" thread in this ML.

> We already built a system using the 1TB models with Solaris+ZFS and I
> have little faith in them. Certainly their write performance is
> erratic and not ideal. We have other vendor options which are what
> they call "Enterprise Value" SSDs, but still 4x the price. I would
> prefer a higher grade drive but unfortunately cost is being driven
> from above me.

Fast, reliable, cheap. Pick any 2.

On your test setup, or even better the Solaris one, have a look at
their media wearout, or Wear_Leveling_Count as Samsung calls it.
I bet that makes for some scary reading.

> > > On the ceph side each disk in the OSD servers are setup as an
> > > individual OSD, with a 12G journal created on the flash mirror. I
> > > setup the SSD servers into one root, and the SATA servers into
> > > another and created pools using hosts as fault boundaries, with
> > > the pools set for 2 copies.
> >
> > Risky. If you have very reliable and well monitored SSDs you can get
> > away with 2 (I do so), but with HDDs and the combination of their
> > reliability and recovery time it's asking for trouble.
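For anyone who wants to sanity-check that endurance figure, the DWPD arithmetic is trivial; a minimal sketch, using the 150TBW and 2TB figures from the spec sheet and the 5-year warranty window:

```python
# Endurance math for the Samsung 850 EVO 2TB (MZ-75E2T0BW), using the
# figures quoted above: 150 TBW rating, 2 TB capacity, 5 year warranty.
tbw = 150                 # rated terabytes written
capacity_tb = 2           # drive capacity in TB
years = 5                 # warranty period

days = years * 365
dwpd = tbw / capacity_tb / days   # drive writes per day

print(f"DWPD: {dwpd:.3f}")        # ~0.041, the "amazingly low 0.04"
```

Compare that with the 3+ DWPD of the DC drives mentioned above and the gap is obvious.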
> > I realize that this is a testbed, but if your production has a
> > replication of 3 you will be disappointed by the additional latency.
>
> Again, cost - the end goal will be we build metro based dual site
> pools which will be 2+2 replication.

Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
"long" distance replication due to the incurred latencies.
That's unless your replication is happening "above" Ceph in the iSCSI
bits with something that's more optimized for this.
Something along the lines of the DRBD proxy has been suggested for
Ceph, but if at all it is a backburner project at best from what I
gather.

> I am aware of the risks but already presenting numbers based on
> buying 4x the disk we are able to use gets questioned hard.

There are some ways around this, which may or may not be suitable for
your use case.

EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
Of course this comes at a performance penalty, which you can offset
again with for example fast RAID controllers with HW cache to some
extent. But it may well turn out to be a zero-sum game.

Another thing is to use a cache pool (with top of the line SSDs); this
is of course only a sensible course of action if your hot objects will
fit in there.
In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
everything is as fast as can be expected and the VMs (their time
critical/sensitive applications to be precise) are happy campers.

> > This smells like garbage collection on your SSDs, especially since
> > it matches time wise what you saw on them below.
>
> I concur. I am just not sure why that impacts back to the client when
> from the client perspective the journal should hide this. If the
> journal is struggling to keep up and has to flush constantly then
> perhaps, but on the current steady state IO rate I am testing with I
> don't think the journal should be that saturated.
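To put numbers on the "hot objects should fit" rule of thumb above, here is the sizing arithmetic as a sketch; the 2.4TB figure is my own pool, the 10-20% hot-data range is an observed workload property and not a universal constant:

```python
# Back-of-the-envelope cache pool sizing: the cache tier needs to hold
# the hot working set, here assumed at 10-20% of raw pool capacity.
raw_capacity_gb = 2400                            # 2.4 TB raw pool
hot_low, hot_high = 0.10, 0.20                    # observed hot fraction

cache_min = raw_capacity_gb * hot_low             # 240 GB
cache_max = raw_capacity_gb * hot_high            # 480 GB
print(f"cache pool needs roughly {cache_min:.0f}-{cache_max:.0f} GB")
```

If your working set is larger than what top-of-the-line SSDs can affordably hold, constant promotions/flushes will eat the benefit.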
There's a counter in Ceph (filestore_journal_bytes) that you can graph
for journal usage. The highest I have ever seen is about 100MB for HDD
based OSDs, less than 8MB for SSD based ones with default(ish) Ceph
parameters.

Since you seem to have experience with ZFS (I don't really, but I read
a lot ^o^), consider the Ceph journal equivalent to the ZIL.
It is a write-only journal, it never gets read from unless there is a
crash.
That is why sequential, sync write speed is the utmost criterion for a
Ceph journal device.

If I recall correctly you were testing with 4MB block streams, thus
pretty much filling the pipe to capacity; atop on your storage nodes
will give a good insight.
The journal is great to cover some bursts, but the Ceph OSD is
flushing things from RAM to the backing storage on configurable time
limits and once these are exceeded and/or you run out of RAM
(pagecache), you are limited to what your backing storage can sustain.
Now in real life, you would want a cluster and especially OSDs that
are lightly to medium loaded on average and in that case a spike won't
result in a significant rise of latency.

> > Have you tried the HDD based pool and did you see similar,
> > consistent interval, spikes?
>
> To be honest I have been focusing on the SSD numbers but that would
> be a good comparison.
>
> > Or alternatively, configured 2 of your NVMEs as OSDs?
>
> That was what I was thinking of doing - move the NVMEs to the
> frontends, make them OSDs and configure them as a read-forward cache
> tier for the other pools, and just have the SSDs and SATA journal by
> default on a first partition.

Madness lies down that path, and it is also not what I meant.
For quick testing, leave the NVMEs right where they are, destroy your
SSD pool and create one with the 2 NVMEs per node as individual OSDs.
Test against that.
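If you want to watch that journal counter without a full graphing setup, you can pull it from the OSD admin socket ("ceph daemon osd.N perf dump") and pick it out with a few lines of Python. A sketch; the exact counter path (here filestore/journal_queue_bytes) varies between Ceph versions, and the sample dump below is made up for illustration:

```python
import json

# In real life, feed this the JSON printed by:
#   ceph daemon osd.N perf dump
# The "filestore"/"journal_queue_bytes" path is an assumption; check
# your version's perf dump output for the actual counter name.
sample_dump = '{"filestore": {"journal_queue_bytes": 7340032}}'

perf = json.loads(sample_dump)
journal_bytes = perf["filestore"]["journal_queue_bytes"]
print(f"journal queue: {journal_bytes / 1024 / 1024:.1f} MB")  # 7.0 MB
```

Polling that every few seconds during your 4MB stream test would show directly whether the journal is anywhere near saturation.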
A read-forward cache tier is exactly the opposite of what you want:
you want your writes to be fast and hit the fastest game in town (your
NVMEs preferably) and thus want writeback mode.
Infernalis, or even better waiting for Jewel, will help to keep the
cache as hot and unpolluted as possible with working recency
configurations for promotions.
But if at all possible, keep your base pools sufficiently fast as
well, so they can serve cache misses (promotions) or cache flushes
adequately. Keep in mind that a promotion or flush will (on average
for RBD objects) result in 4MB reads and writes.

In your case the SSDs are totally unsuitable to hold journals and will
both perform miserably and wear out even faster.
And HDDs really benefit from SSD journals, especially when it comes to
IOPS.

I also recall your NVMEs being in a RAID1, presumably so that a
failure won't take out all your OSDs.
While understandable, it is also quite wasteful.
For starters you need to be able to sustain a node loss, so "half" a
node loss if an NVMe fails must be within the capability of your
cluster.
This is why most people suggest starting with about 10 storage nodes
for production clusters, budget permitting of course (none of mine is
that size yet).
By using the NVMEs individually, you improve performance and lower
their write usage.
Specifically, those 400GB P3700s can write about 1000MB/s, which is
half your network speed and will only saturate about 10 of your 36
HDDs.
And with Intel P3700s, you really don't have to worry about endurance
to boot.

Regards,

Christian

> > No, not really. The journal can only buffer so much.
> > There are several threads about this in the archives.
> >
> > You could tune it but that will only go so far if your backing
> > storage can't keep up.
> >
> > Regards,
> >
> > Christian
>
> Agreed - Thanks for your help.
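The bandwidth claim above is easy to verify yourself; a sketch, where the ~100MB/s sustained sequential write per 7200rpm SATA HDD and the 2000MB/s usable network figure are my assumptions, while the 1000MB/s P3700 speed and 36-HDD count come from the thread:

```python
# Quick check: how much of the network and how many HDD journals can
# one 400GB P3700 keep busy?
p3700_write_mb_s = 1000      # quoted sequential write of the 400GB P3700
network_mb_s = 2000          # assumed usable network bandwidth per node
hdd_write_mb_s = 100         # assumed sustained write per SATA HDD
hdd_count = 36

print(p3700_write_mb_s / network_mb_s)            # 0.5 -> half the network
busy = p3700_write_mb_s // hdd_write_mb_s
print(f"one P3700 saturates about {busy} of {hdd_count} HDDs")
```

Which is why one P3700 journal per handful of HDDs, rather than a RAID1 pair for everything, is the better use of the hardware.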
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/