Re: Ceph RBD latencies

Christian Balzer <chibi@xxxxxxx> · Mon, 7 Mar 2016 14:13:42 +0900

Hello,

On Mon, 7 Mar 2016 00:38:46 +0000 Adrian Saul wrote:

> > >The Samsungs are the 850 2TB
> > > (MZ-75E2T0BW).  Chosen primarily on price.
> >
> > These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5
> > years). Unless you have a read-only cluster, you will wind up spending
> > MORE on replacing them (and/or losing data when 2 fail at the same
> > time) than going with something more sensible like Samsung's DC models
> > or the Intel DC ones (S3610s come to mind for "normal" use).
> > See also the current "List of SSDs" thread in this ML.
> 
> This was a metric I struggled to find and would have been useful in
> comparison.  I am sourcing prices on the SM863s anyway.  That SSD thread
> has been good to follow as well.
> 
Yeah, they are most likely a better fit and if they are doing OK with sync
writes you could most likely get away with having their journals on the
same SSD.
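
For reference, the 0.04 DWPD figure quoted above follows directly from the
rated endurance. A minimal sketch of the arithmetic, assuming 365-day years
and decimal terabytes (the function name is purely illustrative):

```python
def dwpd(tbw_rating_tb: float, capacity_tb: float, years: float) -> float:
    """Drive writes per day implied by a TBW rating over a given period."""
    days = years * 365  # assumption: 365-day years
    return tbw_rating_tb / (capacity_tb * days)


# Samsung 850 EVO 2TB as quoted: 150 TBW, 2 TB capacity, 5 years
evo = dwpd(150, 2, 5)
print(f"{evo:.3f} DWPD")  # prints "0.041 DWPD"
```

The same function can be fed the datasheet numbers of any candidate drive to
compare them on equal terms.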

> > Fast, reliable, cheap. Pick any 2.
> 
> Yup - unfortunately cheap is fixed, reliable is the reason we are doing
> this however fast is now a must have.....  the normal
> engineering/management dilemma.
> 
Indeed.

> > On your test setup or even better the Solaris one, have a look at
> > their media wearout, or  Wear_Leveling_Count as Samsung calls it.
> > I bet that makes for some scary reading.
> 
> For the Evos we found no tools we could use on Solaris - also because we
> have cheap nasty SAS interposers in that setup most tools don't work
> anyway.  Until we pull a disk and put it into a windows box we can't do
> any sort of diagnostics on it.  It would be useful to see because we
> have those disks taking a fair brunt of our performance workload now.
> 
smartmontools aka smartctl not working for you, presumably because of the
intermediate SAS shenanigans?
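
If smartctl can reach the drive at all (behind a SAT-capable bridge
something like `smartctl -d sat -A <device>` is sometimes needed; whether
that gets through those interposers I can't say), the wear attribute can be
pulled from its attribute table. A rough sketch that parses `smartctl -A`
text; the sample line is made up for illustration, not taken from a real
drive:

```python
from typing import Optional


def wear_leveling_value(smartctl_output: str) -> Optional[int]:
    """Return the normalized VALUE column of Wear_Leveling_Count, if present.

    Expects the attribute table printed by `smartctl -A`, whose columns are:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Wear_Leveling_Count":
            return int(fields[3])  # normalized value, typically counts down from 100
    return None


# Illustrative input only (hypothetical numbers)
sample = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "177 Wear_Leveling_Count     0x0013   093   093   000    Pre-fail  Always       -       142\n"
)
print(wear_leveling_value(sample))  # prints 93
```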

> > Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
> > "long" distance replication due to the incurred latencies.
> >
> > That's unless your replication is happening "above" Ceph in the iSCSI
> > bits with something that's more optimized for this.
> >
> > Something along the lines of the DRBD proxy has been suggested for
> > Ceph, but if at all it is a backburner project at best from what I
> > gather.
> 
> We can fairly easily do low latency links (telco) but are looking at the
> architecture to try and limit that sort of long replication - doing
> replication at application and database levels instead.  The site to
> site replication would be limited to some clusters or applications that
> need sync replication for availability.
> 
Yeah, I figured the Telco part, but for our planned DC move I ran
some numbers and definitely want to stay below 10km between them
(Infiniband here).

Note that you can of course create CRUSH rules that will give you either
location replicated or only locally replicated OSDs and thus pools, but it
may be a bit daunting at first.

> > There are some ways around this, which may or may not be suitable for
> > your use case.
> > EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> > Of course this comes at a performance penalty, which you can offset
> > again with for example fast RAID controllers with HW cache to some
> > extent. But it may well turn out to be a zero sum game.
> 
> I modelled an EC setup but that was at a multi site level with local
> cache tiers in front, and it was going to be too big a challenge to do
> as a new untested platform with too many latency questions.  Within a
> site EC was not going to be cost effective as to do properly I would need
> to up the number of hosts and that pushed the pricing up too far, even
> if I went with smaller less configured hosts.
> 
Yes, the per-node basic cost can be an issue, but Ceph really likes many
smallish things over few largish ones for the same size.

> I thought about hardware RAID as well, but as I would need to do host
> level redundancy anyway it was not gaining any efficiency - less risk
> but I would still need to replicate anyway so why not just go disk to
> disk.  More than likely I would quietly work in higher protection as we
> go live and deal with it later as a capacity expansion.
> 
The latter sounds like a plan.
For the former consider this simple example:
4 storage nodes, each with 4 RAID6 OSDs, Ceph size=2 and min_size=1,
mon_osd_down_out_subtree_limit = host.

In this scenario you can lose any 2 disks w/o an OSD going down, up to 4
disks w/o data loss and a whole node as well w/o the cluster stopping.
The mon_osd_down_out_subtree_limit will also stop things from rebalancing
in case of a node crash/reboot, until you decide otherwise manually.
The idea here is that it's likely a lot quicker to get a node back up than
to reshuffle all that data.

With normal, size 2 replication and single disk OSDs, any
simultaneous/overlapping loss of 2 disks is going to lose you data,
potentially affecting many if not ALL of your VM images.
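
To put rough numbers on the difference, here is a back-of-the-envelope
sketch. It assumes the worst case, where the failing OSDs hold replicas of
the same placement groups, and it ignores rebuild windows and unrecoverable
read errors; the function is purely illustrative:

```python
def min_disk_losses_for_data_loss(disks_tolerated_per_osd: int, size: int) -> int:
    """Smallest number of overlapping disk failures that can lose data.

    Every one of the `size` replicas must be destroyed, and each OSD only
    dies once it loses one disk more than its RAID level tolerates.
    """
    return (disks_tolerated_per_osd + 1) * size


print(min_disk_losses_for_data_loss(0, 2))  # single-disk OSDs, size=2: prints 2
print(min_disk_losses_for_data_loss(2, 2))  # RAID6 OSDs, size=2: prints 6
```

Anything below that count is survivable in this model, which is consistent
with the 4 disks mentioned for the RAID6 example.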

There has been a lot of discussion about reliability with various
replication levels in this ML.

> > Another thing is to use a cache pool (with top of the line SSDs), this
> > is of course only a sensible course of action if your hot objects will
> > fit in there. In my case they do (about 10-20% of the 2.4TB raw pool
> > capacity) and everything is as fast as can be expected and the VMs
> > (their time critical/sensitive application to be precise) are happy
> > campers.
> 
> This is the model I am working to - our "fast" workloads using SSD
> caches  in front of bulk SATA, sizing the SSDs at around 25% of the
> capacity we require for "fast" storage.
> 
> For the "bulk" storage I would still use the SSD cache but sized to 10%
> of the SATA usable capacity.   I figure once we get live we can adjust
> numbers as required - expand with more cache hosts if needed.
>
Correct, of course the more you know about your clients, the better you
can plan ahead.

> > There's a counter in Ceph (counter-filestore_journal_bytes) that you
> > can graph for journal usage.
> > The highest I have ever seen is about 100MB for HDD based OSDs, less
> > than 8MB for SSD based ones with default(ish) Ceph parameters.
> >
> > Since you seem to have experience with ZFS (I don't really, but I read
> > a lot ^o^), consider the Ceph journal equivalent to the ZIL.
> > It is a write only journal, it never gets read from unless there is a
> > crash. That is why sequential, sync write speed is the most important
> > criterion for a Ceph journal device.
> >
> > If I recall correctly you were testing with 4MB block streams, thus
> > pretty much filling the pipe to capacity, atop on your storage nodes
> > will give a good insight.
> >
> > The journal is great to cover some bursts, but the Ceph OSD is
> > flushing things from RAM to the backing storage on configurable time
> > limits and once these are exceeded and/or you run out of RAM (pagecache),
> > you are limited to what your backing storage can sustain.
> >
> > Now in real life, you would want a cluster and especially OSDs that
> > are lightly to medium loaded on average and in that case a spike won't
> > result in a significant rise of latency.
> >
> > > > Have you tried the HDD based pool and did you see similar,
> > > > consistent interval, spikes?
> > >
> > > To be honest I have been focusing on the SSD numbers but that would
> > > be a good comparison.
> > >
> > > > Or alternatively, configured 2 of your NVMEs as OSDs?
> > >
> > > That was what I was thinking of doing - move the NVMEs to the
> > > frontends, make them OSDs and configure them as a read-forward cache
> > > tier for the other pools, and just have the SSDs and SATA journal by
> > > default on a first partition.
> > >
> > Madness lies down that path, also not what I meant.
> > For quick testing, leave the NVMEs right where they are, destroy your
> > SSD pool and create one with the 2 NVMEs per node as individual OSDs.
> > Test against that.
> 
> Have that in place now - was also a fun exercise in ceph management to
> dynamically reconfigure and rebuild 12 OSDs and then put the flash OSDs
> into their own crush root.  That is in play now but the numbers are
> really not what I expected.  I am going to work on it some more before I
> call anything.
> 
I'd be interested in the results, this _should_ work like a charm.

> >
> > A read forward cache tier is exactly the opposite of what you want,
> > you want your writes to be fast and hit the fastest game in town (your
> > NVMEs preferably) and thus want writeback mode.
> > Infernalis, or even better waiting for Jewel will help to keep the
> > cache as hot and unpolluted as possible with working recency
> > configurations for promotions.
> > But if anyhow possible, keep your base pools sufficiently fast as
> > well, so they can serve cache misses (promotions) or cache flushes
> > adequately. Keep in mind that a promotion or flush will (on average
> > for RBD objects) result in 4MB reads and writes.
> 
> My understanding was read forward would take the writes, but send the
> reads on to the backend.  Probably not a lot to save those reads but I
> wanted to ensure I was not hiding real read performance with a flash
> pool that was larger than the workload I am dealing with.
> 

Sorry, my bad. My brain auto-corrected "readforward" to "forward", since
this proxy mode isn't documented anywhere but in the PR to document it. ^o^
http://tracker.ceph.com/issues/14153

While I verified just now on my test cluster that it indeed works as
advertised and is a very useful cache mode for specific setups (which
would include mine to some extent), I'm rather leery of turning this on
with my production cluster.
Basically non-documented (thus by extension non-supported) and rather new,
with very little actual usage in the wild I presume.

Anybody using this in production with Hammer, feel free to pipe up with
experiences.

> 
> >
> > In your case the SSDs are totally unsuitable to hold journals and will
> > both perform miserably and wear out even faster.
> > And HDDs really benefit from SSD journals, especially when it comes to
> > IOPS.
> 
> For now I have left the SATAs with their journals on flash anyway - we
> are already pricing based on that config anyway.
> 
Nods.

> > I also recall your NVMEs being in a RAID1, presumably so that a
> > failure won't take out all your OSDs.
> > While understandable, it is also quite wasteful.
> > For starters you need to be able to sustain a node loss, so "half" a
> > node loss if a NVME fails must be within the capability of your
> > cluster. This is why most people suggest starting with about 10
> > storage nodes for production clusters, of course budget permitting
> > (none of mine is that size yet).
> >
> > By using the NVMEs individually, you improve performance and lower
> > their write usage.
> > Specifically, those 400GB P3700 can write about 1000MB/s, which is
> > half your network speed and will only saturate about 10 of your 36
> > HDDs. And with Intel P3700s, you really don't have to worry about
> > endurance to boot.
> 
> Thanks.  The consideration was I didn't want to lose 36 or 18 OSDs due
> to a journal failure, so if we lost a card we could do a controlled
> replacement without totally rebuilding the OSDs (as they are PCI-e it's
> a host outage anyway).
> 
A sensible consideration, but it comes at a cost.
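
The cost can be sketched with the numbers quoted above. Keep in mind that a
RAID1 pair behaves like a single card for writes, since every journal write
lands on both. The per-HDD figure of 100MB/s is an assumption, implied by
the "1000MB/s saturates about 10 HDDs" estimate rather than measured:

```python
def hdds_saturated(journal_write_mb_s: float, hdd_write_mb_s: float = 100.0) -> int:
    """How many HDD OSDs one journal device can feed at full streaming speed."""
    return int(journal_write_mb_s // hdd_write_mb_s)


P3700_MB_S = 1000  # approximate sequential write speed of the 400GB P3700, as quoted
HDDS_PER_NODE = 36

# Independent (non-mirrored) cards, each carrying an equal share of journals
for cards in (1, 2, 3):
    journals_per_card = HDDS_PER_NODE // cards
    fed = min(HDDS_PER_NODE, cards * hdds_saturated(P3700_MB_S))
    print(f"{cards} card(s): {journals_per_card} journals each, ~{fed} HDDs at full speed")
```

Real workloads are rarely pure streaming writes, so treat this as an upper
bound on how much the journals can become the bottleneck.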

> We could maybe look to see if we can put 3 cards in and do 12 per
> journal, and just take the hit should we lose a single journal.
> 
Ultimately you will need to plan for the case that a node is completely
lost, at least for a sufficiently long duration that running w/o
replication would be far too high a risk.
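
One way to express that in capacity terms: after a node is lost, the
survivors have to absorb its data and still stay below the nearfull
threshold. A rough sketch assuming equally sized nodes, even data
distribution and a nearfull ratio of 0.85 (an assumption, adjust it to your
own setting):

```python
def max_safe_utilization(nodes: int, nearfull_ratio: float = 0.85) -> float:
    """Highest average fill level that still survives losing one node."""
    if nodes < 2:
        raise ValueError("need at least 2 nodes to survive a node loss")
    # Data of the lost node is spread over the remaining (nodes - 1) nodes.
    return nearfull_ratio * (nodes - 1) / nodes


for n in (3, 4, 10):
    print(f"{n} nodes: keep utilization below {max_safe_utilization(n):.1%}")
```

The fewer nodes there are, the more raw capacity has to be held back, which
is another argument for more, smaller nodes.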

> I would prefer to do a more distributed host setup however the compute
> cost pushes the overall pricing up significantly even if we use smaller
> less configured hosts, hence why I am going for more of a higher density
> model.
> 

This resembles in many ways my approach to our first production cluster,
which was basically just 2 nodes with 2 RAID6 OSDs each.
If not for bad HDDs (Toshiba DT, see various threads here where I mention
them) and a year-long delay in getting the requested expansion to a 3rd
node, it was and would have been sufficient.
Now of course there is a fast, large SSD cache on top of it to cover up
the underlying rot, at least until the HDDs can be replaced.

However this cluster only supports 2 types of otherwise identical VMs, with
very well known and predictable IO patterns (nearly 98% writes only).

Though when configuring Ceph for a broad variety of uses, classic
deployments scale a lot better.

Christian

> Very much appreciate your insight and advice.
> 
> Cheers,
>  Adrian
> 
> 
> 
> >
> >
> > Regards,
> >
> > Christian
> > > > No, not really. The journal can only buffer so much.
> > > > There are several threads about this in the archives.
> > > >
> > > > You could tune it but that will only go so far if your backing
> > > > storage can't keep up.
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > >
> > >
> > > Agreed - Thanks for your help.
> > > Confidentiality: This email and any attachments are confidential and
> > > may be subject to copyright, legal or some other professional
> > > privilege. They are intended solely for the attention and use of the
> > > named addressee(s). They may only be copied, distributed or
> > > disclosed with the consent of the copyright owner. If you have
> > > received this email by mistake or by breach of the confidentiality
> > > clause, please notify the sender immediately by return email and
> > > delete or destroy all copies of the email. Any confidentiality,
> > > privilege or copyright is not waived or lost because this email has
> > > been sent to you by mistake.

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com