ceph osd crush tunables optimal AND add new OSD at the same time

I'd like to see some way to cap recovery IOPS per OSD: don't allow
backfill to do more than, say, 50 operations per second.  It would slow
backfill down, but reserve plenty of IOPS for normal operation.  I know
that implementing this well is not a simple task.
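
Purely to illustrate what I'm asking for (no such option exists today; the
name below is made up):

  ; hypothetical knob, for illustration only -- not a real Ceph option
  osd recovery op rate limit = 50    ; cap recovery/backfill at 50 ops/sec per OSD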


I know I did some stupid things that caused a lot of my problems.  Most of
my problems can be traced back to
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
and the kernel memory allocation problems it caused.

Reformatting all of the disks fixed a lot of my issues, but it didn't fix
them all.
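
For what it's worth, something much closer to the mkfs defaults avoids that;
a rough sketch, assuming the -n size=64k directory block size was the main
culprit (the exact inode size is a matter of taste):

  osd mkfs options xfs = -f -i size=2048
  osd mount options xfs = rw,noatime,inode64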




While I was reformatting my secondary cluster, I tested the stability by
reformatting all of the disks on the last node at once.  I didn't mark them
out and wait for the rebuild; I removed the OSDs, reformatted, and added
them back to the cluster.  It was 10 disks out of 36 total, in a 4-node
cluster (I'm waiting for hardware to free up to build the 5th node).
Everything was fine for the first hour or so.  After several hours, there
was enough latency that the HTTP load balancer was marking RadosGW nodes
down.  My load balancer has a 30s timeout.  Since the latency was cluster
wide, all RadosGW nodes were marked down together.  When the latency spike
subsided, they'd all get marked up again.  This continued until the
backfill completed.  They were mostly up.  I don't have numbers, but I
think they were marked down about 5 times an hour, for less than a minute
each time.  That really messes with radosgw-agent.
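
The per-disk cycle was roughly the usual remove/reformat/re-add dance; a
sketch of the sort of thing I mean (IDs and devices are placeholders, the init
command varies by distro, and ceph-disk prepare re-runs mkfs with whatever
osd mkfs options are in ceph.conf):

  # tear down the old OSD without marking it out first
  service ceph stop osd.10
  ceph osd crush remove osd.10
  ceph auth del osd.10
  ceph osd rm 10

  # recreate it on the freshly wiped disk, journal on an SSD partition
  ceph-disk prepare /dev/sdX /dev/ssdY1
  ceph-disk activate /dev/sdX1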


I had recovery tuned down:
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
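
These can also be pushed to running OSDs without a restart, something along
these lines (same options, just injected at runtime):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'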

I have journals on SSD, and single GigE public and cluster networks.  This
cluster has 2x replication (I'm waiting for the 5th node to go to 3x).  The
cluster network was pushing 950 Mbps.  The SSDs and OSDs had plenty of
write bandwidth, but the HDDs were saturating their IOPs.  These are
consumer class 7200 RPM SATA disks, so they don't have very many IOPS.

The average write latency on these OSDs is normally ~10ms.  While this
backfill was going on, the average write latency was 100ms, with plenty of
times when the latency was 200ms.  The average read latency increased too, but
not as badly: it averaged 50ms, with occasional spikes up to 400ms.  Since I
reformatted 27% of my cluster, I was seeing higher latency on 55% of my
OSDs (readers and writers).

Instead, if I trickle in the disks, everything works fine.  I was able to
reformat 2 OSDs at a time without a problem.  The cluster latency increase
was barely noticeable, even though the IOPS on those two disks were
saturated.  A bit of latency here and there (5% of the time) doesn't hurt
much.  When it's 55% of the time, it hurts a lot more.
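
That approach is easy to script: redo a couple of OSDs, wait for the cluster
to settle, repeat.  A rough sketch (redo_osd is a hypothetical wrapper around
the remove/reformat/re-add steps above):

  for pair in "0 1" "2 3" "4 5"; do
      for id in $pair; do redo_osd "$id"; done
      # wait for backfill to finish before touching the next pair
      until ceph health | grep -q HEALTH_OK; do sleep 60; done
  done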


When I finally get the 5th node, and increase replication from 2x to 3x, I
expect this cluster to be unusable for about a week.
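
The size bump itself is just a per-pool setting; something like this for each
pool (the radosgw data pool here is only an example).  The week of pain is the
backfill that follows, not the commands:

  ceph osd pool set .rgw.buckets size 3
  ceph osd pool set .rgw.buckets min_size 2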


On Thu, Jul 17, 2014 at 9:02 AM, Andrei Mikhailovsky <andrei at arhont.com>
wrote:

> Comments inline
>
>
> ------------------------------
> *From: *"Sage Weil" <sweil at redhat.com>
> *To: *"Quenten Grasso" <qgrasso at onq.com.au>
> *Cc: *ceph-users at lists.ceph.com
> *Sent: *Thursday, 17 July, 2014 4:44:45 PM
>
> *Subject: *Re: [ceph-users] ceph osd crush tunables optimal AND add new
> OSD at the same time
>
> On Thu, 17 Jul 2014, Quenten Grasso wrote:
>
> > Hi Sage & List
> >
> > I understand this is probably a hard question to answer.
> >
> > I mentioned previously that our cluster has co-located MONs on OSD
> > servers, which are R515s w/ 1 x AMD 6-core processor & 11 x 3TB OSDs w/
> > dual 10GbE.
> >
> > When our cluster is doing these busy operations and IO has stopped, as
> > in my case when running/setting tunables to optimal or heavy recovery
> > operations (mentioned earlier), is there a way to ensure our IO doesn't
> > get completely blocked/stopped/frozen in our VMs?
> >
> > Could it be as simple as putting all 3 of our mon servers on bare metal
> > w/ SSDs? (I recall reading somewhere that a mon disk was doing several
> > thousand IOPS during a recovery operation.)
> >
> > I assume putting just one on bare metal won't help, because our mons
> > will only ever be as fast as our slowest mon server?
>
> I don't think this is related to where the mons are (most likely).  The
> big question for me is whether IO is getting completely blocked, or just
> slowed enough that the VMs are all timing out.
>
>
> AM: I was looking at the cluster status while the rebalancing was taking
> place and I was seeing very little client IO reported by ceph -s output.
> The numbers were around 20-100 whereas our typical IO for the cluster is
> around 1000.  Having said that, this was not enough, as _all_ of our VMs
> became unresponsive and didn't recover after the rebalancing finished.
>
>
> What slow request messages
> did you see during the rebalance?
>
> AM: As I was experimenting with different options while trying to get
> some client IO back, I noticed that when I limited the options to 1
> per OSD (osd max backfills = 1, osd recovery max active = 1, osd
> recovery threads = 1), I did not have any slow or blocked requests at
> all.  Increasing these values did occasionally produce some blocked
> requests, but they were quickly cleared.
>
>
> What were the op latencies?
>
> AM: In general, the latencies were around 5-10 times higher than normal
> cluster ops.  The second column of "ceph osd perf" output was around 50,
> whereas it is typically between 3 and 10.  It did occasionally jump to some
> crazy numbers like 2000-3000 on several OSDs, but only for 5-10 seconds.
>
> It's
> possible there is a bug here, but it's also possible the cluster is just
> operating close enough to capacity that the additional rebalancing work
> pushes it into a place where it can't keep up and the IO latencies are
> too high.
>
>
> AM: My cluster in particular is under-utilised the majority of the time.
> I do not typically see OSDs more than 20-30% utilised, and our SSD journals
> are usually less than 10% utilised.
>
>
> Or that we just have more work to do prioritizing requests..
> but it's hard to say without more info.
>
> sage

