Re: 1256 OSD/21 server ceph cluster performance issues.

Hello Sean,

On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote:

> Hello Christian,
> 
> Thanks again for all of your help! I started a bonnie test using the 
> following:
> bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b
>
While that gives you a decent idea of the limitations of kernelspace-mounted
RBD images, it won't tell you what your cluster is actually capable of in
terms of raw power.

For that, use rados bench. However, if your cluster is as brittle as it
seems, this may very well cause OSDs to flap, so look out for that.
Observe your nodes (a bit tricky with 21, but try) while this is going on.

To test the write throughput, do something like this:
"rados -p rbd bench 60 write  -t 64"
 
To see your CPUs melt and get an idea of the IOPS capability with 4k
blocks, do this:

"rados -p rbd bench 60 write  -t 64 -b 4096"
 
> Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
> clears the slow marker for now
> 
> kh10-9$ ceph -w
>      cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
>       health HEALTH_OK
>       monmap e1: 3 mons at

3 monitors: another recommendation/default that isn't really adequate for
a cluster of this size and magnitude, because it means you can only lose
ONE monitor before the whole thing seizes up. I'd get 2 more (with DC
S3700s, 100 or 200GB will do fine) and spread them among the racks.
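Adding them is straightforward; assuming the cluster was rolled out with
ceph-deploy (the hostname below is just a placeholder), something along
these lines should do, followed by a check that they actually joined the
quorum:

ceph-deploy mon add <new-mon-host>
ceph quorum_status --format json-pretty

If it was deployed by hand, the manual "adding a monitor" procedure from
the docs applies instead.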
 
> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, 
> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
>       osdmap e15356: 1256 osds: 1256 up, 1256 in
>        pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
>              566 TB used, 4001 TB / 4567 TB avail
That's a lot of objects and data. Was your cluster that full before it
started to have problems?

>                 87560 active+clean
Odd number of PGs; it makes for roughly 70 per OSD, a bit on the low side.
OTOH you're already having scaling issues of sorts, so probably leave it be
for now. How are those spread over your 18 pools?
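If you want to see that breakdown (pg_num and replication size per pool),
a plain:

ceph osd dump | grep pool

will show it.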

>    client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
> 
Is that a typical, idle, steady state example or is this while you're
running bonnie and pushing things into radosgw?

> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
> MB/s rd, 1090 MB/s wr, 5774 op/s
> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
> MB/s rd, 1548 MB/s wr, 7552 op/s
> 2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
> MB/s rd, 2284 MB/s wr, 10451 op/s
> 
> Once the next slow OSD comes up I guess I can tell it to bump its log 
> up to 5 and see what may be going on.
> 
> That said I didn't see much last time.
>
You may be looking at something I saw the other day; find my
"Unexplainable slow request" thread, which unfortunately still remains a
mystery. But the fact that it worked until recently suggests loading
issues.
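Incidentally, if you want to bump the log level on a running OSD without
restarting it (and losing the state you're trying to catch), injectargs
should do the trick; the OSD id below is a placeholder:

ceph tell osd.<id> injectargs '--debug-osd 5 --debug-ms 1'

Just remember to turn it back down afterwards, those logs grow quickly.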

> On 12/19/2014 12:17 AM, Christian Balzer wrote:
> > Hello,
> >
> > On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:
> >
> >> Wow Christian,
> >>
> >> Sorry I missed these in line replies. Give me a minute to gather some
> >> data. Thanks a million for the in depth responses!
> >>
> > No worries.
> >
> >> I thought about raiding it but I needed the space, unfortunately. I
> >> had a 3x60 OSD node test cluster that we tried before this and it
> >> didn't have this flapping issue or the RGW issue I am seeing.
> >>
> > I think I remember that...
> 
> I hope not. I don't think I posted about it at all. I only had it for a 
> short period before it was repurposed. I did post about a cluster 
> before that with 32 OSDs per node though. That one had tons of issues 
> but now seems to be running relatively smoothly.
> 
Might have been that then.

> >
> > You do realize that the RAID6 configuration option I mentioned would
> > actually give you MORE space (replication of 2 is sufficient with
> > reliable OSDs) than what you have now?
> > Albeit probably at reduced performance, how much would also depend on
> > the controllers used, but at worst the RAID6 OSD performance would be
> > equivalent to that of single disk.
> > So a Cluster (performance wise) with 21 nodes and 8 disks each.
> 
> Ah, I must have misread; I thought you said RAID10, which would halve the 
> storage and add a small write penalty. For a RAID6 of 4 drives I would get 
> something like 160 IOPS (assuming each drive does 75), which may be worth 
> it. I would just hate to have 2+ failures and lose 4-5 drives as opposed 
> to 2, and the rebuild for a RAID6 always left a sour taste in my mouth. 
> Still, 4 slow drives are better than 4TB of data going over the network 
> and slowing down the whole cluster.
> 
Please re-read what I wrote; I suggested 2 alternatives. One with RAID10,
which would indeed reduce your space by 25% (RAID replication of 2 and a
Ceph replication of 2 instead of 3, so going from an effective replication
of 3 to 4) but have a limited impact on performance.

And one with RAID6 across 8 drives (4 makes no sense), which would give you
more space than you have now at the cost of lower performance.
Triple disk failures in a set of 8 drives are extremely rare, and with
spare disks as I suggested the risk drops to near zero.
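To put rough numbers on that, using the 4567 TB raw from your ceph -w
output above and ignoring filesystem and RAID overheads:

- current setup, replication 3:     4567 / 3       ~= 1522 TB usable
- RAID10 plus replication 2:        4567 / (2*2)   ~= 1142 TB usable (the 25% less)
- 8-disk RAID6 plus replication 2:  4567 * 6/8 / 2 ~= 1713 TB usable (more than now)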

The thing you would definitely have to consider in your case is of course
to lay those RAIDs out in a way that pulling a tray doesn't impact things,
i.e. all 8 disks of a set on 4 trays, and when a dead disk has been
replaced, fail the spare that got used (if you have spares) to restore that
layout.

Your chassis is really a PITA in many ways, dense as it may be.

Of course your current controllers can't do RAID6, but your CPUs are going
to be less stressed handling 8 software RAIDs and 8 OSDs than dealing with
60 OSDs...
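For what it's worth, a minimal md sketch of what I mean (device names made
up, chunk size and the rest left at defaults):

mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
mkfs.xfs /dev/md0

Each /dev/mdX then becomes a single OSD; same idea with --level=10
--raid-devices=4 for the RAID10 variant.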

If you can live with the 25% reduction in space, I'd go for the RAID10
approach as it gives you a decent number (336, 1/4 of your current number,
but still enough to spread the load nicely) of fast OSDs that are also
very resilient and unlikely to fail while keeping the OSD processes per
node to something your hardware can handle.

> I knew about the 40 cores being low, but I thought at 2.7GHz we might be 
> fine, as the docs recommend 1 x 1GHz Xeon core per OSD. The cluster hovers 
> around a load of 15-18, but with the constantly flapping disks I am seeing 
> it climb as high as 120 when a disk is marked out of the cluster.
> 
As I wrote, that recommendation is for non-SSD clusters. And it is
optimistic.
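To put it in perspective: 1256 OSDs over 21 nodes is roughly 60 OSDs per
node, so the 1 x 1GHz rule of thumb asks for about 60GHz, while 40 cores at
2.7GHz nominally give you around 108GHz, which is why it looks fine on
paper. But with SSD journals, and especially during recovery or deep
scrubs, an OSD can easily want several times that 1GHz, which matches the
load spikes you see when a disk gets marked out.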

> kh10-3$ cat /proc/loadavg
> 14.35 29.50 66.06 14/109434 724476
>
Interesting and scary, but not unexpected.
Load is a coarse measurement though; try to observe things in more detail
with atop or, given your cluster size, maybe collectd and graphite.
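A minimal collectd sketch for the latter (the graphite host below is a
placeholder, and you will want more plugins than these; this is just to
show the shape of it):

LoadPlugin load
LoadPlugin cpu
LoadPlugin disk
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "graphing">
    Host "graphite.example.com"
    Port "2003"
    Protocol "tcp"
    Prefix "ceph."
  </Node>
</Plugin>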
 
> 
> 
> 
> 
> >   
> > No need, now that strange monitor configuration makes sense, you (or
> > whoever spec'ed this) went for the Supermicro Ceph solution, right?
> indeed.
Well, that explains that then.
Given the price of 15K SAS drives compared to DC S3700s it's rather silly.
^.^

> > In my not so humble opinion, this the worst storage chassis ever
> > designed by a long shot and totally unsuitable for Ceph.
> > I told the Supermicro GM for Japan as much. ^o^
> Well, it looks like I done goofed. I thought it was odd that they went 
> against most of what the Ceph documentation says about recommended 
> hardware, but I read/heard from them that they worked with Inktank on 
> this, so I was swayed. Besides that, we really needed the density per 
> rack due to limited floor space. As I said, in capable hands this cluster 
> would work, but only by a stroke of luck..
> 
You could have achieved that density probably with some of the top loaders
and RAID6, but you're stuck with this one for now.

How are your SSDs distributed amongst those trays?

With 2 SSDs per tray you probably have slightly less bandwidth than with 1
HDD and 1 SSD per tray, but these SSDs are very unlikely to fail. So instead
of having to shut down 5 OSDs every time a HDD on the shared tray fails
(a likely event), you would only have to do so when the neighboring SSD
fails (an unlikely event).

> 
> > Every time a HDD dies, you will have to go and shut down the other OSD
> > that resides on the same tray (and set the cluster to noout).
> > Even worse of course if a SSD should fail.
> > And if somebody should just go and hotswap things w/o that step first,
> > hello data movement storm (2 or 10 OSDs instead of 1 or 5
> > respectively).
> >
> > Christian
> Thanks for your help and insight on this! I am going to take a nap and 
> hope the cluster doesn't catch fire before I wake up o_o
> 
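For completeness, the flag dance for such a tray swap would be roughly as
follows (the OSD ids are placeholders and the start/stop syntax depends on
your init system):

ceph osd set noout
stop ceph-osd id=123      # and the other OSD(s) on that tray
... swap the tray ...
start ceph-osd id=123
ceph osd unset noout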


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



