Re: 1256 OSD/21 server ceph cluster performance issues.

Hello Christian,

Sorry for the long wait. I actually ran a rados bench earlier on the cluster
without any failures, but it did take a while; there is also a lot of data
being downloaded into the cluster right now. Here are the rados bench results
for a 100-second run:
http://pastebin.com/q5E6JjkG
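
For reference, a 100-second write run like the one above looks roughly like
this (the -t 32 concurrency is just an example value, not necessarily what I
used for the pasted run):

rados -p rbd bench 100 write -t 32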

On 12/19/2014 08:10 PM, Christian Balzer wrote:
> Hello Sean,
>
> On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote:
>
>> Hello Christian,
>>
>> Thanks again for all of your help! I started a bonnie test using the 
>> following::
>> bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b
>>
> While that gives you a decent idea of what the limitations of kernelspace
> mounted RBD images are, it won't tell you what your cluster is actually
> capable of in raw power.
Indeed, I agree, and I am not interested in raw power at this point as I am a
bit past that. I ran a rados test earlier and it performed about as expected.
What I have noticed in rados bench tests is that a test can only go as fast as
the client's network allows; the results above seem to demonstrate this as
well. If I were to start two rados bench tests from two different hosts, I am
confident I could push above 1100 Mbps without any issue.
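
Running the same bench from two hosts at the same time should show that, e.g.:

kh08-8$ rados -p rbd bench 60 write -t 64
kh09-8$ rados -p rbd bench 60 write -t 64

(My understanding is that the bench objects are prefixed with hostname and PID,
so the two runs should not clobber each other -- please correct me if that is
wrong.)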



>
> For that use rados bench, however if your cluster is as brittle as it
> seems, this may very well cause OSDs to flop, so look out for that.
> Observe your nodes (a bit tricky with 21, but try) while this is going on.
>
> To test the write throughput, do something like this:
> "rados -p rbd bench 60 write  -t 64"
>  
> To see your CPUs melt and get an idea of the IOPS capability with 4k
> blocks, do this:
>
> "rados -p rbd bench 60 write  -t 64 -b 4096"
>  
I will try 4k blocks next to see how that works out. I honestly think the
cluster will be stressed but should be able to handle it. A rebuild after a
failure will be scary, however.
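
If a rebuild does kick in mid-test, I assume I can at least soften the blow by
throttling recovery with something along these lines (conservative values, not
tuned):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'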

>> Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
>> clears the slow marker for now
>>
>> kh10-9$ ceph -w
>>      cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
>>       health HEALTH_OK
>>       monmap e1: 3 mons at
> 3 monitors, another recommendation/default that isn't really adequate for
> a cluster of this size and magnitude. Because it means you can only lose
> ONE monitor before the whole thing seizes up. I'd get 2 more (with DC
> S3700, 100 or 200GB will do fine) and spread them among the racks. 
The plan is to scale out to two more monitors; they have not arrived yet, but
it is in the works. I agree about the number of monitors. I talked to
Inktank/Red Hat about this when I was testing the 36-disk storage node cluster,
though they said something along the lines of us not needing two more until we
have a much larger cluster. Just know that two more monitors are indeed on the
way and that this is a known issue.
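
Once the hardware lands, my understanding is that adding each one is the usual
manual procedure (hostname/IP below are placeholders): prepare the mon data dir
on the new host, then

ceph mon add <new-mon-hostname> <new-mon-ip>:6789

start the new ceph-mon, and confirm it joins with 'ceph quorum_status'.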

>  
>> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, 
>> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
>>       osdmap e15356: 1256 osds: 1256 up, 1256 in
>>        pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
>>              566 TB used, 4001 TB / 4567 TB avail
> That's a lot of objects and data, was your cluster that full before
> it started to have problems?
This is due to the rados benches I ran as well as the massive amount of data
we are transferring into the cluster.
We currently have 20 pools:
1 data,2 rbd,3 .rgw,4 .rgw.root,5 .rgw.control,6 .rgw.gc,7
.rgw.buckets,8 .rgw.buckets.index,9 .log,10 .intent-log,11 .usage,12
.users,13 .users.email,14 .users.swift,15 .users.uid,16 volumes,18
vms,19 .rgw.buckets.extra,20 images,

The data and rbd pools will be removed once I am done testing; they were the
test pools I created. The rest are the standard s3/swift and OpenStack pools.
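
When that time comes I expect removing them is just:

ceph osd pool delete data data --yes-i-really-really-mean-it
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it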

>>                 87560 active+clean
> Odd number of PGs, it makes for 71 per OSD, a bit on the low side. OTOH
> you're already having scaling issues of sorts, so probably leave it be for
> now. How many pools?
20 pools, but we will only have 18 once I delete data and rbd (these
were just testing pools to begin with).
>
>>    client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
>>
> Is that a typical, idle, steady state example or is this while you're
> running bonnie and pushing things into radosgw?
I am doing both actually. The downloads into radosgw can't be stopped
right now but I can stop the bonnie tests.


>
>> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
>> MB/s rd, 1090 MB/s wr, 5774 op/s
>> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
>> MB/s rd, 1548 MB/s wr, 7552 op/s
>> 2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
>> MB/s rd, 2284 MB/s wr, 10451 op/s
>>
>> Once the next slow osd comes up I guess I can tell it to bump its log 
>> up to 5 and see what may be going on.
>>
>> That said I didn't see much last time.
>>
> You may be looking at something I saw the other day, find my 
> "Unexplainable slow request" thread. Which unfortunately still remains a
> mystery.
> But the fact that it worked until recently suggests loading issues.

Thank you so much! The main issue I am having is that my radosgw can only push
out around 8mbps per client. The odd thing is that I can start 1000 clients and
radosgw seems to handle them without any further issue, but each client still
only gets about 8mbps.

The other thing is that another rack with only 3 storage nodes, configured
exactly the same, is able to get around 600-1100mb/s to a single client without
any issue. The hardware is the same, so I do not understand why, seemingly all
of a sudden, I am seeing such slow rados performance.
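
For what it's worth, a quick way to reproduce the per-client number is a
single-stream download against the gateway, roughly (host/bucket/object names
are placeholders):

curl -s -o /dev/null -w '%{speed_download}\n' http://<rgw-host>/<bucket>/<large-object>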

I know the other issue is the number of OSDs we have per chassis and per CPU,
which seems to cause OSDs to be marked as slow. While I believe that issue is
related, I have a hard time believing it is the entire reason why
radosgw/civetweb has such slow per-client throughput while RBD and librados
performance stays somewhat intact.

I know you recommended that I look at the performance dump of an OSD to check
whether anything is being throttled. Here is the dump from one of the 'slow'
OSDs:
http://paste.ubuntu.com/9599184/
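
(For anyone following along, I believe the way to pull such a dump is via the
OSD's admin socket, with <id> being the slow OSD's number, e.g.:

ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok perf dump | python -m json.tool )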

Looking at the throttle sections there, I do not see any failures, but I do
think the latency seems a bit high, as does the figure of 17 dirtied IOs. I
tried to find documentation on what all of these fields mean, but all I can
find are the developer links here:

http://ceph.com/docs/master/dev/perf_counters/
http://ceph.com/docs/master/dev/osd_internals/
http://ceph.com/docs/master/dev/osd_internals/wbthrottle/

Is there a better location to track these fields down outside of the
source code itself?
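
Short of the source, the only other thing I can think of is the counter schema
from the same admin socket, which (if I understand it correctly) at least
enumerates every counter and its type, even if it does not explain them:

ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok perf schema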

Thank you again for all of your help. Sorry for any dense questions, I
am just trying to understand as much as I can from this ^_^




>
>> On 12/19/2014 12:17 AM, Christian Balzer wrote:
>>> Hello,
>>>
>>> On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:
>>>
>>>> Wow Christian,
>>>>
>>>> Sorry I missed these in line replies. Give me a minute to gather some
>>>> data. Thanks a million for the in depth responses!
>>>>
>>> No worries.
>>>
>>>> I thought about raiding it but I needed the space unfortunately. I
>>>> had a 3x60 osd node test cluster that we tried before this and it
>>>> didn't have this flopping issue or rgw issue I am seeing .
>>>>
>>> I think I remember that...
>> I  hope not. I don't think I posted about it at all. I only had it for a 
>> short period before it was repurposed. I did post about a cluster 
>> before that with 32 osds per node though. That one had tons of issues 
>> but now seems to be running relatively smoothly.
>>
> Might have been that then.
>
>>> You do realize that the RAID6 configuration option I mentioned would
>>> actually give you MORE space (replication of 2 is sufficient with
>>> reliable OSDs) than what you have now?
>>> Albeit probably at reduced performance, how much would also depend on
>>> the controllers used, but at worst the RAID6 OSD performance would be
>>> equivalent to that of single disk.
>>> So a Cluster (performance wise) with 21 nodes and 8 disks each.
>> Ah, I must have misread; I thought you said RAID10, which would halve the 
>> storage and add a small write penalty. For a RAID6 of 4 drives I would get 
>> something like 160 iops (assuming each drive is 75), which may be worth 
>> it. I would just hate to have 2+ failures and lose 4-5 drives as opposed 
>> to 2, and the rebuild for a RAID6 always left a sour taste in my mouth. 
>> Still 4 slow drives is better than 4TB of data over the network slowing 
>> down the whole cluster.
>>
> Please re-read what I wrote, I suggested 2 alternatives, one with RAID10
> that would indeed reduce your space by 25% (RAID replication of 2 and a
> Ceph replication of 2 as well instead of 3, so going from an effective
> replication of 3 to 4) but have a limited impact on performance. 
>
> And one with RAID6 and 8 drives (4 makes no sense) which would give you
> more space than you have now at the cost of lower performance.
> Triple disk failures in a set of 8 drives are extremely rare and with
> spare disks as I suggested it would reduce this to near zero.
>
> The thing you would have to definitely consider in your case is of course
> to lay those RAIDs out in a way that pulling a tray doesn't impact things,
> read: all 8 disks on 4 trays, and when a dead disk has been replaced fail the
> spare that got used (if you have spares) to restore that situation.
>
> Your chassis is really a PITA in many ways, dense as it may be.
>
> Of course your current controllers can't do RAID6, but your CPUs are going
> to be less stressed handling 8 software RAIDs and 8 OSDs than dealing with
> 60 OSDs...
>
> If you can live with the 25% reduction in space, I'd go for the RAID10
> approach as it gives you a decent number (336, 1/4 of your current number,
> but still enough to spread the load nicely) of fast OSDs that are also
> very resilient and unlikely to fail while keeping the OSD processes per
> node to something your hardware can handle.
>
>> I knew about the 40 cores being low, but I thought that at 2.7GHz we might 
>> be fine, as the docs recommend 1 x 1GHz of Xeon per OSD. The cluster load 
>> hovers around 15-18, but with the constantly flipping disks I am seeing it 
>> bump up as high as 120 when a disk is marked out of the cluster.
>>
> As I wrote, that recommendation is for a non-SSD cluster. And optimistic.
>
>> kh10-3$ cat /proc/loadavg
>> 14.35 29.50 66.06 14/109434 724476
>>
> Interesting and scary, but not unexpected. 
> But load is a coarse measurement, try to observe things in more detail
> with atop or given your cluster size maybe collectd and graphite.
>  
>>
>>
>>
>>>   
>>> No need, now that strange monitor configuration makes sense, you (or
>>> whoever spec'ed this) went for the Supermicro Ceph solution, right?
>> indeed.
> Well, that explains that then.
> Given the price of 15K SAS drives compared to DC S3700s it's rather silly.
> ^.^
>
>>> In my not so humble opinion, this the worst storage chassis ever
>>> designed by a long shot and totally unsuitable for Ceph.
>>> I told the Supermicro GM for Japan as much. ^o^
>> Well it looks like I done goofed. I thought it was odd that they went 
>> against most of what ceph documentation says about recommended hardware. 
>> I read/heard from them that they worked with Inktank on this though, so I 
>> was swayed. Besides that we really needed the density per rack due to 
>> limited floor space. As I said in capable hands this cluster would work 
>> but by stroke of luck..
>>
> You could have achieved that density probably with some of the top loaders
> and RAID6, but you're stuck with this one for now.
>
> How are your SSDs distributed amongst those trays?
>
> With 2 SSDs per tray you probably have slightly less bandwidth than with 1
> HDD and 1 SSD, but these SSDs are very unlikely to fail. So you are more
> likely to have to shut down 5 OSDs because a HDD failed (likely event) on
> the shared tray than because the neighboring SSD failed (unlikely event).
>
>>> Every time a HDD dies, you will have to go and shut down the other OSD
>>> that resides on the same tray (and set the cluster to noout).
>>> Even worse of course if a SSD should fail.
>>> And if somebody should just go and hotswap things w/o that step first,
>>> hello data movement storm (2 or 10 OSDs instead of 1 or 5
>>> respectively).
>>>
>>> Christian
>> Thanks for your help and insight on this! I am going to take a nap and 
>> hope the cluster doesn't catch fire before I wake up o_o
>>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


