Hello Christian,
Thanks again for all of your help! I started a bonnie test using the
following:
bonnie -d /mnt/rbd/scratch2/ -m $(hostname) -f -b
Hopefully it completes in the next hour or so. A reboot of the slow OSDs
clears the slow marker for now.
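Rather than a full reboot, a single slow OSD can usually be bounced and
poked at with something like the following (osd.12 is just a placeholder;
the restart syntax assumes Upstart-managed daemons, adjust for your init
system):
ceph health detail                            # lists which OSDs have blocked/slow requests
sudo ceph daemon osd.12 dump_ops_in_flight    # run on that OSD's host, shows what it is stuck on
sudo restart ceph-osd id=12                   # Upstart; "service ceph restart osd.12" under sysvinit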
kh10-9$ ceph -w
cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
health HEALTH_OK
monmap e1: 3 mons at
{kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0},
election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
osdmap e15356: 1256 osds: 1256 up, 1256 in
pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
566 TB used, 4001 TB / 4567 TB avail
87560 active+clean
client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433
MB/s rd, 1090 MB/s wr, 5774 op/s
2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542
MB/s rd, 1548 MB/s wr, 7552 op/s
2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726
MB/s rd, 2284 MB/s wr, 10451 op/s
Once the next slow OSD comes up I guess I can tell it to bump its log
level up to 5 and see what may be going on.
That said, I didn't see much last time.
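At runtime that should be doable with injectargs, something along these
lines (osd.12 is a placeholder again; debug-ms is optional but often
helps):
ceph tell osd.12 injectargs '--debug-osd 5 --debug-ms 1'
# and back to the defaults afterwards:
ceph tell osd.12 injectargs '--debug-osd 0/5 --debug-ms 0/5'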
On 12/19/2014 12:17 AM, Christian Balzer wrote:
Hello,
On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:
Wow Christian,
Sorry I missed these inline replies. Give me a minute to gather some
data. Thanks a million for the in-depth responses!
No worries.
I thought about RAIDing it but unfortunately I needed the space. I had a
three-node test cluster with 60 OSDs per node that we tried before this,
and it didn't have this flapping issue or the RGW issue I am seeing.
I think I remember that...
I hope not. I don't think I posted about it at all. I only had it for a
short period before it was repurposed. I did post about a cluster
before that with 32 OSDs per node though. That one had tons of issues
but now seems to be running relatively smoothly.
You do realize that the RAID6 configuration option I mentioned would
actually give you MORE space (replication of 2 is sufficient with reliable
OSDs) than what you have now?
Albeit probably at reduced performance; how much also depends on the
controllers used, but at worst the RAID6 OSD performance would be
equivalent to that of a single disk.
So, performance-wise, a cluster with 21 nodes and 8 disks each.
Ah, I must have misread; I thought you said RAID10, which would halve the
storage and add a small write penalty. For a RAID6 of 4 drives I would get
something like 160 IOPS (assuming each drive is 75), which may be worth
it. I would just hate to have 2+ failures and lose 4-5 drives as opposed
to 2, and the rebuild for a RAID6 has always left a sour taste in my mouth.
Still, 4 slow drives are better than 4 TB of data over the network slowing
down the whole cluster.
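For a rough sanity check with the usual write-penalty rule of thumb
(assuming ~75 IOPS per spinner and the standard RAID6 write penalty of 6):
reads:  4 drives x 75 IOPS    ~= 300 IOPS
writes: (4 x 75 IOPS) / 6     ~=  50 IOPS
so a mixed read/write workload lands somewhere between those two,
depending on the read/write ratio.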
I knew the 40 cores were on the low side, but I thought at 2.7 GHz we might
be fine, as the docs recommend 1x 1 GHz Xeon core per OSD. The cluster load
hovers around 15-18, but with the constantly flapping disks I am seeing it
spike as high as 120 when a disk is marked out of the cluster.
kh10-3$ cat /proc/loadavg
14.35 29.50 66.06 14/109434 724476
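As a back-of-the-envelope check against that rule of thumb (taking roughly
60 OSDs per node as an estimate, i.e. 1256 OSDs spread over 21 nodes):
recommended: 60 OSDs x 1 GHz     ~=  60 GHz per node
available:   40 cores x 2.7 GHz  ~= 108 GHz per node
so steady state fits on paper; it is the recovery/backfill after a disk
gets marked out that eats the headroom.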
No need; now that strange monitor configuration makes sense. You (or
whoever spec'ed this) went for the Supermicro Ceph solution, right?
indeed.
In my not so humble opinion, this is the worst storage chassis ever designed
by a long shot and totally unsuitable for Ceph.
I told the Supermicro GM for Japan as much. ^o^
Well, it looks like I done goofed. I thought it was odd that they went
against most of what the Ceph documentation says about recommended hardware.
I read/heard from them that they worked with Inktank on this though, so I
was swayed. Besides that, we really needed the density per rack due to
limited floor space. As I said, in capable hands this cluster would work,
but by a stroke of luck...
Every time an HDD dies, you will have to go and shut down the other OSD
that resides on the same tray (and set the cluster to noout).
Even worse, of course, if an SSD should fail.
And if somebody should just go and hotswap things without that step first,
hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively).
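In practice that step would look something like this per tray swap (the
OSD ID is a placeholder; the stop/start syntax assumes Upstart-managed
daemons):
ceph osd set noout           # keep the stopped OSDs from being marked out
sudo stop ceph-osd id=13     # the healthy OSD sharing the tray
# ...swap the tray, redeploy the failed OSD on the new disk...
sudo start ceph-osd id=13
ceph osd unset noout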
Christian
Thanks for your help and insight on this! I am going to take a nap and
hope the cluster doesn't catch fire before I wake up o_o
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com