Re: 1256 OSD/21 server ceph cluster performance issues.

Hello,

Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

> Hello Yall!
> 
> I can't figure out why my gateways are performing so poorly and I am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s) 
>
I wouldn't call 300MB/s writes fine with a cluster of this size. 
How are you testing this (which tool, settings, from where)?
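
For a comparable number I'd do a large, direct sequential write against the
mounted RBD plus a pure RADOS baseline from the same client. Paths, pool
names and sizes below are only placeholders, adjust to taste:

  fio --name=rbdtest --filename=/mnt/rbd/testfile --rw=write --bs=4M \
      --size=20G --direct=1 --ioengine=libaio --iodepth=16

  rados -p rbd bench 60 write -t 32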

> while uploading a 5G file to Swift/S3 takes 2m32s
> (32MB/s I believe). If we try a 1G file it's closer to 8MB/s. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start looking
> as outside of another issue which I will mention below I am clueless.
> 
I know nuttin' about radosgw, but I wouldn't be surprised if the difference
you see here comes down to how that data is eventually written to the storage
(smaller chunks than what you're using to test RBD performance).
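
You could check that by benching RADOS directly with the op size radosgw ends
up using (4MB stripes by default, if memory serves), against the bucket data
pool or a scratch pool:

  rados -p .rgw.buckets bench 60 write -b 4194304 -t 16

If that is dramatically slower than your RBD numbers, the cluster and not the
gateways is where to dig.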

> I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
What maker/model?

> 12 x  400GB s3700 SSD drives
Journals, one assumes. 

> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
> the 3 cards)
I smell a port-expander or 3 on your backplane. 
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

> 512 GB of RAM
Sufficient.

> 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY, insufficient.
Those two CPUs give you roughly 50GHz (2 x 10 cores x 2.5GHz), which is still
10GHz short of the (optimistic IMHO) recommendation of 1GHz per OSD w/o SSD
journals.
With SSD journals my experience shows that with certain write patterns even
3.5GHz per OSD isn't sufficient (there are several threads about this here).

> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
> 
Your journals could handle about 5.5GB/s (12 x ~460MB/s for those S3700s),
while 2x10Gb/s tops out around 2.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (with a
different controller) to keep the number of OSDs per node to something the
CPUs (any CPU really!) can handle.
Something like 16 x 4-HDD RAID10 OSDs + SSDs + spares (if possible) for
performance, or 8 x 8-HDD RAID6 OSDs + SSDs + spares for capacity.
That still gives you 336 or 168 OSDs respectively, allows for a replication
size of 2, and as a bonus you'll probably never have to deal with a failed
OSD. ^o^

> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
I would have used Intel DC S3700s here as well, since mons love their leveldb
to be fast, but
> 1 x SAS 2208
combined with this controller it should be fine.

> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
> 
> 
> Here is a pastebin dump of my details, I am running ceph giant 0.87 
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
> across the entire cluster.
> 
> http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
> 
> 
> We ran into a few issues with density (conntrack limits, pid limit, and
> number of open files) all of which I adjusted by bumping the ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can include it.
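
For the archives, the knobs involved here are usually along these lines (the
values are only examples, yours may well differ):

  # /etc/sysctl.d/ceph.conf
  kernel.pid_max = 4194303
  fs.file-max = 26234859
  net.netfilter.nf_conntrack_max = 1048576

  # /etc/security/limits.d/ceph.conf
  root  soft  nofile  65536
  root  hard  nofile  65536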
> 
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. 
Anything changed?
In particular, I assume this is a new cluster; has much data been added?
A "ceph -s" output would be nice and educational.

Can you correlate the times when you start seeing slow, blocked requests with
scrubs or deep-scrubs? If so, try setting your cluster temporarily to noscrub
and nodeep-scrub and see if that helps. In case it does, setting
"osd_scrub_sleep" (start with something high like 1.0 or 0.5 and lower it
until it hurts again) should help permanently.
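
Concretely (commands from memory, double-check against your version):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # and if that turns out to be it:
  ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

plus the matching osd_scrub_sleep entry in the [osd] section of ceph.conf so
it survives restarts.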

I have a cluster that could scrub things in minutes until the amount of
objects/data and the steady load reached a threshold; now it takes hours.

In this context, check the fragmentation of your OSDs.
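
Assuming XFS-backed OSDs, something like this (read-only, safe on a live
disk) gives you a quick idea:

  xfs_db -r -c frag /dev/sdX1

Anything well into the double digits fragmentation-wise is worth an xfs_fsr
run during a quiet period.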

How busy (ceph.log ops/s) is your cluster at these times?
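
The pgmap lines the mons write to the central ceph.log have those numbers,
e.g. on a monitor:

  grep pgmap /var/log/ceph/ceph.log | tail -5

shows the recent client rd/wr bandwidth and op/s.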

> RBD transfers seem to be
> fine for the most part which makes me think that this has little bearing
> on the gateway issue but it may be related. Rebooting the OSD seems to
> fix this issue.
>
Do you see the same OSDs misbehaving over and over again or is this fully
random?

How busy are your storage nodes? CPU-wise mostly; atop is a nice tool to
check this.
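
Running it with a short interval on a storage node while the problem is
happening, say:

  atop 5

and watching the ceph-osd processes (and the DSK lines for your journal SSDs)
should quickly show whether you're saturating CPU or disks.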

My guess w/o further data at this point would be that you're running out
of CPU at certain times, explaining your flopping OSDs. 

Regards,

Christian
 
> I would like to figure out the root cause of both of these issues and
> post the results back here if possible (perhaps it can help other
> people). I am really looking for a place to start looking at as the
> gateway just outputs that it is posting data and all of the logs
> (outside of the monitors reporting down osds) seem to show a fully
> functioning cluster.
> 
> Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
> well.


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



