On Mon, 5 Jan 2015 22:36:29 +0000 Sanders, Bill wrote:

> Hi Ceph Users,
>
> We've got a Ceph cluster we've built, and we're experiencing issues with
> slow or hung IO's, even running 'rados bench' on the OSD cluster.
> Things start out great, ~600 MB/s, then rapidly drops off as the test
> waits for IO's. Nothing seems to be taxed... the system just seems to be
> waiting. Any help trying to figure out what could cause the slow IO's
> is appreciated.
>
I assume there is nothing in the logs of the respective OSDs either?
Kernel or other logs equally silent?

Watching things with atop (while running the test) also shows nothing in
particular?

Looking at the myriad of throttles and other data in
http://ceph.com/docs/next/dev/perf_counters/ might be helpful for the
affected OSDs (there is a quick admin socket example below).

Having this kind of (consistent?) trouble feels like a networking issue
of sorts, with OSDs unable to reach each other, or something massively
messed up in the I/O stack.

[snip]

> Our ceph cluster is 4x Dell R720xd nodes:
> 2x1TB spinners configured in RAID for the OS
> 10x4TB spinners for OSD's (XFS)
> 2x400GB SSD's, each with 5x~50GB OSD journals
> 2x Xeon E5-2620 CPU (/proc/cpuinfo reports 24 cores)
> 128GB RAM
> Two networks (public+cluster), both over infiniband
>
Usual IB kernel tuning done, network stack stuff and vm/min_free_kbytes
raised to at least 512MB? (A sysctl example is below as well.)

> Three monitors are configured on the first three nodes, and use a chunk
> of one of the SSDs for their data, on an XFS partition
>
Since you see nothing in the logs this is probably not your issue, but
monitors like fast I/O for their leveldb, so an SSD is recommended.

> Software:
> SLES 11SP3, with some in house patching. (3.0.1 kernel, "ceph-client"
> backported from 3.10) Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE
>
Can't you get a 3.16 backport for this?

> ceph.conf:
> fsid = 3e8dbfd8-c3c8-4d30-80e2-cd059619d757
> mon initial members = tvsaq1, tvsaq2, tvsar1
> mon host = 39.7.48.6, 39.7.48.7, 39.7.48.8
>
> cluster network = 39.64.0.0/12
> public network = 39.0.0.0/12
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd journal size = 9000
Not sure how this will affect things given that you have 50GB partitions.
I'd remove that line and replace it with something like:

filestore_max_sync_interval = 30

(I use 10 with 10GB journals; a ceph.conf snippet is below.)

Regards,

Christian

> filestore xattr use omap = true
> osd crush update on start = false
> osd pool default size = 3
> osd pool default min size = 1
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
>
> mon clock drift allowed = .100
> osd mount options xfs = rw,noatime,inode64
>
>
>
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
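
To pull those perf counters (and recent slow ops) from a specific OSD,
something along these lines should work on Firefly; osd.12 is just a
placeholder ID, the socket path assumes the default location, and the
commands need to be run (likely as root) on the node hosting that OSD:

    # dump all perf counters (throttles included) as JSON
    ceph daemon osd.12 perf dump

    # or talk to the admin socket directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump

    # the slowest recent ops the OSD remembers, with per-step timestamps
    ceph daemon osd.12 dump_historic_ops

Comparing a perf dump taken while the bench is still fast with one taken
while it stalls should show which throttle or queue is filling up.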
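
For reference, 512MB of min_free_kbytes translates to 524288 KB; the
value is only an example, tune it to the amount of RAM in the node:

    # apply at runtime
    sysctl -w vm.min_free_kbytes=524288

    # make it persistent, e.g. in /etc/sysctl.conf
    vm.min_free_kbytes = 524288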
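
And to spell out the journal/sync interval bit above in ceph.conf terms,
something like the following in the [osd] section; 30 is just a starting
point, and the OSDs need a restart to pick up the change:

    [osd]
        # "osd journal size = 9000" dropped, as discussed above
        filestore max sync interval = 30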