Re: Slow/Hung IOs

Thanks for the reply.

14 and 18 happened to show up during that run, but it's certainly not only those OSDs.  It seems to vary from run to run.  Just from the runs I've done today, I've seen the following pairs of OSDs:

['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30', '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33', '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31', '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36', '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10', '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37']

That's almost all of the OSDs in the system.
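
In case it's useful, something like this should reproduce that set of pairs from ceph.log (a rough sketch; adjust the log path if yours differs):

# pull the unique "waiting for subops from X,Y" pairs out of the slow request warnings
grep 'currently waiting for subops from' /var/log/ceph/ceph.log \
  | sed 's/.*waiting for subops from //' \
  | sort -u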

Bill


From: Lincoln Bryant [lincolnb@xxxxxxxxxxxx]
Sent: Monday, January 05, 2015 3:40 PM
To: Sanders, Bill
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Slow/Hung IOs

Hi Bill,

From your log excerpt, it looks like your slow requests are happening on OSDs 14 and 18. Is it always these two OSDs?

If you don't have a long recovery time (e.g., the cluster is just full of test data), maybe you could try setting OSDs 14 and 18 out and re-benching?
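
Roughly something like this (just a sketch, untested; adjust the IDs as needed):

# mark the suspect OSDs out so data migrates off them
ceph osd out 14
ceph osd out 18
# wait for the cluster to settle (watch 'ceph -w' until active+clean), then re-bench
rados -p rbd bench 60 write -t 32
# bring them back in afterwards
ceph osd in 14
ceph osd in 18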

Alternatively, I suppose you could just use bonnie++ or dd, etc., to write to the disks backing those OSDs (being careful not to clobber any Ceph dirs) and see how the performance looks.
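
For dd, maybe something along these lines (assuming the default data path of /var/lib/ceph/osd/ceph-<id>; write only to a scratch file, don't touch the 'current' dir):

# direct, synced 4MB writes to a throwaway file on osd.14's data disk
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-14/ddtest.tmp bs=4M count=1024 oflag=direct conv=fsync
rm /var/lib/ceph/osd/ceph-14/ddtest.tmp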

Cheers,
Lincoln

On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote:

Hi Ceph Users,

We've got a Ceph cluster we've built, and we're experiencing issues with slow or hung IOs, even when running 'rados bench' against the cluster.  Things start out great, ~600 MB/s, then throughput rapidly drops off as the test waits for IOs.  Nothing seems to be taxed... the system just seems to be waiting.  Any help figuring out what could cause the slow IOs is appreciated.

For example, 'rados -p rbd bench 60 write -t 32' takes over 900s to complete:

A typical rados bench:
Total time run:         957.458274
Total writes made:      9251
Write size:             4194304
Bandwidth (MB/sec):     38.648

Stddev Bandwidth:       157.323
Max bandwidth (MB/sec): 964
Min bandwidth (MB/sec): 0
Average Latency:        3.21126
Stddev Latency:         51.9546
Max latency:            910.72
Min latency:            0.04516


According to ceph.log, we're not experiencing any OSD flapping or monitor election cycles, just slow requests:

# grep slow /var/log/ceph/ceph.log:
2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 513.611379 secs
2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 520.612372 secs
2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080: osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.000000000c77 [write 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops from 26,37
2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 543.615545 secs
2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow requests, 1 included below; oldest blocked for > 606.941954 secs
2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272: osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.000000000010 [write 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops from 27,33
2015-01-05 13:44:12.950805 osd.18 39.7.48.7:6803/11185 226 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 603.624511 secs
2015-01-05 13:44:12.950812 osd.18 39.7.48.7:6803/11185 227 : [WRN] slow request 120.149561 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:46:12.988010 osd.18 39.7.48.7:6803/11185 228 : [WRN] 3 slow requests, 1 included below; oldest blocked for > 723.661722 secs
2015-01-05 13:46:12.988017 osd.18 39.7.48.7:6803/11185 229 : [WRN] slow request 240.186772 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37
2015-01-05 13:46:18.971570 osd.14 39.7.48.7:6818/11640 253 : [WRN] 4 slow requests, 1 included below; oldest blocked for > 740.980083 secs
2015-01-05 13:46:18.971577 osd.14 39.7.48.7:6818/11640 254 : [WRN] slow request 480.063439 seconds old, received at 2015-01-05 13:38:18.908100: osd_op(client.91911.1:3113675 rb.0.13f5.238e1f29.000000000010 [write 475136~512] 3.679a939d ondisk+write e994) v4 currently waiting for subops from 27,34
2015-01-05 13:48:05.030581 osd.14 39.7.48.7:6818/11640 255 : [WRN] 4 slow requests, 1 included below; oldest blocked for > 847.039098 secs
2015-01-05 13:48:05.030587 osd.14 39.7.48.7:6818/11640 256 : [WRN] slow request 480.198282 seconds old, received at 2015-01-05 13:40:04.832272: osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.000000000010 [write 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops from 27,33


iostat -dxm:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda            5662.99     3.63  155.17    3.37    22.94     0.03   296.72     1.05    6.64   1.79  28.32
sdb               0.73     0.04    3.62   70.66     0.04    10.85   300.04     0.06    0.79   0.08   0.61
sdc               0.70     0.00    1.70   65.45     0.01    10.34   315.60     0.06    0.85   0.09   0.57
sdd               0.07     0.95   16.78   13.47     1.03     2.28   224.25     0.37   12.32   3.84  11.62
sde               0.26     0.80   25.79   10.89     2.11     1.85   220.74     2.82   76.94   4.73  17.35
sdf               0.23     0.94   21.16   12.39     1.56     2.14   225.94     2.81   83.84   4.70  15.75
sdg               0.02     0.91    9.22   12.30     0.33     2.06   227.64     0.26   12.26   3.67   7.89
sdh               0.07     0.86   14.31   12.35     0.84     2.06   223.02     2.17   81.27   4.79  12.76
sdk               0.19     0.85   15.92   10.48     1.06     1.78   220.09     3.11  114.07   4.68  12.35
sdl               0.30     0.87   20.32   12.07     1.45     2.06   221.99     2.71   82.65   4.68  15.16
sdi               0.13     0.88   16.03   11.51     1.07     1.94   224.27     2.99  108.40   4.64  12.78
sdj               0.15     0.90   15.63   11.83     0.95     2.01   220.32     2.85  103.83   4.65  12.76
sdm               0.19     0.86   20.48   12.04     1.46     2.04   220.42     2.64   81.17   4.72  15.34
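
To map those devices back to OSD IDs, something like this should work (assuming the default mount points under /var/lib/ceph/osd):

df -h /var/lib/ceph/osd/ceph-*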

Our ceph cluster is 4x Dell R720xd nodes:
2x 1TB spinners configured in RAID for the OS
10x 4TB spinners for OSDs (XFS)
2x 400GB SSDs, each with 5x ~50GB OSD journals
2x Xeon E5-2620 CPUs (/proc/cpuinfo reports 24 cores)
128GB RAM
Two networks (public + cluster), both over InfiniBand

Three monitors are configured on the first three nodes; each uses a chunk of one of the SSDs for its data, on an XFS partition.

Software:
SLES 11 SP3, with some in-house patching (3.0.1 kernel, "ceph-client" backported from 3.10)
Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE

ceph.conf:
fsid = 3e8dbfd8-c3c8-4d30-80e2-cd059619d757
mon initial members = tvsaq1, tvsaq2, tvsar1
mon host = 39.7.48.6, 39.7.48.7, 39.7.48.8

cluster network = 39.64.0.0/12
public network = 39.0.0.0/12
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 9000
filestore xattr use omap = true
osd crush update on start = false
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 4096
osd pool default pgp num = 4096

mon clock drift allowed = .100
osd mount options xfs = rw,noatime,inode64




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
