Huge issues with slow requests

On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote:

> 
> On 06 Sep 2014, at 19:37, Josef Johansson <josef at oderland.se> wrote:
> 
> > Hi,
> > 
> > Unfortunately the journal tuning did not do much. That's odd, because I
> > don't see much utilisation on the OSDs themselves. So this points to a
> > network issue between the OSDs, right?
> > 
> To answer my own question: restarted a bond and it all went up again,
> found the culprit: packet loss. Everything up and running afterwards.
> 
If there were actual errors, that should have been visible in atop as well.
For utilization it isn't that obvious, as it doesn't know what bandwidth a
bond device has. Same is true for IPoIB interfaces.
And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at
10Mb/s, so they get to be falsely redlined on compute nodes all the time.
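
If you want to double-check for packet loss directly, the per-interface
error/drop counters are a quick sanity check (a sketch only; interface names
are just examples):

  ip -s link show bond0
  cat /proc/net/bonding/bond0
  cat /sys/class/net/bond0/statistics/rx_errors /sys/class/net/bond0/statistics/rx_dropped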

> I'll be taking that beer now,

Skol.

Christian

> Regards,
> Josef
> > On 06 Sep 2014, at 18:17, Josef Johansson <josef at oderland.se> wrote:
> > 
> >> Hi,
> >> 
> >> On 06 Sep 2014, at 17:59, Christian Balzer <chibi at gol.com> wrote:
> >> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
> >>> 
> >>>> Hi,
> >>>> 
> >>>> On 06 Sep 2014, at 17:27, Christian Balzer <chibi at gol.com> wrote:
> >>>> 
> >>>>> 
> >>>>> Hello,
> >>>>> 
> >>>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> >>>>> 
> >>>>>> We managed to get through the restore, but the performance
> >>>>>> degradation is still there.
> >>>>>> 
> >>>>> Manifesting itself how?
> >>>>> 
> >>>> Awfully slow IO on the VMs, and iowait; it's about 2MB/s or so.
> >>>> But mostly a lot of iowait.
> >>>> 
> >>> I was thinking about the storage nodes. ^^
> >>> As in, does a particular node or disk seem to be redlined all the
> >>> time?
> >> They're idle, with little IO wait.
> > It also shows itself as before, with slow requests now and then.
> > 
> > Like this:
> > 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN]
> > slow request 31.554785 seconds old, received at 2014-09-06
> > 19:12:56.914688: osd_op(client.12483520.0:12211087
> > rbd_data.4b8e9b3d1b58ba.0000000000001222 [stat,write 3813376~4096]
> > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN]
> > slow request 31.554736 seconds old, received at 2014-09-06
> > 19:12:56.914737: osd_op(client.12483520.0:12211088
> > rbd_data.4b8e9b3d1b58ba.0000000000001222 [stat,write 3842048~8192]
> > 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN]
> > slow request 30.691760 seconds old, received at 2014-09-06
> > 19:12:57.777713: osd_op(client.12646408.0:36726433
> > rbd_data.81ab322eb141f2.000000000000ec38 [stat,write 749568~4096]
> > 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN]
> > 23 slow requests, 2 included below; oldest blocked for > 42.196747
> > secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 :
> > [WRN] slow request 30.344653 seconds old, received at 2014-09-06
> > 19:13:01.125248: osd_op(client.18869229.0:100325
> > rbd_data.41d2eb2eb141f2.0000000000002732 [stat,write 2174976~4096]
> > 3.55d437e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN]
> > slow request 30.344579 seconds old, received at 2014-09-06
> > 19:13:01.125322: osd_op(client.18869229.0:100326
> > rbd_data.41d2eb2eb141f2.0000000000002732 [stat,write 2920448~4096]
> > 3.55d437e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN]
> > 24 slow requests, 1 included below; oldest blocked for > 43.196971
> > secs 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 :
> > [WRN] slow request 30.627252 seconds old, received at 2014-09-06
> > 19:13:01.842873: osd_op(client.10785413.0:136148901
> > rbd_data.96803f2eb141f2.00000000000033d7 [stat,write 4063232~4096]
> > 3.cf740399 e15861) v4 currently waiting for subops from [1,13]
> > 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN]
> > 27 slow requests, 3 included below; oldest blocked for > 48.197700
> > secs 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 :
> > [WRN] slow request 30.769509 seconds old, received at 2014-09-06
> > 19:13:06.701345: osd_op(client.18777372.0:1605468
> > rbd_data.2f1e4e2eb141f2.0000000000003541 [stat,write 1118208~4096]
> > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN]
> > slow request 30.769458 seconds old, received at 2014-09-06
> > 19:13:06.701396: osd_op(client.18777372.0:1605469
> > rbd_data.2f1e4e2eb141f2.0000000000003541 [stat,write 1130496~4096]
> > 3.db1ca37e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN]
> > slow request 30.266843 seconds old, received at 2014-09-06
> > 19:13:07.204011: osd_op(client.18795696.0:847270
> > rbd_data.30532e2eb141f2.00000000000036bd [stat,write 3772416~4096]
> > 3.76f1df7e e15861) v4 currently waiting for subops from [13,6]
> > 2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN]
> > 30 slow requests, 3 included below; oldest blocked for > 49.197952
> > secs 2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 :
> > [WRN] slow request 30.706236 seconds old, received at 2014-09-06
> > 19:13:07.764870: osd_op(client.12483523.0:36628673
> > rbd_data.4defd32eb141f2.0000000000015200 [stat,write 2121728~4096]
> > 3.cd82ed8a e15861) v4 currently waiting for subops from [0,13]
> > 2014-09-06 19:13:38.471162 osd.25 10.168.7.23:6827/11423 376 : [WRN]
> > slow request 30.695616 seconds old, received at 2014-09-06
> > 19:13:07.775490: osd_op(client.10785416.0:72721328
> > rbd_data.96808f2eb141f2.0000000000002a37 [stat,write 1507328~4096]
> > 3.323e11da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:38.471165 osd.25 10.168.7.23:6827/11423 377 : [WRN]
> > slow request 30.623025 seconds old, received at 2014-09-06
> > 19:13:07.848081: osd_op(client.18809749.0:315523
> > rbd_data.26fad12eb141f2.000000000000473a [stat,write 2248704~4096]
> > 3.26e839da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:39.471304 osd.25 10.168.7.23:6827/11423 378 : [WRN]
> > 31 slow requests, 1 included below; oldest blocked for > 50.198117
> > secs 2014-09-06 19:13:39.471310 osd.25 10.168.7.23:6827/11423 379 :
> > [WRN] slow request 30.749529 seconds old, received at 2014-09-06
> > 19:13:08.721742: osd_op(client.18780171.0:649232
> > rbd_data.55019a2eb141f2.000000000000097c [stat,write 188416~4096]
> > 3.c31859da e15861) v4 currently waiting for subops from [13,2]
> > 2014-09-06 19:13:40.471439 osd.25 10.168.7.23:6827/11423 380 : [WRN]
> > 32 slow requests, 1 included below; oldest blocked for > 51.198249
> > secs 2014-09-06 19:13:40.471445 osd.25 10.168.7.23:6827/11423 381 :
> > [WRN] slow request 30.842014 seconds old, received at 2014-09-06
> > 19:13:09.629389: osd_op(client.9281213.0:39090073
> > rbd_data.4e1ed72eb141f2.0000000000003949 [stat,write 397312~4096]
> > 3.94374f99 e15861) v4 currently waiting for subops from [1,13]
> > 2014-09-06 19:14:16.976270 osd.24 10.168.7.23:6819/10217 721 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.706905 secs
> > 2014-09-06 19:14:16.976276 osd.24 10.168.7.23:6819/10217 722 : [WRN]
> > slow request 30.706905 seconds old, received at 2014-09-06
> > 19:13:46.269338: osd_op(client.18933091.0:413037
> > rbd_data.967e522eb141f2.000000000000f838 [stat,write 2584576~163840]
> > 3.54f712a6 e15861) v4 currently waiting for subops from [15,9]
> > 2014-09-06 19:14:25.977474 osd.24 10.168.7.23:6819/10217 723 : [WRN] 7
> > slow requests, 6 included below; oldest blocked for > 39.708092 secs
> > 2014-09-06 19:14:25.977480 osd.24 10.168.7.23:6819/10217 724 : [WRN]
> > slow request 30.833974 seconds old, received at 2014-09-06
> > 19:13:55.143456: osd_op(client.18371393.0:1892255
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 262144~8192]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:25.977483 osd.24 10.168.7.23:6819/10217 725 : [WRN]
> > slow request 30.833841 seconds old, received at 2014-09-06
> > 19:13:55.143589: osd_op(client.18371393.0:1892256
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 274432~4096]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:25.977486 osd.24 10.168.7.23:6819/10217 726 : [WRN]
> > slow request 30.833735 seconds old, received at 2014-09-06
> > 19:13:55.143695: osd_op(client.18371393.0:1892257
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 282624~4096]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:25.977490 osd.24 10.168.7.23:6819/10217 727 : [WRN]
> > slow request 30.833664 seconds old, received at 2014-09-06
> > 19:13:55.143766: osd_op(client.18371393.0:1892258
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 299008~8192]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:25.977493 osd.24 10.168.7.23:6819/10217 728 : [WRN]
> > slow request 30.833570 seconds old, received at 2014-09-06
> > 19:13:55.143860: osd_op(client.18371393.0:1892259
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 319488~4096]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:18.955571 osd.19 10.168.7.23:6806/7701 384 : [WRN] 2
> > slow requests, 2 included below; oldest blocked for > 30.915626 secs
> > 2014-09-06 19:14:18.955578 osd.19 10.168.7.23:6806/7701 385 : [WRN]
> > slow request 30.915626 seconds old, received at 2014-09-06
> > 19:13:48.039909: osd_op(client.18781018.0:1953063
> > rbd_data.4d02282eb141f2.0000000000000ce9 [stat,write 2121728~2072576]
> > 3.b68613f e15861) v4 currently waiting for subops from [37,20]
> > 2014-09-06 19:14:18.955581 osd.19 10.168.7.23:6806/7701 386 : [WRN]
> > slow request 30.045546 seconds old, received at 2014-09-06
> > 19:13:48.909989: osd_op(client.18516181.0:8356940
> > rbd_data.4700682eb141f2.000000000000003c [stat,write 1851392~16384]
> > 3.d5b87cff e15861) v4 currently waiting for subops from [8,37]
> > 2014-09-06 19:14:26.977710 osd.24 10.168.7.23:6819/10217 729 : [WRN] 7
> > slow requests, 1 included below; oldest blocked for > 40.708340 secs
> > 2014-09-06 19:14:26.977715 osd.24 10.168.7.23:6819/10217 730 : [WRN]
> > slow request 31.081591 seconds old, received at 2014-09-06
> > 19:13:55.896087: osd_op(client.12483526.0:35096707
> > rbd_data.4b8ea12eb141f2.000000000000415a [stat,write 4169728~4096]
> > 3.e8e9faa6 e15861) v4 currently waiting for subops from [15,9]
> > 2014-09-06 19:14:32.978731 osd.24 10.168.7.23:6819/10217 731 : [WRN] 8
> > slow requests, 1 included below; oldest blocked for > 46.709336 secs
> > 2014-09-06 19:14:32.978738 osd.24 10.168.7.23:6819/10217 732 : [WRN]
> > slow request 30.115926 seconds old, received at 2014-09-06
> > 19:14:02.862748: osd_op(client.18780177.0:320536
> > rbd_data.5501a62eb141f2.0000000000000057 [stat,write 1327104~4096]
> > 3.bbb832a6 e15861) v4 currently waiting for subops from [15,9]
> > 2014-09-06 19:14:35.979239 osd.24 10.168.7.23:6819/10217 733 : [WRN]
> > 14 slow requests, 6 included below; oldest blocked for > 49.709837
> > secs 2014-09-06 19:14:35.979244 osd.24 10.168.7.23:6819/10217 734 :
> > [WRN] slow request 30.825477 seconds old, received at 2014-09-06
> > 19:14:05.153698: osd_op(client.18371393.0:1892404
> > rbd_data.117d9df2ae8944a.0000000000000142 [write 3932160~4096]
> > 14.4822de9b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:35.979248 osd.24 10.168.7.23:6819/10217 735 : [WRN]
> > slow request 30.825370 seconds old, received at 2014-09-06
> > 19:14:05.153805: osd_op(client.18371393.0:1892405
> > rbd_data.117d9df2ae8944a.0000000000000142 [write 3940352~8192]
> > 14.4822de9b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:35.979251 osd.24 10.168.7.23:6819/10217 736 : [WRN]
> > slow request 30.825236 seconds old, received at 2014-09-06
> > 19:14:05.153939: osd_op(client.18371393.0:1892406
> > rbd_data.117d9df2ae8944a.0000000000000142 [write 3960832~4096]
> > 14.4822de9b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:35.979253 osd.24 10.168.7.23:6819/10217 737 : [WRN]
> > slow request 30.823889 seconds old, received at 2014-09-06
> > 19:14:05.155286: osd_op(client.18371393.0:1892420
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 0~4096] 14.7569a69b
> > e15861) v4 currently waiting for subops from [15] 2014-09-06
> > 19:14:35.979258 osd.24 10.168.7.23:6819/10217 738 : [WRN] slow request
> > 30.823785 seconds old, received at 2014-09-06 19:14:05.155390:
> > osd_op(client.18371393.0:1892421
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 8192~8192]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:14:36.979441 osd.24 10.168.7.23:6819/10217 739 : [WRN]
> > 14 slow requests, 1 included below; oldest blocked for > 50.710081
> > secs 2014-09-06 19:14:36.979446 osd.24 10.168.7.23:6819/10217 740 :
> > [WRN] slow request 31.823918 seconds old, received at 2014-09-06
> > 19:14:05.155501: osd_op(client.18371393.0:1892422
> > rbd_data.117d9df2ae8944a.0000000000000141 [write 28672~4096]
> > 14.7569a69b e15861) v4 currently waiting for subops from [15]
> > 2014-09-06 19:15:24.770188 osd.28 10.168.7.23:6843/14566 774 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.206086 secs
> > 2014-09-06 19:15:24.770194 osd.28 10.168.7.23:6843/14566 775 : [WRN]
> > slow request 30.206086 seconds old, received at 2014-09-06
> > 19:14:54.564079: osd_op(client.9266625.0:147034227
> > rbd_data.42d6792eb141f2.000000000000f83f [stat,write 0~4194304]
> > 3.4ff6bd9b e15861) v4 currently waiting for subops from [38,13]
> > 2014-09-06 19:15:26.770435 osd.28 10.168.7.23:6843/14566 776 : [WRN] 2
> > slow requests, 1 included below; oldest blocked for > 32.206328 secs
> > 2014-09-06 19:15:26.770440 osd.28 10.168.7.23:6843/14566 777 : [WRN]
> > slow request 30.663403 seconds old, received at 2014-09-06
> > 19:14:56.107004: osd_op(client.18780574.0:1021249
> > rbd_data.967d562eb141f2.00000000000042c0 [stat,write 3117056~4096]
> > 3.186b7c1 e15861) v4 currently waiting for subops from [44,38]
> > 2014-09-06 19:24:14.883152 osd.23 10.168.7.23:6814/9627 386 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.861143 secs
> > 2014-09-06 19:24:14.883156 osd.23 10.168.7.23:6814/9627 387 : [WRN]
> > slow request 30.861143 seconds old, received at 2014-09-06
> > 19:23:44.021986: osd_op(client.18781018.0:2004280
> > rbd_data.4d02282eb141f2.0000000000000ba8 [stat,write 2506752~380928]
> > 3.7a4382b e15861) v4 currently waiting for subops from [33,1]
> > 2014-09-06 19:25:22.837826 osd.28 10.168.7.23:6843/14566 779 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.127923 secs
> > 2014-09-06 19:25:22.837837 osd.28 10.168.7.23:6843/14566 780 : [WRN]
> > slow request 30.127923 seconds old, received at 2014-09-06
> > 19:24:52.709881: osd_op(client.18781018.0:2010421
> > rbd_data.4d02282eb141f2.0000000000000c9f [stat,write 1282048~2912256]
> > 3.61cd6bc1 e15861) v4 currently waiting for subops from [44,38]
> > 2014-09-06 19:26:05.035047 osd.19 10.168.7.23:6806/7701 388 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.488605 secs
> > 2014-09-06 19:26:05.035052 osd.19 10.168.7.23:6806/7701 389 : [WRN]
> > slow request 30.488605 seconds old, received at 2014-09-06
> > 19:25:34.546419: osd_op(client.18781018.0:2027547
> > rbd_data.4d02282eb141f2.0000000000000ce9 [stat,write 2318336~1875968]
> > 3.b68613f e15861) v4 currently waiting for subops from [37,20]
> > 2014-09-06 19:27:55.381526 osd.22 10.168.7.23:6810/8748 1415 : [WRN] 1
> > slow requests, 1 included below; oldest blocked for > 30.675678 secs
> > 2014-09-06 19:27:55.381531 osd.22 10.168.7.23:6810/8748 1416 : [WRN]
> > slow request 30.675678 seconds old, received at 2014-09-06
> > 19:27:24.705808: osd_op(client.18516184.0:12873003
> > rbd_data.47006e2eb141f2.000000000000f839 [stat,write 1142784~1310720]
> > 3.8520ce00 e15861) v4 currently waiting for subops from [49,46]
> > 2014-09-06 19:27:57.381832 osd.22 10.168.7.23:6810/8748 1417 : [WRN] 2
> > slow requests, 1 included below; oldest blocked for > 32.676003 secs
> > 2014-09-06 19:27:57.381837 osd.22 10.168.7.23:6810/8748 1418 : [WRN]
> > slow request 30.559596 seconds old, received at 2014-09-06
> > 19:27:26.822215: osd_op(client.12929758.0:55398694
> > rbd_data.b162f22eb141f2.0000000000005070 [stat,write 3178496~16384]
> > 3.320d72fc e15861) v4 currently waiting for subops from [46,34]
> > 2014-09-06 19:28:00.382119 osd.22 10.168.7.23:6810/8748 1419 : [WRN] 3
> > slow requests, 1 included below; oldest blocked for > 35.676289 secs
> > 2014-09-06 19:28:00.382124 osd.22 10.168.7.23:6810/8748 1420 : [WRN]
> > slow request 30.511158 seconds old, received at 2014-09-06
> > 19:27:29.870939: osd_op(client.9757754.0:74509765
> > rbd_data.8a8ed52eb141f2.0000000000003bd3 [stat,write 3407872~4096]
> > 3.5feb5e00 e15861) v4 currently waiting for subops from [49,46]
> > 
> > Did the obvious tallying as well:
> > 
> > sudo grep 'slow request' /var/log/ceph/ceph.log | awk '{print $3}' |
> > sort | uniq -c | sort -n | tail
> >    1143 osd.4
> >    1268 osd.19
> >    2007 osd.25
> >    2183 osd.26
> >    2229 osd.27
> >    2783 osd.24
> >    4037 osd.17
> >    4137 osd.23
> >    4174 osd.22
> >    4986 osd.28
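> > 
> > Counting the OSDs being waited on (the subops targets) might be more
> > telling; a rough sketch along the same lines:
> > 
> > sudo grep 'waiting for subops from' /var/log/ceph/ceph.log | \
> >   grep -o 'subops from \[[0-9,]*\]' | tr -dc '0-9,\n' | tr ',' '\n' | \
> >   sort -n | uniq -c | sort -n | tail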
> > 
> > It doesn't seem that I actually gain much by stopping the most
> > frequently reported OSD.
> > 
> > I read somewhere that you could actually have an object file that
> > messed everything up, so while that OSD was running you got very low
> > performance. No idea how to find that though.
> > 
> > Regards,
> > Josef
> > 
> >>> 
> >>>>>> Looking through the OSDs to pinpoint a source of the degradation
> >>>>>> and hoping the current load will be lowered.
> >>>>>> 
> >>>>> 
> >>>>> You're the one looking at your cluster, the iostat, atop, iotop and
> >>>>> whatnot data.
> >>>>> If one particular OSD/disk stands out, investigate it, as per the
> >>>>> "Good way to monitor detailed latency/throughput" thread. 
> >>>>> 
> >>>> Will read it through.
> >>>>> If you have a spare and idle machine that is identical to your
> >>>>> storage nodes, you could run a fio benchmark on a disk there and
> >>>>> then compare the results to that of your suspect disk after
> >>>>> setting your cluster to noout and stopping that particular OSD.
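> >>>>> Something along these lines would do as a starting point (parameters
> >>>>> are just an example; a read test keeps the data on the stopped OSD
> >>>>> intact):
> >>>>> 
> >>>>>   fio --name=probe --filename=/dev/sdX --direct=1 --rw=randread \
> >>>>>       --bs=4k --iodepth=32 --runtime=60 --ioengine=libaio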
> >>>> No spare though, but I have a rough idea of what it should be, which is
> >>>> what I'm going on right now. Right, so the cluster should be fine after
> >>>> I stop the OSD, right? I thought of stopping it for a little while to
> >>>> see if the IO got better afterwards from within the VMs. Not sure how
> >>>> much effect that has though, since it may be waiting for the IO to
> >>>> complete and whatnot.
> >>>>> 
> >>> If you set your cluster to noout, as in "ceph osd set noout" per
> >>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> >>> before shutting down a particular OSD, no data migration will happen.
> >>> 
> >>> Of course you will want to shut it down as little as possible, so
> >>> that recovery traffic when it comes back is minimized. 
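> >>> 
> >>> Roughly (a sketch only; the exact service command depends on your
> >>> distro/init system, and the OSD id is just an example):
> >>> 
> >>>   ceph osd set noout
> >>>   /etc/init.d/ceph stop osd.22     # or: stop ceph-osd id=22 on upstart
> >>>   ... run the fio comparison on its disk ...
> >>>   /etc/init.d/ceph start osd.22
> >>>   ceph osd unset noout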
> >>> 
> >> Good, yes will do this.
> >> Regards,
> >> Josef
> >>> Christian 
> >>> 
> >>>>>> I'm a bit afraid of setting the weight of an OSD to 0; wouldn't it
> >>>>>> be tough if the degradation is still there afterwards? I.e. if I
> >>>>>> set the weight back, would it move all the PGs back?
> >>>>>> 
> >>>>> Of course.
> >>>>> 
> >>>>> Until you can determine that a specific OSD/disk is the culprit,
> >>>>> don't do that. 
> >>>>> If you have the evidence, go ahead.
> >>>>> 
> >>>> Great, that's what I thought as well.
> >>>>> Regards,
> >>>>> 
> >>>>> Christian
> >>>>> 
> >>>>>> Regards,
> >>>>>> Josef
> >>>>>> 
> >>>>>> On 06 Sep 2014, at 15:52, Josef Johansson <josef at oderland.se>
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> FWIW I did restart the OSDs until I saw a server that made an impact.
> >>>>>>> Until that server stopped making an impact, the number of degraded
> >>>>>>> objects didn't go down. After a while it was done recovering that OSD
> >>>>>>> and happily started with others. I guess I will be seeing the same
> >>>>>>> behaviour when it gets to replicating the same PGs that were causing
> >>>>>>> trouble the first time.
> >>>>>>> 
> >>>>>>> On 06 Sep 2014, at 15:04, Josef Johansson <josef at oderland.se>
> >>>>>>> wrote:
> >>>>>>> 
> >>>>>>>> Actually, restarting only worked for a period of time to get the
> >>>>>>>> recovery process going. Can't get past the 21k object mark.
> >>>>>>>> 
> >>>>>>>> I'm uncertain whether the disk really is messing this up right now
> >>>>>>>> as well. So I'm not keen on starting to move 300k objects around.
> >>>>>>>> 
> >>>>>>>> Regards,
> >>>>>>>> Josef
> >>>>>>>> 
> >>>>>>>> On 06 Sep 2014, at 14:33, Josef Johansson <josef at oderland.se>
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> Hi,
> >>>>>>>>> 
> >>>>>>>>> On 06 Sep 2014, at 13:53, Christian Balzer <chibi at gol.com>
> >>>>>>>>> wrote:
> >>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> Hello,
> >>>>>>>>>> 
> >>>>>>>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> Also putting this on the list.
> >>>>>>>>>>> 
> >>>>>>>>>>> On 06 Sep 2014, at 13:36, Josef Johansson <josef at oderland.se>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Same issues again, but I think we found the drive that
> >>>>>>>>>>>> causes the problems.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> But this is causing problems, as it's trying to recover onto
> >>>>>>>>>>>> that OSD at the moment.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> So we're left with the status message:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860
> >>>>>>>>>>>> pgs: 6841 active+clean, 19 active+remapped+backfilling;
> >>>>>>>>>>>> 12299 GB data, 36882 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>> 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 degraded
> >>>>>>>>>>>> (0.274%); recovering 0 o/s, 2035KB/s
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> It's improving, but way too slowly. If I restart the
> >>>>>>>>>>>> recovery (ceph osd set norecover, then unset) it doesn't
> >>>>>>>>>>>> change the OSD from what I can see.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Any ideas?
> >>>>>>>>>>>> 
> >>>>>>>>>> I don't know the state of your cluster, i.e. what caused the
> >>>>>>>>>> recovery to start (how many OSDs went down?).
> >>>>>>>>> Performance degradation; databases are the worst impacted. It's
> >>>>>>>>> actually an OSD that we put in that's causing it (removed it
> >>>>>>>>> again though). So the cluster in itself is healthy.
> >>>>>>>>> 
> >>>>>>>>>> If you have a replication of 3 and only one OSD was involved,
> >>>>>>>>>> what is stopping you from taking that wonky drive/OSD out?
> >>>>>>>>>> 
> >>>>>>>>> There's data that goes missing if I do that; I guess I have to
> >>>>>>>>> wait for the recovery process to complete before I can go any
> >>>>>>>>> further. This is with rep 3.
> >>>>>>>>>> If you don't know that or want to play it safe, how about
> >>>>>>>>>> setting the weight of that OSD to 0? 
> >>>>>>>>>> While that will AFAICT still result in all primary PGs being
> >>>>>>>>>> evacuated off it, no more writes will happen to it and reads
> >>>>>>>>>> might be faster. In either case, it shouldn't slow down the
> >>>>>>>>>> rest of your cluster anymore.
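> >>>>>>>>>> (That would be something along the lines of "ceph osd reweight
> >>>>>>>>>> <osd-id> 0", i.e. the override weight, not the CRUSH weight.)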
> >>>>>>>>>> 
> >>>>>>>>> That's actually one idea I haven't thought of. I want to
> >>>>>>>>> play it safe right now and hope that it goes up again. I
> >>>>>>>>> actually found one wonky way of keeping the recovery process
> >>>>>>>>> from grinding to a halt, and that was restarting OSDs.
> >>>>>>>>> One at a time.
> >>>>>>>>> 
> >>>>>>>>> Regards,
> >>>>>>>>> Josef
> >>>>>>>>>> Regards,
> >>>>>>>>>> 
> >>>>>>>>>> Christian
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Josef
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On 05 Sep 2014, at 11:26, Luis Periquito
> >>>>>>>>>>>> <luis.periquito at ocado.com> wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> The only time I saw such behaviour was when I was deleting a
> >>>>>>>>>>>>> big chunk of data from the cluster: all the client
> >>>>>>>>>>>>> activity was reduced, the op/s were almost non-existent
> >>>>>>>>>>>>> and there were unjustified delays all over the cluster. But
> >>>>>>>>>>>>> all the disks were somewhat busy in atop/iostat.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On 5 September 2014 09:51, David <david at visions.se> wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Indeed strange.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> That output was when we had issues, seems that most
> >>>>>>>>>>>>> operations were blocked / slow requests.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> A "baseline" output is more like today:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9273KB/s rd, 24650KB/s wr, 2755op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s wr, 2294op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 12511KB/s rd, 15739KB/s wr,
> >>>>>>>>>>>>> 2488op/s 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd,
> >>>>>>>>>>>>> 14880KB/s wr, 2609op/s 2014-09-05 10:44:34.187294 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582764: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 17720KB/s
> >>>>>>>>>>>>> rd, 22964KB/s wr, 3257op/s 2014-09-05 10:44:35.190785
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05
> >>>>>>>>>>>>> 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 17630KB/s rd, 18855KB/s wr, 3131op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 12262KB/s rd, 18627KB/s wr,
> >>>>>>>>>>>>> 2595op/s 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582768: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 17697KB/s rd,
> >>>>>>>>>>>>> 17572KB/s wr, 2156op/s 2014-09-05 10:44:39.239409 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582769: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 20300KB/s
> >>>>>>>>>>>>> rd, 19735KB/s wr, 2197op/s 2014-09-05 10:44:40.260423
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582770: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 14656KB/s rd, 15460KB/s wr, 2199op/s 2014-09-05
> >>>>>>>>>>>>> 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 8969KB/s rd, 11918KB/s wr, 1951op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:42.276192 mon.0 [INF] pgmap v12582772:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 7272KB/s rd, 10644KB/s wr, 1832op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9316KB/s rd, 16610KB/s wr, 2412op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:44.295469 mon.0 [INF] pgmap v12582774:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9257KB/s rd, 19953KB/s wr, 2633op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:45.315774 mon.0 [INF] pgmap v12582775:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 9718KB/s rd, 14298KB/s wr, 2101op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:46.326783 mon.0 [INF] pgmap v12582776:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 20877KB/s rd, 12822KB/s wr,
> >>>>>>>>>>>>> 2447op/s 2014-09-05 10:44:47.327537 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582777: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 18447KB/s rd,
> >>>>>>>>>>>>> 12945KB/s wr, 2226op/s 2014-09-05 10:44:48.348725 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582778: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 15514KB/s
> >>>>>>>>>>>>> rd, 19366KB/s wr, 1892op/s 2014-09-05 10:44:49.357537
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582779: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 16640KB/s rd, 19694KB/s wr, 1887op/s 2014-09-05
> >>>>>>>>>>>>> 10:44:50.365016 mon.0 [INF] pgmap v12582780: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 9139KB/s rd, 10734KB/s wr, 1566op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:51.385553 mon.0 [INF] pgmap v12582781:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 11493KB/s rd, 13827KB/s wr,
> >>>>>>>>>>>>> 1928op/s 2014-09-05 10:44:52.389189 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582782: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 12922KB/s rd,
> >>>>>>>>>>>>> 12814KB/s wr, 1670op/s 2014-09-05 10:44:53.403735 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582783: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 14880KB/s
> >>>>>>>>>>>>> rd, 17532KB/s wr, 2145op/s 2014-09-05 10:44:54.410581
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582784: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 15419KB/s rd, 19696KB/s wr, 2259op/s 2014-09-05
> >>>>>>>>>>>>> 10:44:55.420308 mon.0 [INF] pgmap v12582785: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 10964KB/s rd, 15739KB/s wr, 1701op/s
> >>>>>>>>>>>>> 2014-09-05 10:44:56.445017 mon.0 [INF] pgmap v12582786:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 16195KB/s rd, 15554KB/s wr,
> >>>>>>>>>>>>> 2444op/s 2014-09-05 10:44:57.452136 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582787: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 15016KB/s rd,
> >>>>>>>>>>>>> 16356KB/s wr, 2358op/s 2014-09-05 10:44:58.465958 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582788: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 11668KB/s
> >>>>>>>>>>>>> rd, 18443KB/s wr, 2029op/s 2014-09-05 10:44:59.483462
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582789: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 12253KB/s rd, 10846KB/s wr, 1529op/s 2014-09-05
> >>>>>>>>>>>>> 10:45:00.492322 mon.0 [INF] pgmap v12582790: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 12247KB/s rd, 7084KB/s wr, 1464op/s
> >>>>>>>>>>>>> 2014-09-05 10:45:01.516581 mon.0 [INF] pgmap v12582791:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 16460KB/s rd, 12089KB/s wr,
> >>>>>>>>>>>>> 2537op/s 2014-09-05 10:45:02.527110 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582792: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 13382KB/s rd,
> >>>>>>>>>>>>> 15080KB/s wr, 2563op/s 2014-09-05 10:45:03.538090 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582793: 6860 pgs: 6860 active+clean; 12253
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 10902KB/s
> >>>>>>>>>>>>> rd, 18745KB/s wr, 2863op/s 2014-09-05 10:45:04.558261
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582794: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 10850KB/s rd, 15995KB/s wr, 2695op/s 2014-09-05
> >>>>>>>>>>>>> 10:45:05.565750 mon.0 [INF] pgmap v12582795: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 9636KB/s rd, 13262KB/s wr, 2372op/s
> >>>>>>>>>>>>> 2014-09-05 10:45:06.593984 mon.0 [INF] pgmap v12582796:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 18808KB/s rd, 19329KB/s wr,
> >>>>>>>>>>>>> 3819op/s 2014-09-05 10:45:07.595866 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582797: 6860 pgs: 6860 active+clean; 12253 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 21265KB/s rd,
> >>>>>>>>>>>>> 20743KB/s wr, 3861op/s 2014-09-05 10:45:08.624949 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582798: 6860 pgs: 6860 active+clean; 12254
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 20114KB/s
> >>>>>>>>>>>>> rd, 18543KB/s wr, 3248op/s 2014-09-05 10:45:09.627901
> >>>>>>>>>>>>> mon.0 [INF] pgmap v12582799: 6860 pgs: 6860 active+clean;
> >>>>>>>>>>>>> 12254 GB data, 36574 GB used, 142 TB / 178 TB avail;
> >>>>>>>>>>>>> 14717KB/s rd, 15141KB/s wr, 2302op/s 2014-09-05
> >>>>>>>>>>>>> 10:45:10.643234 mon.0 [INF] pgmap v12582800: 6860 pgs:
> >>>>>>>>>>>>> 6860 active+clean; 12254 GB data, 36574 GB used, 142 TB /
> >>>>>>>>>>>>> 178 TB avail; 8328KB/s rd, 13950KB/s wr, 1919op/s
> >>>>>>>>>>>>> 2014-09-05 10:45:11.651602 mon.0 [INF] pgmap v12582801:
> >>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12254 GB data, 36574 GB used,
> >>>>>>>>>>>>> 142 TB / 178 TB avail; 16978KB/s rd, 15921KB/s wr,
> >>>>>>>>>>>>> 3377op/s 2014-09-05 10:45:12.674819 mon.0 [INF] pgmap
> >>>>>>>>>>>>> v12582802: 6860 pgs: 6860 active+clean; 12254 GB data,
> >>>>>>>>>>>>> 36574 GB used, 142 TB / 178 TB avail; 16471KB/s rd,
> >>>>>>>>>>>>> 14034KB/s wr, 3379op/s 2014-09-05 10:45:13.688080 mon.0
> >>>>>>>>>>>>> [INF] pgmap v12582803: 6860 pgs: 6860 active+clean; 12254
> >>>>>>>>>>>>> GB data, 36574 GB used, 142 TB / 178 TB avail; 16149KB/s
> >>>>>>>>>>>>> rd, 12657KB/s wr, 2734op/s
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Aye, we actually saw latency on the disks go up a bit when
> >>>>>>>>>>>>> we had 128GB of RAM on the OSDs and decided to beef them
> >>>>>>>>>>>>> up to 256GB, which helped. They're running different
> >>>>>>>>>>>>> workloads (shared hosting) but we've never encountered the
> >>>>>>>>>>>>> issue we had yesterday, even during our
> >>>>>>>>>>>>> testing/benchmarking.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>> David
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On 5 Sep 2014, at 09:05, Christian Balzer
> >>>>>>>>>>>>> <chibi at gol.com> wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Fri, 5 Sep 2014 08:26:47 +0200 David wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Sorry for the lack of information yesterday; this was
> >>>>>>>>>>>>>>> "solved" after some 30 minutes, after having
> >>>>>>>>>>>>>>> reloaded/restarted all OSD daemons. Unfortunately we
> >>>>>>>>>>>>>>> couldn't pinpoint it to a single OSD or drive; all
> >>>>>>>>>>>>>>> drives seemed OK, some had a bit higher latency and we
> >>>>>>>>>>>>>>> tried to out / in them to see if it had any effect, which
> >>>>>>>>>>>>>>> it did not.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is odd.
> >>>>>>>>>>>>>> Having it "fixed" by restarting all OSDs would suggest
> >>>>>>>>>>>>>> either a software problem (bug) with Ceph or some
> >>>>>>>>>>>>>> resource other than the storage system being starved. But
> >>>>>>>>>>>>>> memory seems unlikely, even with bloated, leaking OSD
> >>>>>>>>>>>>>> daemons. And CPU seems even less likely.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> The cluster consists of 3 mon servers, 5 OSD servers
> >>>>>>>>>>>>>>> with 10 enterprise HDDs backed with 2 S3700 SSDs for
> >>>>>>>>>>>>>>> journals each. OSD servers have 256GB of RAM, 2x E5-2630
> >>>>>>>>>>>>>>> v2 @ 2.60GHz CPUs.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> The log that I posted yesterday was just a small taste
> >>>>>>>>>>>>>>> of the full one ;) They were all pointing to different
> >>>>>>>>>>>>>>> OSDs that they were waiting for. We're also monitoring
> >>>>>>>>>>>>>>> all of the VMs running on KVM, and we didn't see any
> >>>>>>>>>>>>>>> exceptional throughput or IOPS usage before or during
> >>>>>>>>>>>>>>> this event. We were checking iostat etc. and nothing was
> >>>>>>>>>>>>>>> out of the ordinary.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Going to double-check SMART and also see if we can
> >>>>>>>>>>>>>>> offload some of the cluster in any way. If you have any
> >>>>>>>>>>>>>>> other advice that'd be appreciated :)
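> >>>>>>>>>>>>>>> (For SMART, presumably just "smartctl -a /dev/sdX" per drive,
> >>>>>>>>>>>>>>> or a quick "smartctl -t short /dev/sdX" self-test; device
> >>>>>>>>>>>>>>> names are only examples.)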
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> If the mon output below is typical for your cluster
> >>>>>>>>>>>>>> (634KB/s rd, 487KB/s wr, 90op/s) it should be able to
> >>>>>>>>>>>>>> handle that w/o breaking a sweat. I guestimate it at
> >>>>>>>>>>>>>> about 13K write IOPS and your reads should (after a
> >>>>>>>>>>>>>> while) come mostly from the nice big pagecache you have
> >>>>>>>>>>>>>> on those 256GB RAM machines.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Christian
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Thanks for your help!
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>>>> David
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> On 5 Sep 2014, at 07:30, Martin B Nielsen
> >>>>>>>>>>>>>>> <martin at unity3d.com> wrote:
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Just echoing what Christian said.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Also, IIRC the "currently waiting for subops from [" could
> >>>>>>>>>>>>>>>> also mean a problem on those OSDs, as it waits for an ack
> >>>>>>>>>>>>>>>> from them (I might remember wrong).
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> If that is the case you might want to check in on osd
> >>>>>>>>>>>>>>>> 13 & 37 as well.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> With the cluster load and size you should not have this
> >>>>>>>>>>>>>>>> problem; I'm pretty sure you're dealing with a
> >>>>>>>>>>>>>>>> rogue/faulty osd/node somewhere.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Martin
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer
> >>>>>>>>>>>>>>>> <chibi at gol.com> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Thu, 4 Sep 2014 12:02:13 +0200 David wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> We're running a ceph cluster with version:
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> 0.67.7-1~bpo70+1
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> All of a sudden we're having issues with the cluster
> >>>>>>>>>>>>>>>>> (running RBD images for kvm) with slow requests on all
> >>>>>>>>>>>>>>>>> of the OSD servers. Any idea why and how to fix it?
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> You give us a Ceph version at least, but for anybody to
> >>>>>>>>>>>>>>>> make guesses we need much more information than a log
> >>>>>>>>>>>>>>>> spew.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> How many nodes/OSDs, OS, hardware, OSD details (FS,
> >>>>>>>>>>>>>>>> journals on SSDs), etc.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Run atop (in a sufficiently large terminal) on all your
> >>>>>>>>>>>>>>>> nodes, see if you can spot a bottleneck, like a disk
> >>>>>>>>>>>>>>>> being at 100% all the time with a much higher avio than
> >>>>>>>>>>>>>>>> the others. Looking at your logs, I'd pay particular
> >>>>>>>>>>>>>>>> attention to the disk holding osd.22.
> >>>>>>>>>>>>>>>> A single slow disk can bring a whole large cluster to a
> >>>>>>>>>>>>>>>> crawl. If you're using a hardware controller with a
> >>>>>>>>>>>>>>>> battery-backed cache, check that it is fine; loss of
> >>>>>>>>>>>>>>>> the battery would switch from writeback to writethrough
> >>>>>>>>>>>>>>>> and massively slow down IOPS.
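> >>>>>>>>>>>>>>>> (On LSI/MegaRAID controllers that would be something like
> >>>>>>>>>>>>>>>> "megacli -AdpBbuCmd -GetBbuStatus -aALL"; other vendors
> >>>>>>>>>>>>>>>> have their own equivalents.)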
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Christian
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> 2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451:
> >>>>>>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12163 GB data, 36308 GB
> >>>>>>>>>>>>>>>>> used, 142 TB / 178 TB avail; 634KB/s rd, 487KB/s wr,
> >>>>>>>>>>>>>>>>> 90op/s 2014-09-04 11:56:29.510270 osd.22 [WRN] 15 slow
> >>>>>>>>>>>>>>>>> requests, 1 included below; oldest blocked for
> >>>>>>>>>>>>>>>>>> 44.745754 secs 2014-09-04 11:56:29.510276 osd.22 [WRN]
> >>>>>>>>>>>>>>>>>> slow request 30.999821 seconds
> >>>>>>>>>>>>>>>>> old, received at 2014-09-04 11:55:58.510424:
> >>>>>>>>>>>>>>>>> osd_op(client.10731617.0:81868956
> >>>>>>>>>>>>>>>>> rbd_data.967e022eb141f2.0000000000000e72 [write
> >>>>>>>>>>>>>>>>> 0~4194304] 3.c585cebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:30.510528 osd.22
> >>>>>>>>>>>>>>>>> [WRN] 21 slow requests, 6 included below; oldest
> >>>>>>>>>>>>>>>>> blocked for > 45.745989 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:30.510534 osd.22 [WRN] slow request 30.122555
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.387925:
> >>>>>>>>>>>>>>>>> osd_op(client.13425082.0:11962345
> >>>>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>>>>> 1114112~8192] 3.c9e49140 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [13,42] 2014-09-04 11:56:30.510537
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 30.122362 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.388118:
> >>>>>>>>>>>>>>>>> osd_op(client.13425082.0:11962352
> >>>>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>>>>> 1134592~4096] 3.c9e49140 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [13,42] 2014-09-04 11:56:30.510541
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 30.122298 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.388182:
> >>>>>>>>>>>>>>>>> osd_op(client.13425082.0:11962353
> >>>>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>>>>> 4046848~8192] 3.c9e49140 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [13,42] 2014-09-04 11:56:30.510544
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 30.121577 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.388903:
> >>>>>>>>>>>>>>>>> osd_op(client.13425082.0:11962374
> >>>>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
> >>>>>>>>>>>>>>>>> 2527232~4096] 3.cd9a9015 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [45,1] 2014-09-04 11:56:30.510548
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 30.121518 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.388962:
> >>>>>>>>>>>>>>>>> osd_op(client.13425082.0:11962375
> >>>>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
> >>>>>>>>>>>>>>>>> 3133440~4096] 3.cd9a9015 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [45,1] 2014-09-04 11:56:31.510706
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 26 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 46.746163 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:31.510711 osd.22 [WRN] slow request 31.035418
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.475236:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900595
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2097152~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:31.510715
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.035335 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475319:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900596
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2162688~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:31.510718
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.035270 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475384:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900597
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2400256~16384] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:31.510721
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.035093 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475561:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900598
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2420736~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:31.510724
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.034990 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475664:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900599
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2428928~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510864
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 28 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 47.746324 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:32.510871 osd.22 [WRN] slow request 32.035026
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.475789:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900600
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2453504~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510876
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 32.034982 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475833:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900601
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2502656~16384] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510881
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 32.034917 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.475898:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900602
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2600960~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510885
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 32.034807 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476008:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900603
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2617344~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510890
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 32.034756 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476059:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900604
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2637824~69632] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.723329
> >>>>>>>>>>>>>>>>> osd.17 [WRN] 4 slow requests, 4 included below; oldest
> >>>>>>>>>>>>>>>>> blocked for > 30.605794 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:32.723335 osd.17 [WRN] slow request 30.605794
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:02.117495:
> >>>>>>>>>>>>>>>>> osd_op(client.12646411.0:31118135
> >>>>>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4b [stat,write
> >>>>>>>>>>>>>>>>> 0~4194304] 3.716b373b e13901) v4 currently waiting for
> >>>>>>>>>>>>>>>>> subops from [31,1] 2014-09-04 11:56:32.723338 osd.17
> >>>>>>>>>>>>>>>>> [WRN] slow request 30.362042 seconds old, received at
> >>>>>>>>>>>>>>>>> 2014-09-04 11:56:02.361247:
> >>>>>>>>>>>>>>>>> osd_op(client.12485793.0:121786489
> >>>>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.000000000000e200 [stat,write
> >>>>>>>>>>>>>>>>> 2834432~4096] 3.6b43de72 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [5,31] 2014-09-04 11:56:32.723341
> >>>>>>>>>>>>>>>>> osd.17 [WRN] slow request 30.179681 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:02.543608:
> >>>>>>>>>>>>>>>>> osd_op(client.13425079.0:7014393
> >>>>>>>>>>>>>>>>> rbd_data.54f2522eb141f2.0000000000014d01 [stat,write
> >>>>>>>>>>>>>>>>> 2502656~4096] 3.d8ceea89 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [31,18] 2014-09-04 11:56:32.723344
> >>>>>>>>>>>>>>>>> osd.17 [WRN] slow request 30.000568 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:02.722721:
> >>>>>>>>>>>>>>>>> osd_op(client.9281279.0:144230325
> >>>>>>>>>>>>>>>>> rbd_data.2d99fc2eb141f2.000000000000e80a [stat,write
> >>>>>>>>>>>>>>>>> 1589248~4096] 3.16dc0e89 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [31,18] 2014-09-04 11:56:33.511043
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 33 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 48.746486 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:33.511049 osd.22 [WRN] slow request 33.034756
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.476221:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900605
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2715648~8192] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:33.511053
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 33.034620 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476357:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900606
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2756608~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:33.511056
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 33.034446 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476531:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900607
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2809856~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:33.511060
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 33.034387 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476590:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900608
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2883584~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:33.511063
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 33.034341 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476636:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900609
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2891776~8192] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511206
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 38 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 49.746664 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:34.511215 osd.22 [WRN] slow request 34.034426
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.476729:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900610
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2908160~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511219
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.034298 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476857:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900611
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2932736~28672] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511222
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.034234 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.476921:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900612
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2969600~8192] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511225
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.034147 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.477008:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900613
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2981888~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511228
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.034055 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.477100:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900614
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 2998272~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.153813
> >>>>>>>>>>>>>>>>> osd.37 [WRN] 1 slow requests, 1 included below; oldest
> >>>>>>>>>>>>>>>>> blocked for > 30.307864 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:35.153820 osd.37 [WRN] slow request 30.307864
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:04.845917:
> >>>>>>>>>>>>>>>>> osd_op(client.10785413.0:129409910
> >>>>>>>>>>>>>>>>> rbd_data.96803f2eb141f2.00000000000042b0 [stat,write
> >>>>>>>>>>>>>>>>> 1712128~4096] 3.6c5ac950 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [22,41] 2014-09-04 11:56:35.511450
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 42 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 50.746638 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:35.511455 osd.22 [WRN] slow request 35.034219
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:00.477178:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900615
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 3022848~8192] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.511459
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 35.034143 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.477254:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900616
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 3035136~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.511462
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 35.034082 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.477315:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900617
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 3055616~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.511465
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 35.033965 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:00.477432:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135900618
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>>>>> 3063808~4096] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.511469
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.162275 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:01.349122:
> >>>>>>>>>>>>>>>>> osd_op(client.12485646.0:110385143
> >>>>>>>>>>>>>>>>> rbd_data.4ccab12eb141f2.0000000000015b66 [stat,write
> >>>>>>>>>>>>>>>>> 909312~4096] 3.9c416abe e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:36.511603
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 34 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 51.746797 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:36.511609 osd.22 [WRN] slow request 34.401476
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:02.110080:
> >>>>>>>>>>>>>>>>> osd_op(client.12485643.0:117955489
> >>>>>>>>>>>>>>>>> rbd_data.4aad332eb141f2.0000000000002c00 [stat,write
> >>>>>>>>>>>>>>>>> 2531328~4096] 3.b985ec92 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [33,1] 2014-09-04 11:56:36.511612
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.149511 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:02.362045:
> >>>>>>>>>>>>>>>>> osd_op(client.12485793.0:121786505
> >>>>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write
> >>>>>>>>>>>>>>>>> 3342336~4096] 3.26907492 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [33,1] 2014-09-04 11:56:36.511615
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 34.329528 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:02.182028:
> >>>>>>>>>>>>>>>>> osd_op(client.12646411.0:31118266
> >>>>>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4f [stat,write
> >>>>>>>>>>>>>>>>> 0~4194304] 3.168c978d e13901) v4 currently waiting for
> >>>>>>>>>>>>>>>>> subops from [49,29] 2014-09-04 11:56:36.511618 osd.22
> >>>>>>>>>>>>>>>>> [WRN] slow request 31.871976 seconds old, received at
> >>>>>>>>>>>>>>>>> 2014-09-04 11:56:04.639580:
> >>>>>>>>>>>>>>>>> osd_op(client.12485793.0:121786506
> >>>>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write
> >>>>>>>>>>>>>>>>> 0~4096] 3.26907492 e13901) v4 currently waiting for
> >>>>>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511621 osd.22
> >>>>>>>>>>>>>>>>> [WRN] slow request 31.008867 seconds old, received at
> >>>>>>>>>>>>>>>>> 2014-09-04 11:56:05.502689:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135901516
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
> >>>>>>>>>>>>>>>>> 2682880~4096] 3.32067b97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29] 2014-09-04 11:56:36.723882
> >>>>>>>>>>>>>>>>> osd.17 [WRN] 5 slow requests, 1 included below; oldest
> >>>>>>>>>>>>>>>>> blocked for > 34.606349 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:36.723887 osd.17 [WRN] slow request 30.603057
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:06.120787:
> >>>>>>>>>>>>>>>>> osd_op(client.12483520.0:11473870
> >>>>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000004600 [stat,write
> >>>>>>>>>>>>>>>>> 2293760~4096] 3.49398689 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [31,18] 2014-09-04 11:56:37.511787
> >>>>>>>>>>>>>>>>> osd.22 [WRN] 24 slow requests, 6 included below;
> >>>>>>>>>>>>>>>>> oldest blocked for > 52.746984 secs 2014-09-04
> >>>>>>>>>>>>>>>>> 11:56:37.511793 osd.22 [WRN] slow request 32.008986
> >>>>>>>>>>>>>>>>> seconds old, received at 2014-09-04 11:56:05.502757:
> >>>>>>>>>>>>>>>>> osd_op(client.9266625.0:135901517
> >>>>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
> >>>>>>>>>>>>>>>>> 2695168~4096] 3.32067b97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29] 2014-09-04 11:56:37.511798
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.749483 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:05.762260:
> >>>>>>>>>>>>>>>>> osd_op(client.9281282.0:55728036
> >>>>>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
> >>>>>>>>>>>>>>>>> 1630208~8192] 3.6625df97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29] 2014-09-04 11:56:37.511801
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.745558 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:05.766185:
> >>>>>>>>>>>>>>>>> osd_op(client.9281282.0:55728038
> >>>>>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
> >>>>>>>>>>>>>>>>> 1941504~4096] 3.6625df97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29] 2014-09-04 11:56:37.511803
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.390945 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:06.120798:
> >>>>>>>>>>>>>>>>> osd_op(client.12483520.0:11473871
> >>>>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
> >>>>>>>>>>>>>>>>> 2424832~8192] 3.47dbbb97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29] 2014-09-04 11:56:37.511806
> >>>>>>>>>>>>>>>>> osd.22 [WRN] slow request 31.390357 seconds old,
> >>>>>>>>>>>>>>>>> received at 2014-09-04 11:56:06.121386:
> >>>>>>>>>>>>>>>>> osd_op(client.12483520.0:11473884
> >>>>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
> >>>>>>>>>>>>>>>>> 2457600~16384] 3.47dbbb97 e13901) v4 currently waiting
> >>>>>>>>>>>>>>>>> for subops from [12,29]
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> David
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> -- 
> >>>>>>>>>>>>> Luis Periquito
> >>>>>>>>>>>>> Unix Engineer
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Ocado.com


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

