Re: osd max scrubs not honored?

On 2017-10-14 05:02, J David wrote:

Thanks all for input on this.

It's taken a couple of weeks, but based on the feedback from the list,
we've got our version of a scrub-one-at-a-time cron script running and
confirmed that it's working properly.
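
(For reference, a minimal sketch of the kind of one-at-a-time deep-scrub
job described here; this is illustrative rather than the actual script,
and the column positions assumed for the deep-scrub timestamp in
"ceph pg dump" may differ between releases:)

#!/bin/bash
# Sketch: deep-scrub one PG at a time from cron (illustrative only).
# Do nothing if any scrub is already in flight anywhere in the cluster.
if ceph pg dump pgs_brief 2>/dev/null | grep -q scrubbing; then
    exit 0
fi
# Otherwise pick the PG with the oldest deep-scrub stamp and kick it off.
# ("ceph pg dump" puts the deep-scrub date and time in the last two
# columns; adjust if your release formats the dump differently.)
pgid=$(ceph pg dump 2>/dev/null | awk '/^[0-9]+\.[0-9a-f]+/ {print $1, $(NF-1), $NF}' \
       | sort -k2,3 | head -n 1 | awk '{print $1}')
[ -n "$pgid" ] && ceph pg deep-scrub "$pgid"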

Unfortunately, this hasn't really solved the real problem.  Even with
just one scrub and one client running, client I/O requests routinely
take 30-60 seconds to complete (read or write), which is so poor that
the cluster is unusable for any sort of interactive activity.  Nobody
is going to sit around and wait 30-60 seconds for a file to save or
load, or for a web server to respond, or a SQL query to finish.

Running "ceph -w" blames this on slow requests blocked for > 32 seconds:

2017-10-13 21:21:34.445798 mon.ceph1 [INF] overall HEALTH_OK
2017-10-13 21:21:51.305661 mon.ceph1 [WRN] Health check failed: 42
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:21:57.311892 mon.ceph1 [WRN] Health check update: 140
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:22:03.343443 mon.ceph1 [WRN] Health check update: 111
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:22:01.833605 osd.5 [WRN] 1 slow requests, 1 included
below; oldest blocked for > 30.526819 secs
2017-10-13 21:22:01.833614 osd.5 [WRN] slow request 30.526819 seconds
old, received at 2017-10-13 21:21:31.306718:
osd_op(client.6104975.0:7330926 0.a2
0:456218c9:::rbd_data.1a24832ae8944a.000000000009d21d:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
2364416~88064] snapc 0=[] ondisk+write+known_if_redirected e18866)
currently sub_op_commit_rec from 9
2017-10-13 21:22:11.238561 mon.ceph1 [WRN] Health check update: 24
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:22:04.834075 osd.5 [WRN] 1 slow requests, 1 included
below; oldest blocked for > 30.291869 secs
2017-10-13 21:22:04.834082 osd.5 [WRN] slow request 30.291869 seconds
old, received at 2017-10-13 21:21:34.542137:
osd_op(client.6104975.0:7331703 0.a2
0:4571f0f6:::rbd_data.1a24832ae8944a.000000000009c8ef:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
2934272~46592] snapc 0=[] ondisk+write+known_if_redirected e18866)
currently op_applied
2017-10-13 21:22:07.834445 osd.5 [WRN] 1 slow requests, 1 included
below; oldest blocked for > 30.421122 secs
2017-10-13 21:22:07.834452 osd.5 [WRN] slow request 30.421122 seconds
old, received at 2017-10-13 21:21:37.413260:
osd_op(client.6104975.0:7332411 0.a2
0:456218c9:::rbd_data.1a24832ae8944a.000000000009d21d:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
4068352~16384] snapc 0=[] ondisk+write+known_if_redirected e18866)
currently op_applied
2017-10-13 21:22:16.238929 mon.ceph1 [WRN] Health check update: 8 slow
requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:22:21.239234 mon.ceph1 [WRN] Health check update: 4 slow
requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 21:22:21.329402 mon.ceph1 [INF] Health check cleared:
REQUEST_SLOW (was: 4 slow requests are blocked > 32 sec)
2017-10-13 21:22:21.329490 mon.ceph1 [INF] Cluster is now healthy

So far, the following steps have been taken to attempt to resolve this:

1) Updated to Ubuntu 16.04.3 LTS and Ceph 12.2.1.

2) Changes to ceph.conf:
osd max scrubs = 1
osd scrub during recovery = false
osd deep scrub interval = 2592000
osd scrub max interval = 2592000
osd deep scrub randomize ratio = 0.0
osd disk thread ioprio priority = 7
osd disk thread ioprio class = idle
osd scrub sleep = 0.1

3) Kernel I/O Scheduler set to cfq.

4) Deep-scrub moved to cron, with a limit of one running at a time.
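
(A quick way to double-check that a running OSD actually picked up these
settings after restart, run on the OSD host; osd.5 is just an example id:)

ceph daemon osd.5 config show | grep -E 'osd_max_scrubs|osd_scrub|osd_deep_scrub|osd_disk_thread_ioprio'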

With these changes, scrubs now take 40-45 minutes to complete, up from
20-25, so the amount of time where there are client I/O issues has
actually gotten substantially worse.

To summarize the ceph cluster, it has five nodes.  Each node has
- Intel Xeon E5-1620 v3 3.5Ghz quad core CPU
- 64GiB DDR4 1866
- Intel SSD DC S3700 1GB divided into three partitions used for the
Bluestore blocks.db of each OSD
- Separate 64GB SSD for ceph monitor data & system image.
- 3 x 7200rpm drives (Seagate Constellation ES.3 4TB or Seagate
Enterprise Capacity 8TB)
- Dual Intel 10Gigabit NIC w/LACP

The SATA drives all check out healthy via smartctl and several are
either new and were tested right before insertion into this cluster,
or have been pulled for testing.  When tested on random operations,
they are by and large capable of 120-150 IOPS and about 30MB/sec
throughput at 100% utilization with response times of 5-7ms.

The CPUs are 75-90% idle.  The RAM is largely unused (~55GiB free).
The network is nearly idle (<50Mbps TX & RX, often <10Mbps).  The
blocks.db SSDs report 0% to 0.2% utilization.  The system/monitor SSD
reports 0-0.5% utilization.  The SATA drives report between 0 and 100%
utilization.

With one deep scrub only (client turned off), most of the drives sit at
negligible utilization, but three run at 60-100% utilization (all reads,
about 30MB/sec, so pretty much flat out), and those three match the
three OSDs that service the PG being deep-scrubbed.

With the client only (scrub script disabled, no scrub running), the
SATA disks run 10-30% utilization, doing 5-40 IOPS.  "ceph status"
every ten seconds over two minutes reports:
    client:   2196 kB/s rd, 10291 kB/s wr, 164 op/s rd, 202 op/s wr
    client:   1989 kB/s rd, 14932 kB/s wr, 136 op/s rd, 194 op/s wr
    client:   1569 kB/s rd, 9648 kB/s wr, 130 op/s rd, 158 op/s wr
    client:   1949 kB/s rd, 11310 kB/s wr, 176 op/s rd, 173 op/s wr
    client:   2515 kB/s rd, 6987 kB/s wr, 120 op/s rd, 134 op/s wr
    client:   1730 kB/s rd, 7516 kB/s wr, 114 op/s rd, 114 op/s wr
    client:   1745 kB/s rd, 7575 kB/s wr, 120 op/s rd, 123 op/s wr
    client:   1407 kB/s rd, 4975 kB/s wr, 129 op/s rd, 121 op/s wr
    client:   1161 kB/s rd, 5053 kB/s wr, 60 op/s rd, 152 op/s wr
    client:   2551 kB/s rd, 7975 kB/s wr, 92 op/s rd, 140 op/s wr
    client:   1310 kB/s rd, 10059 kB/s wr, 47 op/s rd, 171 op/s wr
    client:   1257 kB/s rd, 8717 kB/s wr, 60 op/s rd, 164 op/s wr
AVERAGE: 1782 kB/s rd, 8753 kB/s wr, 112 op/s rd, 154 op/s wr
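
(Samples like these can be collected with a simple poll along these lines:)

# Grab the client io line from "ceph status" every ten seconds for two minutes.
for i in $(seq 12); do ceph status | grep 'client:'; sleep 10; done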

With the client and one deep-scrub running, three disks run at 70-100%
I/O utilization and the rest at 0-30%.  "ceph status" every ten seconds
over two minutes reports:
    client:   380 kB/s rd, 770 kB/s wr, 9 op/s rd, 64 op/s wr
    client:   1158 kB/s rd, 739 kB/s wr, 55 op/s rd, 44 op/s wr
    client:   447 kB/s rd, 4486 kB/s wr, 20 op/s rd, 221 op/s wr
    client:   467 kB/s rd, 17906 B/s wr, 35 op/s rd, 3 op/s wr
    client:   507 kB/s rd, 462 kB/s wr, 19 op/s rd, 11 op/s wr
    client:   795 kB/s rd, 1553 kB/s wr, 14 op/s rd, 38 op/s wr
    client:   349 kB/s rd, 994 kB/s wr, 32 op/s rd, 84 op/s wr
    client:   1055 kB/s rd, 83983 B/s wr, 28 op/s rd, 9 op/s wr
    client:   493 kB/s rd, 363 kB/s wr, 20 op/s rd, 30 op/s wr
    client:   616 kB/s rd, 2113 kB/s wr, 28 op/s rd, 137 op/s wr
    client:   393 kB/s rd, 1253 kB/s wr, 42 op/s rd, 35 op/s wr
    client:   495 kB/s rd, 2653 kB/s wr, 53 op/s rd, 206 op/s wr
AVERAGE: 596 kB/s rd, 1290 kB/s wr, 30 op/s rd, 74 op/s wr

The client workload was the same in both cases.

So the performance impact of a single deep scrub on a single client is
67% less read throughput, 85% less write throughput, 73% fewer read
IOPS, and 52% fewer write IOPS.

(This is still a dramatic improvement over where we started; without
the ceph.conf tuning and with 2-3 scrubs running, it was fairly routine
for ceph -w to show no client read requests completing at all for over
a minute.)

But it seems like even with these settings, the deep scrub is still
capable of starving the client.

For example, with this PG being deep-scrubbed:

$ sudo ceph pg dump | fgrep scrub
dumped all
0.4c      10331                  0        0         0       0
43330306048 1566     1566 active+clean+scrubbing+deep 2017-10-13
22:57:48.482152  18866'9181844  18866:6689009   [6,10,5]          6
[6,10,5]              6  18765'9167507 2017-10-12 17:38:50.990794
18666'8968567 2017-09-29 01:28:42.558810

The affected OSDs are 6, 10, and 5.
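
(To map OSD ids like these back to hosts and physical disks for iostat,
something like the following works; the exact metadata fields vary a
little by release:)

ceph osd find 5
ceph osd metadata 5 | grep -E '"hostname"|"bluestore_bdev_partition_path"'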

When things go amiss:

2017-10-13 23:04:44.217383 mon.ceph1 [INF] Cluster is now healthy
2017-10-13 23:05:02.963973 mon.ceph1 [WRN] Health check failed: 3 slow
requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 23:05:08.445170 mon.ceph1 [WRN] Health check update: 19
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 23:05:14.705356 mon.ceph1 [WRN] Health check update: 34
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-13 23:05:20.884466 mon.ceph1 [INF] Health check cleared:
REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
2017-10-13 23:05:20.884522 mon.ceph1 [INF] Cluster is now healthy

Running "ceph health detail" repeatedly confirms that it is (all of)
the OSDs being scrubbed that are stalling:

HEALTH_OK

HEALTH_WARN 3 slow requests are blocked > 32 sec
REQUEST_SLOW 3 slow requests are blocked > 32 sec
    3 ops are blocked > 32.768 sec
    osd.5 has blocked requests > 32.768 sec

HEALTH_WARN 3 slow requests are blocked > 32 sec
REQUEST_SLOW 3 slow requests are blocked > 32 sec
    3 ops are blocked > 32.768 sec
    osd.5 has blocked requests > 32.768 sec

HEALTH_WARN 38 slow requests are blocked > 32 sec
REQUEST_SLOW 38 slow requests are blocked > 32 sec
    38 ops are blocked > 32.768 sec
    osds 5,6 have blocked requests > 32.768 sec

HEALTH_WARN 38 slow requests are blocked > 32 sec
REQUEST_SLOW 38 slow requests are blocked > 32 sec
    38 ops are blocked > 32.768 sec
    osds 5,6 have blocked requests > 32.768 sec

HEALTH_WARN 19 slow requests are blocked > 32 sec
REQUEST_SLOW 19 slow requests are blocked > 32 sec
    19 ops are blocked > 32.768 sec
    osds 5,6 have blocked requests > 32.768 sec

HEALTH_WARN 19 slow requests are blocked > 32 sec
REQUEST_SLOW 19 slow requests are blocked > 32 sec
    19 ops are blocked > 32.768 sec
    osds 5,6 have blocked requests > 32.768 sec

HEALTH_WARN 35 slow requests are blocked > 32 sec
REQUEST_SLOW 35 slow requests are blocked > 32 sec
    35 ops are blocked > 32.768 sec
    osds 5,6,10 have blocked requests > 32.768 sec

HEALTH_WARN 35 slow requests are blocked > 32 sec
REQUEST_SLOW 35 slow requests are blocked > 32 sec
    35 ops are blocked > 32.768 sec
    osds 5,6,10 have blocked requests > 32.768 sec

HEALTH_WARN 29 slow requests are blocked > 32 sec
REQUEST_SLOW 29 slow requests are blocked > 32 sec
    29 ops are blocked > 32.768 sec
    osds 6,10 have blocked requests > 32.768 sec

HEALTH_WARN 34 slow requests are blocked > 32 sec
REQUEST_SLOW 34 slow requests are blocked > 32 sec
    34 ops are blocked > 32.768 sec
    osds 6,10 have blocked requests > 32.768 sec

HEALTH_WARN 34 slow requests are blocked > 32 sec
REQUEST_SLOW 34 slow requests are blocked > 32 sec
    34 ops are blocked > 32.768 sec
    osds 6,10 have blocked requests > 32.768 sec

HEALTH_WARN 34 slow requests are blocked > 32 sec
REQUEST_SLOW 34 slow requests are blocked > 32 sec
    34 ops are blocked > 32.768 sec
    osds 6,10 have blocked requests > 32.768 sec

HEALTH_WARN 34 slow requests are blocked > 32 sec
REQUEST_SLOW 34 slow requests are blocked > 32 sec
    34 ops are blocked > 32.768 sec
    osds 6,10 have blocked requests > 32.768 sec

HEALTH_WARN 12 slow requests are blocked > 32 sec
REQUEST_SLOW 12 slow requests are blocked > 32 sec
    12 ops are blocked > 32.768 sec
    osd.10 has blocked requests > 32.768 sec

HEALTH_OK

A different sample taken later where ceph -w happened to spit out the
details of slow requests shows that the slowness follows the OSDs, not
the PG being deep-scrubbed:

$ sudo ceph pg dump | fgrep scrub
dumped all
0.1a3     10479                  0        0         0       0
43951919104 1564     1564 active+clean+scrubbing+deep 2017-10-14
02:13:06.046930  18866'9070989  18866:6495502  [14,9,15]         14
[14,9,15]             14  18866'9060892 2017-10-13 11:22:33.563810
18666'8855616 2017-09-29 08:14:06.935978

Although all of the affected requests shown are on an OSD
participating in a deep-scrub, none of the ones shown are for the PG
being scrubbed (0.1a3, in this case):

2017-10-14 02:30:58.876111 osd.15 [WRN] 54 slow requests, 5 included
below; oldest blocked for > 40.738962 secs
2017-10-14 02:30:58.876117 osd.15 [WRN] slow request 30.042803 seconds
old, received at 2017-10-14 02:30:28.833218:
osd_op(client.6210820.0:962249 0.167 0.6a309767 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:58.876121 osd.15 [WRN] slow request 30.350780 seconds
old, received at 2017-10-14 02:30:28.525241:
osd_op(client.6198346.0:931036 0.4f 0.66fafa4f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:58.876125 osd.15 [WRN] slow request 34.967224 seconds
old, received at 2017-10-14 02:30:23.908796:
osd_op(client.6210820.0:961833 0.167 0.f744d67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:58.876129 osd.15 [WRN] slow request 33.858779 seconds
old, received at 2017-10-14 02:30:25.017242:
osd_op(client.6210820.0:961949 0.4f 0.fb8a24f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:58.876152 osd.15 [WRN] slow request 32.730421 seconds
old, received at 2017-10-14 02:30:26.145599:
osd_op(client.6210820.0:962067 0.4f 0.2ea1fa4f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:59.876364 osd.15 [WRN] 47 slow requests, 5 included
below; oldest blocked for > 39.876936 secs
2017-10-14 02:30:59.876371 osd.15 [WRN] slow request 30.347226 seconds
old, received at 2017-10-14 02:30:29.529043:
osd_op(client.6198346.0:931188 0.12b 0.c13f692b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:59.876375 osd.15 [WRN] slow request 30.346784 seconds
old, received at 2017-10-14 02:30:29.529485:
osd_op(client.6198346.0:931189 0.12b 0.c13f692b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:59.876379 osd.15 [WRN] slow request 30.899518 seconds
old, received at 2017-10-14 02:30:28.976750:
osd_op(client.6198346.0:931114 0.4f 0.b02f084f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:59.876383 osd.15 [WRN] slow request 30.899370 seconds
old, received at 2017-10-14 02:30:28.976898:
osd_op(client.6198346.0:931115 0.4f 0.b02f084f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:30:59.876386 osd.15 [WRN] slow request 30.851203 seconds
old, received at 2017-10-14 02:30:29.025066:
osd_op(client.6198346.0:931116 0.4f 0.b02f084f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:00.876599 osd.15 [WRN] 52 slow requests, 5 included
below; oldest blocked for > 40.877172 secs
2017-10-14 02:31:00.876605 osd.15 [WRN] slow request 30.883711 seconds
old, received at 2017-10-14 02:30:29.992794:
osd_op(client.6198346.0:931211 0.36 0.b9aba836 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:00.876610 osd.15 [WRN] slow request 30.513765 seconds
old, received at 2017-10-14 02:30:30.362740:
osd_op(client.6198346.0:931274 0.a9 0.2d8d5aa9 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:00.876613 osd.15 [WRN] slow request 30.017060 seconds
old, received at 2017-10-14 02:30:30.859445:
osd_op(client.6210820.0:962360 0.a9 0.7db19aa9 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:00.876617 osd.15 [WRN] slow request 31.501782 seconds
old, received at 2017-10-14 02:30:29.374722:
osd_op(client.6198346.0:931183 0.167 0.c0db5367 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:00.876620 osd.15 [WRN] slow request 30.288990 seconds
old, received at 2017-10-14 02:30:30.587515:
osd_op(client.6210820.0:962354 0.4f 0.d04a024f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:01.876820 osd.15 [WRN] 54 slow requests, 5 included
below; oldest blocked for > 41.877395 secs
2017-10-14 02:31:01.876826 osd.15 [WRN] slow request 30.051551 seconds
old, received at 2017-10-14 02:30:31.825177:
osd_op(client.6198346.0:931372 0.a9 0.ae2afca9 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:01.876830 osd.15 [WRN] slow request 31.180720 seconds
old, received at 2017-10-14 02:30:30.696008:
osd_op(client.6198346.0:931303 0.4f 0.d960a04f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:01.876833 osd.15 [WRN] slow request 30.097627 seconds
old, received at 2017-10-14 02:30:31.779100:
osd_op(client.6198346.0:931371 0.a9 0.ae2afca9 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:01.876837 osd.15 [WRN] slow request 36.859324 seconds
old, received at 2017-10-14 02:30:25.017404:
osd_op(client.6210820.0:961950 0.4f 0.fb8a24f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:01.876840 osd.15 [WRN] slow request 36.106802 seconds
old, received at 2017-10-14 02:30:25.769926:
osd_op(client.6198346.0:930836 0.81 0.cdb24281 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:02.877030 osd.15 [WRN] 45 slow requests, 5 included
below; oldest blocked for > 42.630584 secs
2017-10-14 02:31:02.877037 osd.15 [WRN] slow request 30.497084 seconds
old, received at 2017-10-14 02:30:32.379855:
osd_op(client.6198346.0:931402 0.db 0.1b5228db (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:02.877041 osd.15 [WRN] slow request 30.045499 seconds
old, received at 2017-10-14 02:30:32.831441:
osd_op(client.6198346.0:931439 0.19e 0.bf928d9e (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:02.877045 osd.15 [WRN] slow request 30.045355 seconds
old, received at 2017-10-14 02:30:32.831584:
osd_op(client.6198346.0:931440 0.19e 0.bf928d9e (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:02.877048 osd.15 [WRN] slow request 33.937728 seconds
old, received at 2017-10-14 02:30:28.939211:
osd_op(client.6198346.0:931109 0.16c 0.ecf9016c (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:02.877051 osd.15 [WRN] slow request 39.956090 seconds
old, received at 2017-10-14 02:30:22.920850:
osd_op(client.6210820.0:961696 0.167 0.74cc8167 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:03.877265 osd.15 [WRN] 48 slow requests, 4 included
below; oldest blocked for > 43.630826 secs
2017-10-14 02:31:03.877271 osd.15 [WRN] slow request 30.265866 seconds
old, received at 2017-10-14 02:30:33.611316:
osd_op(client.6198346.0:931511 0.12b 0.f9f6932b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:03.877275 osd.15 [WRN] slow request 30.736755 seconds
old, received at 2017-10-14 02:30:33.140427:
osd_op(client.6198346.0:931461 0.81 0.67d8881 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:03.877279 osd.15 [WRN] slow request 30.068784 seconds
old, received at 2017-10-14 02:30:33.808398:
osd_op(client.6198346.0:931521 0.167 0.c39f1b67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:03.877283 osd.15 [WRN] slow request 34.937801 seconds
old, received at 2017-10-14 02:30:28.939380:
osd_op(client.6198346.0:931110 0.16c 0.ecf9016c (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:04.877468 osd.15 [WRN] 50 slow requests, 2 included
below; oldest blocked for > 44.631044 secs
2017-10-14 02:31:04.877474 osd.15 [WRN] slow request 30.100364 seconds
old, received at 2017-10-14 02:30:34.777036:
osd_op(client.6198346.0:931601 0.12b 0.824c6d2b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:04.877478 osd.15 [WRN] slow request 30.598084 seconds
old, received at 2017-10-14 02:30:34.279316:
osd_op(client.6210820.0:962473 0.1c1 0.d0030dc1 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:05.877693 osd.15 [WRN] 54 slow requests, 4 included
below; oldest blocked for > 45.631254 secs
2017-10-14 02:31:05.877699 osd.15 [WRN] slow request 30.613464 seconds
old, received at 2017-10-14 02:30:35.264146:
osd_op(client.6210820.0:962514 0.12b 0.d9b2c92b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:05.877703 osd.15 [WRN] slow request 30.564546 seconds
old, received at 2017-10-14 02:30:35.313064:
osd_op(client.6198346.0:931635 0.167 0.b6530d67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:05.877707 osd.15 [WRN] slow request 30.561430 seconds
old, received at 2017-10-14 02:30:35.316180:
osd_op(client.6210820.0:962520 0.12b 0.2bc1b12b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:05.877711 osd.15 [WRN] slow request 30.231697 seconds
old, received at 2017-10-14 02:30:35.645913:
osd_op(client.6198346.0:931670 0.19e 0.1063099e (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:06.877917 osd.15 [WRN] 45 slow requests, 4 included
below; oldest blocked for > 45.044786 secs
2017-10-14 02:31:06.877925 osd.15 [WRN] slow request 30.860692 seconds
old, received at 2017-10-14 02:30:36.017137:
osd_op(client.6198346.0:931696 0.167 0.a0e34b67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:06.877929 osd.15 [WRN] slow request 30.860517 seconds
old, received at 2017-10-14 02:30:36.017312:
osd_op(client.6198346.0:931697 0.167 0.a0e34b67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:06.877933 osd.15 [WRN] slow request 30.860432 seconds
old, received at 2017-10-14 02:30:36.017397:
osd_op(client.6198346.0:931698 0.167 0.a0e34b67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:06.877936 osd.15 [WRN] slow request 30.695287 seconds
old, received at 2017-10-14 02:30:36.182542:
osd_op(client.6210820.0:962573 0.14e 0.23c5354e (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:10.449873 mon.ceph1 [WRN] Health check update: 72
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-14 02:31:15.450202 mon.ceph1 [WRN] Health check update: 59
slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-10-14 02:31:08.878201 osd.15 [WRN] 48 slow requests, 3 included
below; oldest blocked for > 47.045085 secs
2017-10-14 02:31:08.878207 osd.15 [WRN] slow request 30.788227 seconds
old, received at 2017-10-14 02:30:38.089900:
osd_op(client.6210820.0:962712 0.144 0.9a2d1144 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:08.878212 osd.15 [WRN] slow request 30.788130 seconds
old, received at 2017-10-14 02:30:38.089997:
osd_op(client.6210820.0:962713 0.144 0.9a2d1144 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:08.878215 osd.15 [WRN] slow request 30.786906 seconds
old, received at 2017-10-14 02:30:38.091221:
osd_op(client.6210820.0:962747 0.1c1 0.60f887c1 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:09.878410 osd.15 [WRN] 42 slow requests, 1 included
below; oldest blocked for > 46.828631 secs
2017-10-14 02:31:09.878417 osd.15 [WRN] slow request 30.367803 seconds
old, received at 2017-10-14 02:30:39.510546:
osd_op(client.6210820.0:962753 0.16c 0.209b656c (undecoded)
ondisk+read+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:10.878600 osd.15 [WRN] 44 slow requests, 2 included
below; oldest blocked for > 47.828813 secs
2017-10-14 02:31:10.878625 osd.15 [WRN] slow request 30.023391 seconds
old, received at 2017-10-14 02:30:40.855140:
osd_op(client.6198346.0:931975 0.12b 0.96d4852b (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:10.878630 osd.15 [WRN] slow request 30.693944 seconds
old, received at 2017-10-14 02:30:40.184587:
osd_op(client.6198346.0:931947 0.15d 0.360d7d5d (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:11.878798 osd.15 [WRN] 45 slow requests, 1 included
below; oldest blocked for > 48.829020 secs
2017-10-14 02:31:11.878805 osd.15 [WRN] slow request 30.753634 seconds
old, received at 2017-10-14 02:30:41.125105:
osd_op(client.6198346.0:932010 0.1c1 0.80369dc1 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:12.878984 osd.15 [WRN] 47 slow requests, 2 included
below; oldest blocked for > 49.829201 secs
2017-10-14 02:31:12.878993 osd.15 [WRN] slow request 30.042094 seconds
old, received at 2017-10-14 02:30:42.836824:
osd_op(client.6210820.0:962766 0.14e 0.25cb5d4e (undecoded)
ondisk+read+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:12.879001 osd.15 [WRN] slow request 30.463578 seconds
old, received at 2017-10-14 02:30:42.415340:
osd_op(client.6198346.0:932067 0.db 0.895084db (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:13.879166 osd.15 [WRN] 48 slow requests, 1 included
below; oldest blocked for > 50.829393 secs
2017-10-14 02:31:13.879172 osd.15 [WRN] slow request 30.191760 seconds
old, received at 2017-10-14 02:30:43.687351:
osd_op(client.6198346.0:932120 0.19e 0.3e2de99e (undecoded)
ondisk+read+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:14.879389 osd.15 [WRN] 48 slow requests, 4 included
below; oldest blocked for > 50.970509 secs
2017-10-14 02:31:14.879396 osd.15 [WRN] slow request 30.787035 seconds
old, received at 2017-10-14 02:30:44.092271:
osd_op(client.6210820.0:962790 0.4f 0.925ece4f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:14.879401 osd.15 [WRN] slow request 30.461304 seconds
old, received at 2017-10-14 02:30:44.418002:
osd_op(client.6198346.0:932145 0.4f 0.2882a64f (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:14.879404 osd.15 [WRN] slow request 30.461200 seconds
old, received at 2017-10-14 02:30:44.418106:
osd_op(client.6198346.0:932146 0.19e 0.53a7419e (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg
2017-10-14 02:31:14.879408 osd.15 [WRN] slow request 30.787593 seconds
old, received at 2017-10-14 02:30:44.091713:
osd_op(client.6210820.0:962779 0.167 0.79816d67 (undecoded)
ondisk+write+known_if_redirected e18866) currently queued_for_pg

Even if it were possible to restrict the scrubs to low-usage hours,
running scrubs 8 hours a day at 45 minutes each works out to roughly
10-11 scrubs per day, so cycling through 512 PGs would take about 48
days.

There's no evidence of any kind pointing to a hardware issue.

So, what's going on here?  There's fairly obviously something
seriously wrong with this cluster (rather than with Ceph generally),
but where should I look to run this down further?

Thanks for any advice!

If you are storing small objects, I would recommend lowering the disk
read-ahead setting if it is currently larger, for example limiting it
to 128 KB:
echo 128 > /sys/block/sdX/queue/read_ahead_kb
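
To apply and verify that across all the spinning disks on a host at
once, something like:

for q in /sys/block/sd[a-z]/queue/read_ahead_kb; do echo 128 > "$q"; done
grep . /sys/block/sd[a-z]/queue/read_ahead_kb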

The other thing is to check/set that you are using cfq so the I/O
priority settings can actually be honored:
echo cfq > /sys/block/sdX/queue/scheduler
and double-check that the kernel is not using the newer blk-mq path for SCSI:
cat /proc/config.gz | gunzip | grep CONFIG_SCSI_MQ_DEFAULT
If this is set to y, you should disable it via the boot option
scsi_mod.use_blk_mq=0
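
On Ubuntu the kernel config is usually not exposed as /proc/config.gz,
so the same check can be done against the config file under /boot, and
the scheduler actually in use shows up in brackets:

grep CONFIG_SCSI_MQ_DEFAULT /boot/config-$(uname -r)
cat /sys/block/sdX/queue/scheduler     # active scheduler is the one in [brackets]

If cfq does not even appear in that list, the disk is most likely
already being driven by blk-mq.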

Lastly, I do not want to dwell too much on the scrub conf parameters,
since it seems there was earlier discussion of those, but I would also
look into the following (a runtime sketch follows the list):
osd_scrub_priority = 1
osd_scrub_sleep = 2
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_scrub_load_threshold = 3
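
These can be tried at runtime with injectargs before committing them to
ceph.conf; a sketch (some of the options may warn that an OSD restart is
needed to fully take effect):

ceph tell osd.* injectargs '--osd_scrub_priority 1 --osd_scrub_sleep 2'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_load_threshold 3'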

Maged

 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
