Ah right... this is where it gets interesting. You are probably hitting a
full cache on a PG somewhere, which is either making everything wait until
it flushes, or something like that. What cache settings have you got set? I
assume you have SSDs for the cache tier? Can you share the size of the
pool?

If possible, could you also create a non-tiered test pool and run some
benchmarks on it, to rule out any issue with the hardware and the OSDs?

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Lincoln Bryant
> Sent: 17 September 2015 17:54
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Ceph cluster NO read / write performance :: Ops are blocked
>
> Hi Nick,
>
> Thanks for responding. Yes, I am.
>
> —Lincoln
>
> > On Sep 17, 2015, at 11:53 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> > You are getting a fair amount of reads on the disks whilst doing these
> > writes. You're not using cache tiering, are you?
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Lincoln Bryant
> >> Sent: 17 September 2015 17:42
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: Ceph cluster NO read / write performance :: Ops are blocked
> >>
> >> Hello again,
> >>
> >> Well, I disabled offloads on the NIC -- that didn't work for me. I
> >> also tried setting net.ipv4.tcp_moderate_rcvbuf = 0, as suggested
> >> elsewhere in the thread, to no avail.
> >>
> >> Today I was watching iostat on an OSD box ('iostat -xm 5') when the
> >> cluster got into the "slow" state:
> >>
> >> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> >> sdb        0.00   13.57  84.23  167.47   0.45   2.78     26.26      2.06   8.18   3.85  96.93
> >> sdc        0.00   46.71   5.59  289.22   0.03   2.54     17.85      3.18  10.77   0.97  28.72
> >> sdd        0.00   16.57  45.11   91.62   0.25   0.55     12.01      0.75   5.51   2.45  33.47
> >> sde        0.00   13.57   6.99  143.31   0.03   2.53     34.97      1.99  13.27   2.12  31.86
> >> sdf        0.00   18.76   4.99  158.48   0.10   1.09     14.88      1.26   7.69   1.24  20.26
> >> sdg        0.00   25.55  81.64  237.52   0.44   2.89     21.36      4.14  12.99   2.58  82.22
> >> sdh        0.00   89.42  16.17  492.42   0.09   3.81     15.69     17.12  33.66   0.73  36.95
> >> sdi        0.00   20.16  17.76  189.62   0.10   1.67     17.46      3.45  16.63   1.57  32.55
> >> sdj        0.00   31.54   0.00  185.23   0.00   1.91     21.15      3.33  18.00   0.03   0.62
> >> sdk        0.00   26.15   2.40  133.33   0.01   0.84     12.79      1.07   7.87   0.85  11.58
> >> sdl        0.00   25.55   9.38  123.95   0.05   1.15     18.44      0.50   3.74   1.58  21.10
> >> sdm        0.00    6.39  92.61   47.11   0.47   0.26     10.65      1.27   9.07   6.92  96.73
> >>
> >> The %util is rather high on some disks, but I'm no expert at reading
> >> iostat, so I'm not sure how worrisome this is. Does anything here
> >> stand out to anyone?
> >>
> >> At the time of that iostat, Ceph was reporting a lot of blocked ops on
> >> the OSD associated with sde (as well as on about 30 other OSDs), but
> >> that disk doesn't look all that busy. Some simple 'dd' tests seem to
> >> indicate the disk is fine.
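> >>
> >> (By "simple dd tests" I mean direct-I/O reads of the raw device,
> >> something along these lines -- the device path and sizes here are just
> >> examples:
> >>
> >>   # dd if=/dev/sde of=/dev/null bs=4M count=256 iflag=direct
> >>
> >> iflag=direct bypasses the page cache, so the rate reported is the
> >> disk's own sequential read speed rather than the kernel's.)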
> >>
> >> Similarly, iotop seems OK on this host:
> >>
> >>     TID PRIO  USER   DISK READ  DISK WRITE  SWAPIN   IO>    COMMAND
> >>  472477 be/4  root    0.00 B/s    5.59 M/s  0.00 %  0.57 %  ceph-osd -i 111 --pid-file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph
> >>  470621 be/4  root    0.00 B/s   10.09 M/s  0.00 %  0.40 %  ceph-osd -i 111 --pid-file /var/run/ceph/osd.111.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3495447 be/4  root    0.00 B/s  272.19 K/s  0.00 %  0.36 %  ceph-osd -i 114 --pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488389 be/4  root    0.00 B/s  596.80 K/s  0.00 %  0.16 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488060 be/4  root    0.00 B/s  600.83 K/s  0.00 %  0.15 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3505573 be/4  root    0.00 B/s  528.25 K/s  0.00 %  0.10 %  ceph-osd -i 119 --pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3495434 be/4  root    0.00 B/s    2.02 K/s  0.00 %  0.10 %  ceph-osd -i 114 --pid-file /var/run/ceph/osd.114.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3502327 be/4  root    0.00 B/s  506.07 K/s  0.00 %  0.09 %  ceph-osd -i 118 --pid-file /var/run/ceph/osd.118.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3489100 be/4  root    0.00 B/s  106.86 K/s  0.00 %  0.09 %  ceph-osd -i 110 --pid-file /var/run/ceph/osd.110.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3496631 be/4  root    0.00 B/s  229.85 K/s  0.00 %  0.05 %  ceph-osd -i 115 --pid-file /var/run/ceph/osd.115.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3505561 be/4  root    0.00 B/s    2.02 K/s  0.00 %  0.03 %  ceph-osd -i 119 --pid-file /var/run/ceph/osd.119.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488059 be/4  root    0.00 B/s    2.02 K/s  0.00 %  0.03 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488391 be/4  root   46.37 K/s  431.47 K/s  0.00 %  0.02 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3500639 be/4  root    0.00 B/s  221.78 K/s  0.00 %  0.02 %  ceph-osd -i 117 --pid-file /var/run/ceph/osd.117.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488392 be/4  root   34.28 K/s  185.49 K/s  0.00 %  0.02 %  ceph-osd -i 109 --pid-file /var/run/ceph/osd.109.pid -c /etc/ceph/ceph.conf --cluster ceph
> >> 3488062 be/4  root    4.03 K/s   66.54 K/s  0.00 %  0.02 %  ceph-osd -i 108 --pid-file /var/run/ceph/osd.108.pid -c /etc/ceph/ceph.conf --cluster ceph
> >>
> >> These are all 6TB Seagates in single-disk RAID 0 on a PERC H730 Mini
> >> controller.
> >>
> >> I did try removing the disk with 20k non-medium errors, but that
> >> didn't seem to help.
> >>
> >> Thanks for any insight!
> >>
> >> Cheers,
> >> Lincoln Bryant
> >>
> >>> On Sep 9, 2015, at 1:09 PM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> I'll take a look at all of those things and report back (hopefully :))
> >>>
> >>> I did try setting all of my OSDs to writethrough instead of writeback
> >>> on the controller, which was significantly more consistent in
> >>> performance (down from 1100MB/s to 300MB/s, but still occasionally
> >>> dropping to 0MB/s). Still plenty of blocked ops.
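> >>>
> >>> (While it's wedged, it's also worth asking one of the stuck OSDs what
> >>> its blocked ops are actually waiting on, via the admin socket on the
> >>> OSD host -- the osd id below is just an example:
> >>>
> >>>   # ceph daemon osd.111 dump_ops_in_flight
> >>>   # ceph daemon osd.111 dump_historic_ops
> >>>
> >>> dump_historic_ops records a timestamp for each stage of recent slow
> >>> ops, which shows whether the time went to the journal, the disk, or
> >>> waiting on sub-ops from other OSDs.)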
> >>>
> >>> I was wondering if a not-so-nicely failing OSD might be the cause.
> >>> My controller (PERC H730 Mini) is frustratingly terse with SMART
> >>> information, but at least one disk has a "Non-medium error count" of
> >>> over 20,000.
> >>>
> >>> I'll try disabling offloads as well.
> >>>
> >>> Thanks much for the suggestions!
> >>>
> >>> Cheers,
> >>> Lincoln
> >>>
> >>>> On Sep 9, 2015, at 3:59 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Just to recapitulate -- the nodes are doing "nothing" when it drops
> >>>> to zero? Not flushing something to the drives (iostat)? Not cleaning
> >>>> pagecache (kswapd and similar)? Not out of any type of memory (slab,
> >>>> min_free_kbytes)? No network link errors, no bad checksums (those
> >>>> are hard to spot, though)?
> >>>>
> >>>> Unless you find something, I suggest you try disabling offloads on
> >>>> the NICs and see if the problem goes away.
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 08 Sep 2015, at 18:26, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> For whatever it's worth, my problem has returned and is very
> >>>>> similar to yours. Still trying to figure out what's going on over
> >>>>> here.
> >>>>>
> >>>>> Performance is nice for a few seconds, then goes to 0. This is a
> >>>>> similar setup to yours (12 OSDs per box, Scientific Linux 6, Ceph
> >>>>> 0.94.3, etc.)
> >>>>>
> >>>>>  384  16  29520  29504  307.287  1188  0.0492006  0.208259
> >>>>>  385  16  29813  29797  309.532  1172  0.0469708  0.206731
> >>>>>  386  16  30105  30089  311.756  1168  0.0375764  0.205189
> >>>>>  387  16  30401  30385  314.009  1184   0.036142  0.203791
> >>>>>  388  16  30695  30679  316.231  1176  0.0372316  0.202355
> >>>>>  389  16  30987  30971   318.42  1168  0.0660476  0.200962
> >>>>>  390  16  31282  31266  320.628  1180  0.0358611  0.199548
> >>>>>  391  16  31568  31552  322.734  1144  0.0405166  0.198132
> >>>>>  392  16  31857  31841  324.859  1156  0.0360826  0.196679
> >>>>>  393  16  32090  32074  326.404   932  0.0416869   0.19549
> >>>>>  394  16  32205  32189  326.743   460  0.0251877  0.194896
> >>>>>  395  16  32302  32286  326.897   388  0.0280574  0.194395
> >>>>>  396  16  32348  32332  326.537   184  0.0256821  0.194157
> >>>>>  397  16  32385  32369  326.087   148  0.0254342  0.193965
> >>>>>  398  16  32424  32408  325.659   156  0.0263006  0.193763
> >>>>>  399  16  32445  32429  325.054    84  0.0233839  0.193655
> >>>>> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655
> >>>>>  sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> >>>>>  400  16  32445  32429  324.241  0  -  0.193655
> >>>>>  401  16  32445  32429  323.433  0  -  0.193655
> >>>>>  402  16  32445  32429  322.628  0  -  0.193655
> >>>>>  403  16  32445  32429  321.828  0  -  0.193655
> >>>>>  404  16  32445  32429  321.031  0  -  0.193655
> >>>>>  405  16  32445  32429  320.238  0  -  0.193655
> >>>>>  406  16  32445  32429   319.45  0  -  0.193655
> >>>>>  407  16  32445  32429  318.665  0  -  0.193655
> >>>>>
> >>>>> Needless to say, very strange.
> >>>>>
> >>>>> —Lincoln
> >>>>>
> >>>>>> On Sep 7, 2015, at 3:35 PM, Vickey Singh <vickey.singh22693@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Adding ceph-users.
> >>>>>>
> >>>>>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh <vickey.singh22693@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke <ulembke@xxxxxxxxxxxx> wrote:
> >>>>>> Hi Vickey,
> >>>>>>
> >>>>>> Thanks for your time in replying to my problem.
> >>>>>>
> >>>>>> I had the same rados bench output after changing the motherboard of
> >>>>>> the monitor node with the lowest IP... Due to the new mainboard, I
> >>>>>> assume the hw-clock was wrong during startup. Ceph health showed no
> >>>>>> errors, but none of the VMs were able to do IO (very high load on
> >>>>>> the VMs -- but no traffic). I stopped that mon, but it didn't
> >>>>>> change anything. I had to restart all the other mons to get IO
> >>>>>> again. After that I started the first mon as well (with the right
> >>>>>> time now) and everything worked fine again...
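> >>>>>>
> >>>>>> (Mon clock trouble like that is quick to confirm: a skewed mon
> >>>>>> shows up in the health output, and NTP can be checked on each mon
> >>>>>> node, e.g.:
> >>>>>>
> >>>>>>   # ceph health detail | grep -i "clock skew"
> >>>>>>   # ntpq -p
> >>>>>>
> >>>>>> Skew beyond mon_clock_drift_allowed -- 0.05s by default -- raises a
> >>>>>> HEALTH_WARN.)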
> >>>>>>
> >>>>>> Thanks, I will try restarting all OSDs / MONs and report back if it
> >>>>>> solves my problem.
> >>>>>>
> >>>>>> Another possibility: do you use journals on SSDs? Perhaps the SSDs
> >>>>>> can't keep up because of garbage collection?
> >>>>>>
> >>>>>> No, I don't have journals on SSD; they are on the same disk as the
> >>>>>> OSD.
> >>>>>>
> >>>>>> Udo
> >>>>>>
> >>>>>> On 07.09.2015 16:36, Vickey Singh wrote:
> >>>>>>> Dear Experts,
> >>>>>>>
> >>>>>>> Can someone please help me figure out why my cluster is not able
> >>>>>>> to write data? See the output below: cur MB/s is 0 and avg MB/s
> >>>>>>> keeps decreasing.
> >>>>>>>
> >>>>>>> Ceph Hammer 0.94.2
> >>>>>>> CentOS 6 (3.10.69-1)
> >>>>>>>
> >>>>>>> The Ceph status says ops are blocked. I have tried checking
> >>>>>>> everything I know:
> >>>>>>>
> >>>>>>> - System resources (CPU, net, disk, memory) -- all normal
> >>>>>>> - 10G network for public and cluster network -- no saturation
> >>>>>>> - All disks are physically healthy
> >>>>>>> - No messages in /var/log/messages or dmesg
> >>>>>>> - Tried restarting the OSDs that are blocking operations -- no luck
> >>>>>>> - Tried writing through both RBD and rados bench -- same problem
> >>>>>>>
> >>>>>>> Please help me to fix this problem.
> >>>>>>>
> >>>>>>> # rados bench -p rbd 60 write
> >>>>>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60
> >>>>>>> seconds or 0 objects
> >>>>>>> Object prefix: benchmark_data_stor1_1791844
> >>>>>>>  sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> >>>>>>>    0   0    0    0        0    0         -          0
> >>>>>>>    1  16  125  109  435.873  436  0.022076  0.0697864
> >>>>>>>    2  16  139  123  245.948   56  0.246578  0.0674407
> >>>>>>>    3  16  139  123  163.969    0  -  0.0674407
> >>>>>>>    4  16  139  123  122.978    0  -  0.0674407
> >>>>>>>    5  16  139  123   98.383    0  -  0.0674407
> >>>>>>>    6  16  139  123  81.9865    0  -  0.0674407
> >>>>>>>    7  16  139  123  70.2747    0  -  0.0674407
> >>>>>>>    8  16  139  123  61.4903    0  -  0.0674407
> >>>>>>>    9  16  139  123  54.6582    0  -  0.0674407
> >>>>>>>   10  16  139  123  49.1924    0  -  0.0674407
> >>>>>>>   11  16  139  123  44.7201    0  -  0.0674407
> >>>>>>>   12  16  139  123  40.9934    0  -  0.0674407
> >>>>>>>   13  16  139  123  37.8401    0  -  0.0674407
> >>>>>>>   14  16  139  123  35.1373    0  -  0.0674407
> >>>>>>>   15  16  139  123  32.7949    0  -  0.0674407
> >>>>>>>   16  16  139  123  30.7451    0  -  0.0674407
> >>>>>>>   17  16  139  123  28.9364    0  -  0.0674407
> >>>>>>>   18  16  139  123  27.3289    0  -  0.0674407
> >>>>>>>   19  16  139  123  25.8905    0  -  0.0674407
> >>>>>>> 2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
> >>>>>>>  sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> >>>>>>>   20  16  139  123   24.596    0  -  0.0674407
> >>>>>>>   21  16  139  123  23.4247    0  -  0.0674407
> >>>>>>>   22  16  139  123    22.36    0  -  0.0674407
> >>>>>>>   23  16  139  123  21.3878    0  -  0.0674407
> >>>>>>>   24  16  139  123  20.4966    0  -  0.0674407
> >>>>>>>   25  16  139  123  19.6768    0  -  0.0674407
> >>>>>>>   26  16  139  123    18.92    0  -  0.0674407
> >>>>>>>   27  16  139  123  18.2192    0  -  0.0674407
> >>>>>>>   28  16  139  123  17.5686    0  -  0.0674407
> >>>>>>>   29  16  139  123  16.9628    0  -  0.0674407
> >>>>>>>   30  16  139  123  16.3973    0  -  0.0674407
> >>>>>>>   31  16  139  123  15.8684    0  -  0.0674407
> >>>>>>>   32  16  139  123  15.3725    0  -  0.0674407
> >>>>>>>   33  16  139  123  14.9067    0  -  0.0674407
> >>>>>>>   34  16  139  123  14.4683    0  -  0.0674407
> >>>>>>>   35  16  139  123  14.0549    0  -  0.0674407
> >>>>>>>   36  16  139  123  13.6645    0  -  0.0674407
> >>>>>>>   37  16  139  123  13.2952    0  -  0.0674407
> >>>>>>>   38  16  139  123  12.9453    0  -  0.0674407
> >>>>>>>   39  16  139  123  12.6134    0  -  0.0674407
> >>>>>>> 2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
> >>>>>>>  sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> >>>>>>>   40  16  139  123  12.2981    0  -  0.0674407
> >>>>>>>   41  16  139  123  11.9981    0  -  0.0674407
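> >>>>>>>
> >>>>>>> (Two stock commands that help narrow "requests are blocked" down
> >>>>>>> to specific hardware, nothing cluster-specific assumed:
> >>>>>>>
> >>>>>>>   # ceph health detail
> >>>>>>>   # ceph osd perf
> >>>>>>>
> >>>>>>> health detail names the OSDs the blocked requests are sitting on,
> >>>>>>> and osd perf lists per-OSD commit/apply latencies -- a single
> >>>>>>> outlier there usually means a single bad disk.)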
> >>>>>>>
> >>>>>>>   cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
> >>>>>>>    health HEALTH_WARN
> >>>>>>>           1 requests are blocked > 32 sec
> >>>>>>>    monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
> >>>>>>>           election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
> >>>>>>>    osdmap e19536: 50 osds: 50 up, 50 in
> >>>>>>>     pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
> >>>>>>>           91513 GB used, 47642 GB / 135 TB avail
> >>>>>>>               2752 active+clean
> >>>>>>>
> >>>>>>> Tried using RBD:
> >>>>>>>
> >>>>>>> # dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
> >>>>>>> 10000+0 records in
> >>>>>>> 10000+0 records out
> >>>>>>> 40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s
> >>>>>>>
> >>>>>>> # dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
> >>>>>>> 100+0 records in
> >>>>>>> 100+0 records out
> >>>>>>> 104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s
> >>>>>>>
> >>>>>>> # dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
> >>>>>>> 1+0 records in
> >>>>>>> 1+0 records out
> >>>>>>> 1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
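
For the checks suggested at the top of the thread, a rough sketch -- the
pool names and PG count below are examples only, adjust for the cluster.
First, what the cache tier is allowed to hold before it must flush or
evict (a full cache pool will block writes until it flushes):

  # ceph osd pool get <cachepool> target_max_bytes
  # ceph osd pool get <cachepool> target_max_objects
  # ceph osd pool get <cachepool> cache_target_dirty_ratio
  # ceph osd pool get <cachepool> cache_target_full_ratio

Pool sizes:

  # ceph df

And a throwaway non-tiered pool to benchmark the base OSDs directly:

  # ceph osd pool create benchtest 128 128
  # rados bench -p benchtest 60 write --no-cleanup
  # rados bench -p benchtest 60 seq
  # rados -p benchtest cleanup
  # ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it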