Re: Ceph cluster NO read / write performance :: Ops are blocked

Hi Vickey,

I had this exact same problem last week; it was resolved by rebooting all of my OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in my case it was due to a failing controller on a particular box I've had trouble with in the past.

I tried setting 'noout', stopping my OSDs one host at a time, and rerunning RADOS bench after each stop to see if I could nail down the problematic machine. Depending on your number of hosts, this might work for you. Admittedly, I got impatient with this approach and just ended up restarting everything (which worked!) :)
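
In case it's useful, this is roughly the procedure I mean. It assumes the sysvinit init script that Hammer uses on CentOS 6, and reuses the rados bench invocation from your mail:

# ceph osd set noout              (so the stopped OSDs aren't marked out and rebalanced)
# service ceph stop osd           (on one OSD host at a time; stops every OSD on that host)
# rados bench -p rbd 60 write     (from a client, to see whether throughput recovers)
# service ceph start osd          (bring that host's OSDs back before moving to the next)
# ceph osd unset noout            (once you're done)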

If you have a bunch of blocked ops, you could maybe try a 'pg query' on the PGs involved and see if there's a common OSD across all of your blocked ops. In my experience, it's not necessarily the one doing the reporting.
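
A rough sketch of what I mean (the PG id here is just a placeholder; use the ones 'ceph health detail' actually reports):

# ceph health detail              (lists the PGs and OSDs with blocked/slow requests)
# ceph pg 3.1f query              (for each affected PG; compare the 'up' and 'acting' sets and look for an OSD they all share)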

Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as well. I've seen a NIC appear to be happy (no messages in dmesg, machine appears to be communicating normally, etc.), but when I went to iperf it, I was getting super pitiful performance (like KB/s). I don't know what kind of NICs you're using, but you may want to iperf everything just in case.
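
Something like this between every pair of storage nodes, assuming iperf is installed (the hostname is just an example taken from your mon map):

# iperf -s                        (on one node)
# iperf -c stor0111 -t 30         (from each other node in turn; on a 10G link anything far below ~9 Gbit/s is suspect)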

--Lincoln

On 9/7/2015 9:36 AM, Vickey Singh wrote:
Dear Experts

Can someone please help me figure out why my cluster is not able to write data?

See the output below: cur MB/s is 0 and avg MB/s keeps decreasing.


Ceph Hammer  0.94.2
CentOS 6 (kernel 3.10.69-1)

The Ceph status says ops are blocked. I have already checked everything I know:

- System resources (CPU, network, disk, memory)  -- all normal
- 10G network for both public and cluster traffic  -- no saturation
- All disks are physically healthy
- No messages in /var/log/messages or dmesg
- Tried restarting the OSDs that are blocking operations (rough commands after this list), but no luck
- Tried writing through both RBD and rados bench; both show the same problem
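
For reference, the restart attempt was roughly along these lines (osd.12 is only an example id; this is the sysvinit script CentOS 6 uses):

# ceph health detail            (shows which OSDs the blocked requests are reported against)
# service ceph restart osd.12   (run on the host that owns that OSD)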

Please help me to fix this problem.

#  rados bench -p rbd 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
 Object prefix: benchmark_data_stor1_1791844
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       125       109   435.873       436  0.022076 0.0697864
     2      16       139       123   245.948        56  0.246578 0.0674407
     3      16       139       123   163.969         0         - 0.0674407
     4      16       139       123   122.978         0         - 0.0674407
     5      16       139       123    98.383         0         - 0.0674407
     6      16       139       123   81.9865         0         - 0.0674407
     7      16       139       123   70.2747         0         - 0.0674407
     8      16       139       123   61.4903         0         - 0.0674407
     9      16       139       123   54.6582         0         - 0.0674407
    10      16       139       123   49.1924         0         - 0.0674407
    11      16       139       123   44.7201         0         - 0.0674407
    12      16       139       123   40.9934         0         - 0.0674407
    13      16       139       123   37.8401         0         - 0.0674407
    14      16       139       123   35.1373         0         - 0.0674407
    15      16       139       123   32.7949         0         - 0.0674407
    16      16       139       123   30.7451         0         - 0.0674407
    17      16       139       123   28.9364         0         - 0.0674407
    18      16       139       123   27.3289         0         - 0.0674407
    19      16       139       123   25.8905         0         - 0.0674407
2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16       139       123    24.596         0         - 0.0674407
    21      16       139       123   23.4247         0         - 0.0674407
    22      16       139       123     22.36         0         - 0.0674407
    23      16       139       123   21.3878         0         - 0.0674407
    24      16       139       123   20.4966         0         - 0.0674407
    25      16       139       123   19.6768         0         - 0.0674407
    26      16       139       123     18.92         0         - 0.0674407
    27      16       139       123   18.2192         0         - 0.0674407
    28      16       139       123   17.5686         0         - 0.0674407
    29      16       139       123   16.9628         0         - 0.0674407
    30      16       139       123   16.3973         0         - 0.0674407
    31      16       139       123   15.8684         0         - 0.0674407
    32      16       139       123   15.3725         0         - 0.0674407
    33      16       139       123   14.9067         0         - 0.0674407
    34      16       139       123   14.4683         0         - 0.0674407
    35      16       139       123   14.0549         0         - 0.0674407
    36      16       139       123   13.6645         0         - 0.0674407
    37      16       139       123   13.2952         0         - 0.0674407
    38      16       139       123   12.9453         0         - 0.0674407
    39      16       139       123   12.6134         0         - 0.0674407
2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    40      16       139       123   12.2981         0         - 0.0674407
    41      16       139       123   11.9981         0         - 0.0674407




    cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
            election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
     osdmap e19536: 50 osds: 50 up, 50 in
      pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
            91513 GB used, 47642 GB / 135 TB avail
                2752 active+clean


Tried writing through RBD as well; dd with direct I/O shows the same poor throughput:


# dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s

# dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s

# dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

