On Mon, Sep 7, 2015 at 7:39 PM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
Hi Vickey,
Thanks a lot for replying about my problem.
I had this exact same problem last week, resolved by rebooting all of my OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in my case it's due to a failing controller on a particular box I've had trouble with in the past.
Mine is a 5-node cluster with 12 OSDs per node, and in the past there have never been any hardware problems.
I tried setting 'noout', stopping my OSDs one host at a time, then rerunning RADOS bench in between to see if I could nail down the problematic machine. Depending on your # of hosts, this might work for you. Admittedly, I got impatient with this approach, though, and just ended up restarting everything (which worked!) :)
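For reference, the rough sequence looked something like this (from memory, so treat it as a sketch; the service commands are the sysvinit form Hammer uses on EL6-style boxes, adjust for your init system):

# ceph osd set noout
# service ceph stop osd            (on one OSD host at a time)
# rados bench -p rbd 60 write      (from a client, once 'ceph -s' settles)
# service ceph start osd           (bring that host back before moving to the next one)
# ceph osd unset noout             (when you're done)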
So do you mean you intentionally brought one node's OSDs down, so that some OSDs were down but none of them were out (noout)? Then you waited for some time for the cluster to become healthy, and then you reran rados bench?
If you have a bunch of blocked ops, you could maybe try a 'pg query' on the PGs involved and see if there's a common OSD with all of your blocked ops. In my experience, it's not necessarily the one reporting.
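Something along these lines should do it (the PG id below is just an example, substitute one from your blocked ops):

# ceph health detail                                (lists which OSDs have blocked requests)
# ceph pg 3.1f query | grep -A 5 acting             (shows the acting set for that PG)
# ceph osd perf                                     (look for OSDs with unusually high commit/apply latency)
# grep 'slow request' /var/log/ceph/ceph-osd.*.log  (on the OSD hosts themselves)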
Yeah, I have 55 OSDs, and every time a random OSD shows blocked ops, so I can't blame any specific OSD. After a few minutes that blocked OSD becomes clean, and after some time some other OSD blocks ops.
Thanks, I will try restarting all OSD / monitor daemons and see if that fixes it. Is there anything I need to keep in mind when restarting the OSDs (except nodown, noout)?
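Here is roughly what I am planning to run, one node at a time (just a sketch for sysvinit on CentOS 6, please correct me if something is wrong):

# ceph osd set noout
# ceph osd set nodown
# service ceph restart osd        (on each OSD node, waiting for 'ceph -s' to settle in between)
# service ceph restart mon        (on each monitor node)
# ceph osd unset nodown
# ceph osd unset noout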
Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as well. I've seen a NIC appear to be happy (no messages in dmesg, the machine appears to be communicating normally, etc.), but when I went to iperf it, I was getting super pitiful performance (like KB/s). I don't know what kind of NICs you're using, but you may want to iperf everything just in case.
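A quick pairwise test between hosts is usually enough to catch that, something like (iperf2 syntax; adjust runtime and stream count to taste):

# iperf -s                          (on the host being tested)
# iperf -c <server-ip> -t 30 -P 4   (from each of the other hosts in turn)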
Yeah, I did that; iperf shows no problem.
Is there anything else I should do?
--Lincoln
On 9/7/2015 9:36 AM, Vickey Singh wrote:
Dear Experts,

Can someone please help me figure out why my cluster is not able to write data? See the output below: cur MB/s is 0 and avg MB/s keeps decreasing.

Ceph Hammer 0.94.2
CentOS 6 (3.10.69-1)

The Ceph status says OPS are blocked. I have tried checking everything I know of:

- System resources (CPU, net, disk, memory) -- all normal
- 10G network for public and cluster network -- no saturation
- All disks are physically healthy
- No messages in /var/log/messages or dmesg
- Tried restarting the OSDs which are blocking operations, but no luck
- Tried writing through RBD and rados bench; both give the same problem

Please help me to fix this problem.

# rados bench -p rbd 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
 Object prefix: benchmark_data_stor1_1791844
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       125       109   435.873       436  0.022076 0.0697864
     2      16       139       123   245.948        56  0.246578 0.0674407
     3      16       139       123   163.969         0         - 0.0674407
     4      16       139       123   122.978         0         - 0.0674407
     5      16       139       123    98.383         0         - 0.0674407
     6      16       139       123   81.9865         0         - 0.0674407
     7      16       139       123   70.2747         0         - 0.0674407
     8      16       139       123   61.4903         0         - 0.0674407
     9      16       139       123   54.6582         0         - 0.0674407
    10      16       139       123   49.1924         0         - 0.0674407
    11      16       139       123   44.7201         0         - 0.0674407
    12      16       139       123   40.9934         0         - 0.0674407
    13      16       139       123   37.8401         0         - 0.0674407
    14      16       139       123   35.1373         0         - 0.0674407
    15      16       139       123   32.7949         0         - 0.0674407
    16      16       139       123   30.7451         0         - 0.0674407
    17      16       139       123   28.9364         0         - 0.0674407
    18      16       139       123   27.3289         0         - 0.0674407
    19      16       139       123   25.8905         0         - 0.0674407
2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16       139       123    24.596         0         - 0.0674407
    21      16       139       123   23.4247         0         - 0.0674407
    22      16       139       123     22.36         0         - 0.0674407
    23      16       139       123   21.3878         0         - 0.0674407
    24      16       139       123   20.4966         0         - 0.0674407
    25      16       139       123   19.6768         0         - 0.0674407
    26      16       139       123     18.92         0         - 0.0674407
    27      16       139       123   18.2192         0         - 0.0674407
    28      16       139       123   17.5686         0         - 0.0674407
    29      16       139       123   16.9628         0         - 0.0674407
    30      16       139       123   16.3973         0         - 0.0674407
    31      16       139       123   15.8684         0         - 0.0674407
    32      16       139       123   15.3725         0         - 0.0674407
    33      16       139       123   14.9067         0         - 0.0674407
    34      16       139       123   14.4683         0         - 0.0674407
    35      16       139       123   14.0549         0         - 0.0674407
    36      16       139       123   13.6645         0         - 0.0674407
    37      16       139       123   13.2952         0         - 0.0674407
    38      16       139       123   12.9453         0         - 0.0674407
    39      16       139       123   12.6134         0         - 0.0674407
2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    40      16       139       123   12.2981         0         - 0.0674407
    41      16       139       123   11.9981         0         - 0.0674407

    cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
            election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
     osdmap e19536: 50 osds: 50 up, 50 in
      pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
            91513 GB used, 47642 GB / 135 TB avail
                2752 active+clean

Tried using RBD:

# dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s

# dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s

# dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com