Hi Vickey,
I had this exact same problem last week, and resolved it by rebooting all of my OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in my case it's due to a failing controller on a particular box I've had trouble with in the past.

I tried setting 'noout', stopping my OSDs one host at a time, then rerunning RADOS bench in between to see if I could nail down the problematic machine. Depending on your number of hosts, this might work for you; I've sketched the rough sequence below. Admittedly, I got impatient with this approach and just ended up restarting everything (which worked!) :)

If you have a bunch of blocked ops, you could maybe try a 'pg query' on the PGs involved and see if there's a common OSD across all of your blocked ops. In my experience, it's not necessarily the one reporting.

Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as well. I've seen a NIC appear to be happy (no messages in dmesg, the machine appears to be communicating normally, etc.) but when I went to iperf it, I was getting super pitiful performance (like KB/s). I don't know what kind of NICs you're using, but you may want to iperf everything just in case.
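For reference, the sequence I used looked roughly like the following. The service commands are just illustrative (sysvinit-style, which should match CentOS 6); adjust them to however your OSDs are actually managed:

# ceph osd set noout              # keep OSDs from being marked out / triggering rebalance while they're down
# service ceph stop osd           # on one OSD host at a time; systemd setups use the ceph-osd units instead
# rados bench -p rbd 60 write     # rerun the benchmark with that host's OSDs down
# service ceph start osd          # bring the host back before moving on to the next one
# ceph osd unset noout            # once you're done testing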
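To dig into the blocked ops themselves, something along these lines should show whether one OSD keeps turning up (the osd id and pg id are placeholders):

# ceph health detail                        # lists the blocked requests and which OSDs are reporting them
# ceph daemon osd.<id> dump_ops_in_flight   # run on that OSD's host; shows what the slow ops are waiting on
# ceph pg <pgid> query                      # for a PG named in the slow ops; check which OSDs are in its acting set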
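And for the NICs, a plain iperf pass between each pair of hosts is usually enough to catch the kind of KB/s-level throughput I was seeing (the hostname is a placeholder); run it over both the public and cluster networks:

# iperf -s                      # on one storage node
# iperf -c <other-node> -t 30   # from each of the other nodes in turn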
--Lincoln

On 9/7/2015 9:36 AM, Vickey Singh wrote:

Dear Experts,

Can someone please help me figure out why my cluster is not able to write data? See the output below: cur MB/s is 0 and avg MB/s keeps decreasing.

Ceph Hammer 0.94.2, CentOS 6 (kernel 3.10.69-1)

The Ceph status says ops are blocked. I have tried checking everything I know:
- System resources (CPU, net, disk, memory) -- all normal
- 10G network for public and cluster network -- no saturation
- All disks are physically healthy
- No messages in /var/log/messages or dmesg
- Tried restarting the OSDs that are blocking operations, but no luck
- Tried writing through RBD and rados bench, both give the same problem

Please help me to fix this problem.

# rados bench -p rbd 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
 Object prefix: benchmark_data_stor1_1791844
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16       125       109   435.873       436  0.022076 0.0697864
    2      16       139       123   245.948        56  0.246578 0.0674407
    3      16       139       123   163.969         0         - 0.0674407
    4      16       139       123   122.978         0         - 0.0674407
    5      16       139       123    98.383         0         - 0.0674407
    6      16       139       123   81.9865         0         - 0.0674407
    7      16       139       123   70.2747         0         - 0.0674407
    8      16       139       123   61.4903         0         - 0.0674407
    9      16       139       123   54.6582         0         - 0.0674407
   10      16       139       123   49.1924         0         - 0.0674407
   11      16       139       123   44.7201         0         - 0.0674407
   12      16       139       123   40.9934         0         - 0.0674407
   13      16       139       123   37.8401         0         - 0.0674407
   14      16       139       123   35.1373         0         - 0.0674407
   15      16       139       123   32.7949         0         - 0.0674407
   16      16       139       123   30.7451         0         - 0.0674407
   17      16       139       123   28.9364         0         - 0.0674407
   18      16       139       123   27.3289         0         - 0.0674407
   19      16       139       123   25.8905         0         - 0.0674407
2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16       139       123    24.596         0         - 0.0674407
   21      16       139       123   23.4247         0         - 0.0674407
   22      16       139       123     22.36         0         - 0.0674407
   23      16       139       123   21.3878         0         - 0.0674407
   24      16       139       123   20.4966         0         - 0.0674407
   25      16       139       123   19.6768         0         - 0.0674407
   26      16       139       123     18.92         0         - 0.0674407
   27      16       139       123   18.2192         0         - 0.0674407
   28      16       139       123   17.5686         0         - 0.0674407
   29      16       139       123   16.9628         0         - 0.0674407
   30      16       139       123   16.3973         0         - 0.0674407
   31      16       139       123   15.8684         0         - 0.0674407
   32      16       139       123   15.3725         0         - 0.0674407
   33      16       139       123   14.9067         0         - 0.0674407
   34      16       139       123   14.4683         0         - 0.0674407
   35      16       139       123   14.0549         0         - 0.0674407
   36      16       139       123   13.6645         0         - 0.0674407
   37      16       139       123   13.2952         0         - 0.0674407
   38      16       139       123   12.9453         0         - 0.0674407
   39      16       139       123   12.6134         0         - 0.0674407
2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   40      16       139       123   12.2981         0         - 0.0674407
   41      16       139       123   11.9981         0         - 0.0674407

    cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
            election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
     osdmap e19536: 50 osds: 50 up, 50 in
      pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
            91513 GB used, 47642 GB / 135 TB avail
                2752 active+clean

Tried using RBD:

# dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s

# dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s

# dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s