On Mon, Sep 7, 2015 at 7:39 PM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
Hi Vickey,
Thanks a lot for replying about my problem.
I had this exact same problem last week, resolved by rebooting all of my OSD nodes. I have yet to figure out why it happened, though. I _suspect_ in my case it's due to a failing controller on a particular box I've had trouble with in the past.
Mine is a 5-node cluster with 12 OSDs per node, and in the past there have never been any hardware problems.
I tried setting 'noout', stopping my OSDs one host at a time, then rerunning RADOS bench in between to see if I could nail down the problematic machine. Depending on your # of hosts, this might work for you. Admittedly, I got impatient with this approach, though, and just ended up restarting everything (which worked!) :)
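For reference, the rough sequence looked something like this (from memory, so treat it as a sketch; the service commands are the sysvinit form Hammer uses on EL6-style boxes, adjust for your init system):

# ceph osd set noout
# service ceph stop osd            (on one OSD host at a time)
# rados bench -p rbd 60 write      (from a client, once 'ceph -s' settles)
# service ceph start osd           (bring that host back before moving to the next one)
# ceph osd unset noout             (when you're done)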
So do you mean you intentionally brought one node's OSDs down, so that some OSDs were down but none of them were out (noout)? Then you waited for some time for the cluster to become healthy, and then you reran rados bench?
If you have a bunch of blocked ops, you could maybe try a 'pg query' on the PGs involved and see if there's a common OSD with all of your blocked ops. In my experience, it's not necessarily the one reporting.
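Something along these lines should do it (the PG id below is just an example, substitute one from your blocked ops):

# ceph health detail                                (lists which OSDs have blocked requests)
# ceph pg 3.1f query | grep -A 5 acting             (shows the acting set for that PG)
# ceph osd perf                                     (look for OSDs with unusually high commit/apply latency)
# grep 'slow request' /var/log/ceph/ceph-osd.*.log  (on the OSD hosts themselves)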
Yeah, I have 55 OSDs, and every time a random OSD shows blocked ops, so I can't blame any specific OSD. After a few minutes that blocked OSD becomes clean, and after some time some other OSD blocks ops.
Thanks, I will try restarting all OSD / monitor daemons and see if that fixes it. Is there anything I need to keep in mind when restarting the OSDs (except nodown, noout)?
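Here is roughly what I am planning to run, one node at a time (just a sketch for sysvinit on CentOS 6, please correct me if something is wrong):

# ceph osd set noout
# ceph osd set nodown
# service ceph restart osd        (on each OSD node, waiting for 'ceph -s' to settle in between)
# service ceph restart mon        (on each monitor node)
# ceph osd unset nodown
# ceph osd unset noout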
Anecdotally, I've had trouble with Intel 10Gb NICs and custom kernels as well. I've seen a NIC appear to be happy (no messages in dmesg, the machine appears to be communicating normally, etc.), but when I went to iperf it, I was getting super pitiful performance (like KB/s). I don't know what kind of NICs you're using, but you may want to iperf everything just in case.
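A quick pairwise test between hosts is usually enough to catch that, something like (iperf2 syntax; adjust runtime and stream count to taste):

# iperf -s                          (on the host being tested)
# iperf -c <server-ip> -t 30 -P 4   (from each of the other hosts in turn)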
Yeah, I did that; iperf shows no problem.
Is there anything else I should do?
--Lincoln
On 9/7/2015 9:36 AM, Vickey Singh wrote:
Dear Experts,

Can someone please help me figure out why my cluster is not able to write data? See the output below: cur MB/s is 0 and avg MB/s keeps decreasing.

Ceph Hammer 0.94.2
CentOS 6 (3.10.69-1)

The Ceph status says OPS are blocked. I have tried checking everything I know of:

- System resources (CPU, net, disk, memory) -- all normal
- 10G network for public and cluster network -- no saturation
- All disks are physically healthy
- No messages in /var/log/messages or dmesg
- Tried restarting the OSDs which are blocking operations, but no luck
- Tried writing through RBD and rados bench; both give the same problem

Please help me to fix this problem.

# rados bench -p rbd 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
 Object prefix: benchmark_data_stor1_1791844
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       125       109   435.873       436  0.022076 0.0697864
     2      16       139       123   245.948        56  0.246578 0.0674407
     3      16       139       123   163.969         0         - 0.0674407
     4      16       139       123   122.978         0         - 0.0674407
     5      16       139       123    98.383         0         - 0.0674407
     6      16       139       123   81.9865         0         - 0.0674407
     7      16       139       123   70.2747         0         - 0.0674407
     8      16       139       123   61.4903         0         - 0.0674407
     9      16       139       123   54.6582         0         - 0.0674407
    10      16       139       123   49.1924         0         - 0.0674407
    11      16       139       123   44.7201         0         - 0.0674407
    12      16       139       123   40.9934         0         - 0.0674407
    13      16       139       123   37.8401         0         - 0.0674407
    14      16       139       123   35.1373         0         - 0.0674407
    15      16       139       123   32.7949         0         - 0.0674407
    16      16       139       123   30.7451         0         - 0.0674407
    17      16       139       123   28.9364         0         - 0.0674407
    18      16       139       123   27.3289         0         - 0.0674407
    19      16       139       123   25.8905         0         - 0.0674407
2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16       139       123    24.596         0         - 0.0674407
    21      16       139       123   23.4247         0         - 0.0674407
    22      16       139       123     22.36         0         - 0.0674407
    23      16       139       123   21.3878         0         - 0.0674407
    24      16       139       123   20.4966         0         - 0.0674407
    25      16       139       123   19.6768         0         - 0.0674407
    26      16       139       123     18.92         0         - 0.0674407
    27      16       139       123   18.2192         0         - 0.0674407
    28      16       139       123   17.5686         0         - 0.0674407
    29      16       139       123   16.9628         0         - 0.0674407
    30      16       139       123   16.3973         0         - 0.0674407
    31      16       139       123   15.8684         0         - 0.0674407
    32      16       139       123   15.3725         0         - 0.0674407
    33      16       139       123   14.9067         0         - 0.0674407
    34      16       139       123   14.4683         0         - 0.0674407
    35      16       139       123   14.0549         0         - 0.0674407
    36      16       139       123   13.6645         0         - 0.0674407
    37      16       139       123   13.2952         0         - 0.0674407
    38      16       139       123   12.9453         0         - 0.0674407
    39      16       139       123   12.6134         0         - 0.0674407
2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    40      16       139       123   12.2981         0         - 0.0674407
    41      16       139       123   11.9981         0         - 0.0674407

    cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
            election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
     osdmap e19536: 50 osds: 50 up, 50 in
      pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
            91513 GB used, 47642 GB / 135 TB avail
                2752 active+clean

Tried using RBD:

# dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s

# dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s

# dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com