Moving to ceph-users.

Ian R. Colle
Director of Engineering
Inktank
Delivering the Future of Storage
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
Cell: +1.303.601.7713
Email: ian@xxxxxxxxxxx

On 4/16/14, 7:52 AM, "Ilya Storozhilov" <Ilya_Storozhilov@xxxxxxxx> wrote:

>Hello Ceph developers,
>
>we encountered a problem with a 3-OSD/3-MON Ceph cluster (CentOS 6.5,
>64-bit, Ceph "Emperor" v0.72.2 from RPM) which causes errors on the
>RGW node. The relevant RGW output is:
>
>2014-04-15 09:50:03.906527 7fcd4c5e6700 1 ====== starting new request req=0x1811b20 =====
>2014-04-15 09:50:03.908509 7fcd64dfa700 1 Executing get_obj operation
>2014-04-15 09:50:03.909138 7fcd64dfa700 0 ERROR: s->cio->print() returned err=-1
>2014-04-15 09:50:03.909164 7fcd64dfa700 0 ERROR: s->cio->print() returned err=-1
>2014-04-15 09:50:03.909168 7fcd64dfa700 0 ERROR: s->cio->print() returned err=-1
>2014-04-15 09:50:03.909171 7fcd64dfa700 0 ERROR: s->cio->print() returned err=-1
>2014-04-15 09:50:03.909401 7fcd64dfa700 1 ====== req done req=0x16f4c10 http_status=403 ======
>2014-04-15 09:50:03.909444 7fcd64dfa700 1 ====== starting new request req=0x1816610 =====
>2014-04-15 09:50:07.444606 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd66bfd700' had timed out after 600
>2014-04-15 09:50:07.444628 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd245a6700' had timed out after 600
>2014-04-15 09:50:07.444634 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd263a9700' had timed out after 600
>2014-04-15 09:50:07.444639 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd28bad700' had timed out after 600
>2014-04-15 09:50:07.444642 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd2a9b0700' had timed out after 600
>2014-04-15 09:50:07.444649 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd349c0700' had timed out after 600
>2014-04-15 09:50:07.444652 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd367c3700' had timed out after 600
>2014-04-15 09:50:07.444668 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd385c6700' had timed out after 600
>2014-04-15 09:50:07.444673 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3adca700' had timed out after 600
>2014-04-15 09:50:07.444676 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3b7cb700' had timed out after 600
>2014-04-15 09:50:07.444681 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3e9d0700' had timed out after 600
>2014-04-15 09:50:07.444685 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3fdd2700' had timed out after 600
>2014-04-15 09:50:07.444689 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd411d4700' had timed out after 600
>2014-04-15 09:50:07.444694 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd475de700' had timed out after 600
>2014-04-15 09:50:07.444699 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4a7e3700' had timed out after 600
>2014-04-15 09:50:07.444703 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4bbe5700' had timed out after 600
>2014-04-15 09:50:07.444708 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4cfe7700' had timed out after 600
>2014-04-15 09:50:07.444713 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd533f1700' had timed out after 600
>2014-04-15 09:50:07.444717 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd547f3700' had timed out after 600
>2014-04-15 09:50:07.444728 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd55bf5700' had timed out after 600
>2014-04-15 09:50:07.444734 7fce4f6f8700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd579f8700' had timed out after 600
>2014-04-15 09:50:10.155239 7fcd2a9b0700 1 ====== req done req=0x15831f0 http_status=200 ======
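The heartbeat_map messages above only tell you that RGW worker threads have been waiting on RADOS requests for more than 600 seconds, so radosgw looks like a victim of blocked OSD ops rather than the culprit. A quick way to confirm is to ask an OSD what it has in flight, e.g. (assuming the default admin socket path under /var/run/ceph and using osd.0 as an example id):

$ ceph health detail
$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

If the stuck ops all sit in the same state ("commit sent", "waiting for subops"), that points at a particular peer OSD or its journal rather than at RGW.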
>
>An output of the "ceph -s" command is:
>
>$ sudo ceph -s
>    cluster 5c6c95b8-8234-4b34-96a9-facc3c561b72
>     health HEALTH_WARN 88 requests are blocked > 32 sec
>     monmap e2: 3 mons at {ceph-81=10.44.1.81:6789/0,ceph-82=10.44.1.82:6789/0,ceph-83=10.44.1.83:6789/0}, election epoch 22, quorum 0,1,2 ceph-81,ceph-82,ceph-83
>     osdmap e519: 3 osds: 3 up, 3 in
>      pgmap v407538: 296 pgs, 32 pools, 64060 kB data, 280 objects
>            17388 MB used, 1058 GB / 1133 GB avail
>                 296 active+clean
>  client io 2148 B/s rd, 102 B/s wr, 3 op/s
>
>Ceph complains with the following messages in the OSD log file:
>
>2014-04-16 02:21:18.065305 osd.0 [WRN] slow request 30.613493 seconds old, received at 2014-04-16 02:20:47.451757: osd_op(client.5793.0:1270 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 1572864~524288] 172.d7dc6717 e608) v4 currently commit sent
>2014-04-16 02:21:18.065309 osd.0 [WRN] slow request 30.611804 seconds old, received at 2014-04-16 02:20:47.453446: osd_op(client.5793.0:1272 2014-04-16-02-default.5793.1-s3testqwer2 [append 0~276] 104.2973847e e608) v4 currently commit sent
>2014-04-16 02:21:18.065312 osd.0 [WRN] slow request 30.610639 seconds old, received at 2014-04-16 02:20:47.454611: osd_op(client.5793.0:1273 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 2097152~524288] 172.d7dc6717 e608) v4 currently commit sent
>2014-04-16 02:21:18.065316 osd.0 [WRN] slow request 30.608710 seconds old, received at 2014-04-16 02:20:47.456540: osd_op(client.5793.0:1274 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 2621440~524288] 172.d7dc6717 e608) v4 currently commit sent
>2014-04-16 02:21:18.065319 osd.0 [WRN] slow request 30.605942 seconds old, received at 2014-04-16 02:20:47.459308: osd_op(client.5793.0:1275 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 3145728~329984] 172.d7dc6717 e608) v4 currently commit sent
>
>After increasing the OSD log verbosity we got the following messages from the OSD:
>
>2014-04-16 03:23:32.337961 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45949 s=2 pgs=15930 cs=2649 l=0 c=0x565a680).fault, initiating reconnect
>2014-04-16 03:23:32.338939 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45950 s=2 pgs=15931 cs=2651 l=0 c=0x565a680).fault, initiating reconnect
>2014-04-16 03:23:32.339956 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45951 s=2 pgs=15932 cs=2653 l=0 c=0x565a680).fault, initiating reconnect
>2014-04-16 03:23:32.341167 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45952 s=2 pgs=15933 cs=2655 l=0 c=0x565a680).fault, initiating reconnect
>2014-04-16 03:23:32.342212 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45953 s=2 pgs=15934 cs=2657 l=0 c=0x565a680).fault, initiating reconnect
>2014-04-16 03:23:32.343330 7ff93987a700 0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45954 s=2 pgs=15935 cs=2659 l=0 c=0x565a680).fault, initiating reconnect
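The rapid fault/reconnect cycle above (cs= climbing by 2 roughly every millisecond between 10.44.1.82:6801 and 10.44.1.81:6805) usually means either a flaky network path between the two OSD hosts or a peer whose messenger is wedged. It is worth ruling out the network first and, if that looks clean, briefly raising the messenger debug level; roughly something like this (the -s 8972 ping assumes a ~9000-byte MTU and only matters if you run jumbo frames):

$ ping -c 5 10.44.1.81
$ ping -M do -s 8972 -c 5 10.44.1.81
$ ceph tell osd.* injectargs '--debug-ms 1'
  (reproduce the problem, watch /var/log/ceph/ceph-osd.*.log on both hosts)
$ ceph tell osd.* injectargs '--debug-ms 0'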
>
>2014-04-16 03:25:55.170361 7ff955f87700 0 log [WRN] : 15 slow requests, 2 included below; oldest blocked for > 33.673854 secs
>2014-04-16 03:25:55.170373 7ff955f87700 0 log [WRN] : slow request 33.636609 seconds old, received at 2014-04-16 03:25:21.533670: osd_op(client.5793.0:5387 default.5793.2__shadow__MMYl0ajXAfpaBMg55xl6ZVlD7U1lstn_3 [writefull 0~87104] 172.12238d13 e617) v4 currently waiting for subops from [2]
>2014-04-16 03:25:55.170380 7ff955f87700 0 log [WRN] : slow request 33.122809 seconds old, received at 2014-04-16 03:25:22.047470: osd_op(client.5793.0:5394 default.5793.2_myobjects_14 [cmpxattr user.rgw.idtag (17) op 1 mode 1,create 0~0,delete,setxattr user.rgw.idtag (17),writefull 0~524288,setxattr user.rgw.manifest (818),setxattr user.rgw.acl (185),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33)] 172.3a483fca e617) v4 currently waiting for subops from [2]
>
>This page
>http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>contains the following information about this problem:
>
>Possible causes include:
>
>    A bad drive (check dmesg output)
>    A bug in the kernel file system (check dmesg output)
>    An overloaded cluster (check system load, iostat, etc.)
>    A bug in the ceph-osd daemon
>
>Possible solutions:
>
>    Remove VMs (cloud solutions) from Ceph hosts
>    Upgrade kernel
>    Upgrade Ceph
>    Restart OSDs
>
>We checked our environment for a bad drive, a kernel bug and excessive
>system load and found that everything works fine, but the OSD consumes
>100% of a CPU core when this error appears. Restarting the OSDs did
>not help.
>
>Can you help us in some way, e.g. give us advice on where to dig?
>Thank you very much!
>
>Best Regards,
>
>Ilya Storozhilov
>Lead Software Engineer
>
>EPAM Systems
>Tver office, Russia
>GMT+4
>
>EPAM Internal ext.: 55529
>Office phone: +7 (4822) 630-070 ext. 55529
>Office fax: +7 (4822) 630-073
>Mobile phone: +7 (904) 021-0763
>E-mail: Ilya_Storozhilov@xxxxxxxx
>
>http://www.epam.com
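Since one ceph-osd is pinned at 100% CPU while the hardware checks out, it would also help to see where it is spinning. With one OSD per host, as appears to be the case here, something like the following shows the hot thread and, if perf is installed, the hot functions:

$ top -H -p $(pidof ceph-osd)
$ sudo perf top -p $(pidof ceph-osd)

If the time is going into the messenger (Pipe::reader/Pipe::writer in this release), that would tie in with the reconnect storm in the logs above and be useful data for a tracker ticket.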