Hi Diego,

let's start with the basics. Please send us the output of

ceph -s
ceph osd df
ceph osd perf

ideally both before and after you provoke the iowait.

Thank you!

-- 
Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau (district court)
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 31.03.2016 at 21:38, Diego Castro wrote:
> Hello, everyone.
> I have a pretty basic Ceph setup running on top of Azure Cloud (4 mons
> and 10 OSDs) for rbd images.
> Everything seems to work as expected until I put some load on it.
> Sometimes it doesn't complete the process (a mysql restore, for example)
> and sometimes it does without any issues.
>
> Client Kernel: 3.10.0-327.10.1.el7.x86_64
> OSD Kernel: 3.10.0-229.7.2.el7.x86_64
>
> Ceph: ceph-0.94.5-0.el7.x86_64
>
> On the client side, I have 100% iowait and a lot of "INFO: task blocked
> for more than 120 seconds" messages.
> On the OSD side, I have no evidence of a faulty disk or of read/write
> latency, but I found the following messages:
>
> 2016-03-28 17:04:03.425249 7f7329fc5700 0 bad crc in data 641367213 != exp 3107019767
> 2016-03-28 17:04:03.440599 7f7329fc5700 0 -- 10.0.3.9:6800/2272 >> 10.0.2.5:0/1998047321 pipe(0x13cc4800 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x13883f40).accept peer addr is really 10.0.2.5:0/1998047321 (socket is 10.0.2.5:34702/0)
> 2016-03-28 17:04:03.487497 7f7333e6a700 0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20046 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] v1753'32512 uv32512 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b539c0
> 2016-03-28 17:04:03.532302 7f733666f700 0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20047 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 524288~524288] v1753'32513 uv32513 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x1667bc80
> 2016-03-28 17:04:03.535143 7f7333e6a700 0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20048 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~524288] v1753'32514 uv32514 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b56e00
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com