Hi Diego,

you can see the network connection as your HDD cables: an interruption there is like
pulling the HDD cables out of your server and plugging them back in. You can easily
check how much your server likes that with your local HDDs ;-)

And no, Ceph will not protect you from this. If the requested data is on a PG/OSD
that is hit by a network interruption, you will get I/O errors. The question is what
the OS of the VM will do with them. Maybe it will remount the whole disk read-only.
Maybe it will just throw some errors until the connection is good again. Maybe it
will stall/freeze until the connection is good again. Maybe ...

In any case, a stable network connection is the absolute basic requirement for
network storage. If your cloud environment can't provide that, you can't provide
stable services.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

Am 01.04.2016 um 17:31 schrieb Diego Castro:
> Hello Oliver, sorry if I wasn't clear in my first post.
> I agree with you that a network issue isn't desirable, but should it crash
> mounted clients? I mean, shouldn't the client be smart enough to retry the
> connection?
> My point is that public cloud environments don't have the same availability
> as a local setup, so shouldn't we at least keep the clients from freezing?
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-04-01 12:27 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
>
> Hi Diego,
>
> ok, so this is a new scenario.
>
> Before, you said it happens "until i put some load on it".
>
> Now you say you can't reproduce it and mention that it happened during
> a (known) network maintenance.
>
> So I agree with you: we can assume that your problems were based on
> network issues.
>
> That is also what your logs imply:
>
> "failed lossy con, dropping message"
>
> Am 01.04.2016 um 14:07 schrieb Diego Castro:
> > Hello Oliver, this issue turned out to be very hard to reproduce; I
> > couldn't make it happen again.
> > My best guess is something with Azure's network, since last week
> > (when it happened a lot) there was an ongoing maintenance.
> >
> > Here are the outputs:
> >
> > $ ceph -s
> >     cluster 25736883-dbf1-4d7a-8796-50e36f9de7a6
> >      health HEALTH_OK
> >      monmap e1: 4 mons at {osmbr0=10.0.3.4:6789/0,osmbr1=10.0.3.6:6789/0,osmbr2=10.0.3.14:6789/0,osmbr3=10.0.3.7:6789/0}
> >             election epoch 602, quorum 0,1,2,3 osmbr0,osmbr1,osmbr3,osmbr2
> >      osdmap e1816: 10 osds: 10 up, 10 in
> >       pgmap v3158931: 128 pgs, 1 pools, 11512 MB data, 3522 objects
> >             34959 MB used, 10195 GB / 10229 GB avail
> >                  128 active+clean
> >   client io 87723 B/s wr, 8 op/s
> >
> > $ ceph osd df
> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR
> >  6 1.00000  1.00000 1022G  3224M  1019G  0.31 0.92
> >  1 1.00000  1.00000 1022G  3489M  1019G  0.33 1.00
> >  2 1.00000  1.00000 1022G  3945M  1019G  0.38 1.13
> >  4 1.00000  1.00000 1022G  3304M  1019G  0.32 0.95
> >  7 1.00000  1.00000 1022G  3427M  1019G  0.33 0.98
> >  3 1.00000  1.00000 1022G  4361M  1018G  0.42 1.25
> >  9 1.00000  1.00000 1022G  3650M  1019G  0.35 1.04
> >  0 1.00000  1.00000 1022G  3210M  1019G  0.31 0.92
> >  5 1.00000  1.00000 1022G  3577M  1019G  0.34 1.02
> >  8 1.00000  1.00000 1022G  2765M  1020G  0.26 0.79
> >               TOTAL 10229G 34957M 10195G 0.33
> > MIN/MAX VAR: 0.79/1.25  STDDEV: 0.04
> >
> > $ ceph osd perf
> > osd fs_commit_latency(ms) fs_apply_latency(ms)
> >   0                     1                    2
> >   1                     1                    2
> >   2                     2                    3
> >   3                     2                    3
> >   4                     1                    2
> >   5                     2                    3
> >   6                     1                    2
> >   7                     2                    3
> >   8                     1                    2
> >   9                     1                    1
> >
> > ---
> > Diego Castro / The CloudFather
> > GetupCloud.com - Eliminamos a Gravidade
> >
> > 2016-03-31 18:00 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
> >
> > Hi Diego,
> >
> > let's start with the basics. Please give us the output of
> >
> > ceph -s
> > ceph osd df
> > ceph osd perf
> >
> > ideally before and after you provoke the iowait.
> >
> > Thank you!
> >
> > Am 31.03.2016 um 21:38 schrieb Diego Castro:
> > > Hello, everyone.
> > > I have a pretty basic Ceph setup running on top of Azure Cloud
> > > (4 mons and 10 OSDs) for RBD images.
> > > Everything seems to be working as expected until I put some load on it;
> > > sometimes it doesn't complete the process (a mysql restore, for example)
> > > and sometimes it does without any issues.
> > >
> > > Client Kernel: 3.10.0-327.10.1.el7.x86_64
> > > OSD Kernel: 3.10.0-229.7.2.el7.x86_64
> > >
> > > Ceph: ceph-0.94.5-0.el7.x86_64
> > >
> > > On the client side, I have 100% iowait and a lot of "INFO: task blocked
> > > for more than 120 seconds".
> > > On the OSD side, I have no evidence of faulty disks or read/write
> > > latency, but I found the following messages:
> > >
> > > 2016-03-28 17:04:03.425249 7f7329fc5700  0 bad crc in data 641367213 != exp 3107019767
> > > 2016-03-28 17:04:03.440599 7f7329fc5700  0 -- 10.0.3.9:6800/2272 >> 10.0.2.5:0/1998047321 pipe(0x13cc4800 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x13883f40).accept peer addr is really 10.0.2.5:0/1998047321 (socket is 10.0.2.5:34702/0)
> > > 2016-03-28 17:04:03.487497 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20046 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] v1753'32512 uv32512 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b539c0
> > > 2016-03-28 17:04:03.532302 7f733666f700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20047 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 524288~524288] v1753'32513 uv32513 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x1667bc80
> > > 2016-03-28 17:04:03.535143 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20048 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~524288] v1753'32514 uv32514 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b56e00
> > >
> > > ---
> > > Diego Castro / The CloudFather
> > > GetupCloud.com - Eliminamos a Gravidade
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
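[Editor's note] For anyone landing on this thread with the same symptoms: the checks Oliver asked for (ceph -s / ceph osd df / ceph osd perf before and after the load) plus a scan for the two log markers quoted above ("bad crc in data" and "failed lossy con") can be wrapped in a small script. This is only a sketch; the OSD log path is an assumed default for this Ceph version, not something stated in the thread, so adjust it for your deployment.

```shell
#!/bin/sh
# Sketch: snapshot cluster state before/after provoking the iowait, then
# count the network-related markers quoted in this thread.
# OSD_LOG is an assumed default path; override it via the environment.
OSD_LOG="${OSD_LOG:-/var/log/ceph/ceph-osd.0.log}"

snapshot() {
    # Capture the three views requested earlier in the thread,
    # tagged with a label ("before" / "after").
    for cmd in 'ceph -s' 'ceph osd df' 'ceph osd perf'; do
        echo "=== $1: $cmd ==="
        $cmd 2>/dev/null || echo "(ceph not reachable)"
    done
}

snapshot before
# ... provoke the load here (e.g. run the mysql restore) ...
snapshot after

# Both "bad crc in data" and "failed lossy con" point at a flaky
# transport, not at slow disks.
if [ -r "$OSD_LOG" ]; then
    echo "network-error lines: $(grep -cE 'bad crc in data|failed lossy con' "$OSD_LOG")"
else
    echo "no OSD log at $OSD_LOG"
fi
```

Running it once in a quiet period and once during the maintenance window makes the pattern in this thread visible: if the marker count grows while the `ceph osd perf` latencies stay low (as in Diego's output, 1-3 ms), the disks are fine and the network is the problem.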