Hi Diego,

you can see the network connection as your HDD cables: an interruption there is like
pulling the HDD cables out of your server and plugging them back in. You can easily
check how much your server likes that with your local HDDs ;-)

And no, Ceph will not protect you from this. If the requested data is on a PG/OSD
that is hit by a network interruption, you will get I/O errors. The question is what
the OS of the VM will do with them. Maybe it will remount the whole disk read-only.
Maybe it will just throw some errors until the connection is good again. Maybe it
will stall/freeze until the connection is good again. Maybe ...

In any case, a stable network connection is the absolute basic requirement for
network storage. If your cloud environment can't provide that, you can't provide
stable services.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

Am 01.04.2016 um 17:31 schrieb Diego Castro:
> Hello Oliver, sorry if I wasn't clear in my first post.
> I agree with you that a network issue isn't desirable, but should it crash
> mounted clients? I mean, shouldn't the client be smart enough to retry the
> connection?
> My point is that public cloud environments don't have the same availability
> as a local setup, so shouldn't we at least keep the clients from freezing?
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-04-01 12:27 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
>
> Hi Diego,
>
> ok, so this is a new scenario.
>
> Before, you said it happens "until i put some load on it".
>
> Now you say you can't reproduce it and mention that it happened during
> a (known) network maintenance.
>
> So I agree with you: we can assume that your problems were based on
> network issues.
>
> That is also what your logs imply:
>
> "failed lossy con, dropping message"
>
> Am 01.04.2016 um 14:07 schrieb Diego Castro:
> > Hello Oliver, this issue turned out to be very hard to reproduce; I
> > couldn't make it happen again.
> > My best guess is something with Azure's network, since last week
> > (when it happened a lot) there was an ongoing maintenance.
> >
> > Here are the outputs:
> >
> > $ ceph -s
> >     cluster 25736883-dbf1-4d7a-8796-50e36f9de7a6
> >      health HEALTH_OK
> >      monmap e1: 4 mons at {osmbr0=10.0.3.4:6789/0,osmbr1=10.0.3.6:6789/0,osmbr2=10.0.3.14:6789/0,osmbr3=10.0.3.7:6789/0}
> >             election epoch 602, quorum 0,1,2,3 osmbr0,osmbr1,osmbr3,osmbr2
> >      osdmap e1816: 10 osds: 10 up, 10 in
> >       pgmap v3158931: 128 pgs, 1 pools, 11512 MB data, 3522 objects
> >             34959 MB used, 10195 GB / 10229 GB avail
> >                  128 active+clean
> >   client io 87723 B/s wr, 8 op/s
> >
> > $ ceph osd df
> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR
> >  6 1.00000  1.00000 1022G  3224M  1019G  0.31 0.92
> >  1 1.00000  1.00000 1022G  3489M  1019G  0.33 1.00
> >  2 1.00000  1.00000 1022G  3945M  1019G  0.38 1.13
> >  4 1.00000  1.00000 1022G  3304M  1019G  0.32 0.95
> >  7 1.00000  1.00000 1022G  3427M  1019G  0.33 0.98
> >  3 1.00000  1.00000 1022G  4361M  1018G  0.42 1.25
> >  9 1.00000  1.00000 1022G  3650M  1019G  0.35 1.04
> >  0 1.00000  1.00000 1022G  3210M  1019G  0.31 0.92
> >  5 1.00000  1.00000 1022G  3577M  1019G  0.34 1.02
> >  8 1.00000  1.00000 1022G  2765M  1020G  0.26 0.79
> >               TOTAL 10229G 34957M 10195G 0.33
> > MIN/MAX VAR: 0.79/1.25  STDDEV: 0.04
> >
> > $ ceph osd perf
> > osd fs_commit_latency(ms) fs_apply_latency(ms)
> >   0                     1                    2
> >   1                     1                    2
> >   2                     2                    3
> >   3                     2                    3
> >   4                     1                    2
> >   5                     2                    3
> >   6                     1                    2
> >   7                     2                    3
> >   8                     1                    2
> >   9                     1                    1
> >
> > ---
> > Diego Castro / The CloudFather
> > GetupCloud.com - Eliminamos a Gravidade
> >
> > 2016-03-31 18:00 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
> >
> > Hi Diego,
> >
> > let's start with the basics. Please give us the output of
> >
> > ceph -s
> > ceph osd df
> > ceph osd perf
> >
> > ideally before and after you provoke the iowait.
> >
> > Thank you!
> >
> > Am 31.03.2016 um 21:38 schrieb Diego Castro:
> > > Hello, everyone.
> > > I have a pretty basic Ceph setup running on top of Azure Cloud
> > > (4 mons and 10 OSDs) for RBD images.
> > > Everything seems to be working as expected until I put some load on it;
> > > sometimes it doesn't complete the process (a mysql restore, for example)
> > > and sometimes it does without any issues.
> > >
> > > Client Kernel: 3.10.0-327.10.1.el7.x86_64
> > > OSD Kernel: 3.10.0-229.7.2.el7.x86_64
> > >
> > > Ceph: ceph-0.94.5-0.el7.x86_64
> > >
> > > On the client side, I have 100% iowait and a lot of "INFO: task blocked
> > > for more than 120 seconds".
> > > On the OSD side, I have no evidence of faulty disks or read/write
> > > latency, but I found the following messages:
> > >
> > > 2016-03-28 17:04:03.425249 7f7329fc5700  0 bad crc in data 641367213 != exp 3107019767
> > > 2016-03-28 17:04:03.440599 7f7329fc5700  0 -- 10.0.3.9:6800/2272 >> 10.0.2.5:0/1998047321 pipe(0x13cc4800 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x13883f40).accept peer addr is really 10.0.2.5:0/1998047321 (socket is 10.0.2.5:34702/0)
> > > 2016-03-28 17:04:03.487497 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20046 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] v1753'32512 uv32512 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b539c0
> > > 2016-03-28 17:04:03.532302 7f733666f700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20047 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 524288~524288] v1753'32513 uv32513 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x1667bc80
> > > 2016-03-28 17:04:03.535143 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20048 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~524288] v1753'32514 uv32514 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b56e00
> > >
> > > ---
> > > Diego Castro / The CloudFather
> > > GetupCloud.com - Eliminamos a Gravidade
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
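[Editor's note] For anyone landing on this thread with the same symptoms: the checks Oliver asked for (ceph -s / ceph osd df / ceph osd perf before and after the load) plus a scan for the two log markers quoted above ("bad crc in data" and "failed lossy con") can be wrapped in a small script. This is only a sketch; the OSD log path is an assumed default for this Ceph version, not something stated in the thread, so adjust it for your deployment.

```shell
#!/bin/sh
# Sketch: snapshot cluster state before/after provoking the iowait, then
# count the network-related markers quoted in this thread.
# OSD_LOG is an assumed default path; override it via the environment.
OSD_LOG="${OSD_LOG:-/var/log/ceph/ceph-osd.0.log}"

snapshot() {
    # Capture the three views requested earlier in the thread,
    # tagged with a label ("before" / "after").
    for cmd in 'ceph -s' 'ceph osd df' 'ceph osd perf'; do
        echo "=== $1: $cmd ==="
        $cmd 2>/dev/null || echo "(ceph not reachable)"
    done
}

snapshot before
# ... provoke the load here (e.g. run the mysql restore) ...
snapshot after

# Both "bad crc in data" and "failed lossy con" point at a flaky
# transport, not at slow disks.
if [ -r "$OSD_LOG" ]; then
    echo "network-error lines: $(grep -cE 'bad crc in data|failed lossy con' "$OSD_LOG")"
else
    echo "no OSD log at $OSD_LOG"
fi
```

Running it once in a quiet period and once during the maintenance window makes the pattern in this thread visible: if the marker count grows while the `ceph osd perf` latencies stay low (as in Diego's output, 1-3 ms), the disks are fine and the network is the problem.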