Re: Frozen Client Mounts

Hi Diego,

You can think of the network connection as your HDD cables.

So if you get interruptions there, it's like pulling the HDD cables out
of your server/computer and plugging them back in.

You can easily check how much your server/computer likes that with your
local HDDs ;-)

----

And no, Ceph will not protect you from this.

If the requested data is on a PG / OSD that experiences a network
interruption, you will get IO errors.

The question is what the VM's OS will do with that. Maybe it will
remount the whole disk read-only.

Maybe it will just throw some errors until it's good again.

Maybe you will have a stale/frozen mount until it's good again.

Maybe .....
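
For example (a quick sketch, assuming an ext4 filesystem inside the
guest; /dev/vda1 is just a placeholder device), you can check what the
guest will do when the block layer returns errors:

$ tune2fs -l /dev/vda1 | grep -i "errors behavior"
Errors behavior:          Continue

That behaviour is controlled by the errors= mount option (continue,
remount-ro or panic), so whether you get a read-only remount or just
some errors depends on how the guest filesystem was mounted.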

In any case, a stable network connection is the absolute basic
requirement for network storage. If your cloud environment can't
provide that, you can't provide stable services.
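
If you want to verify that from the client side (a rough check, not a
proper tool; 10.0.3.9 is one of your OSD hosts from the logs below),
something as simple as

$ ping -i 0.2 -c 1000 10.0.3.9 | tail -2

left running during the maintenance window will already show you packet
loss and latency spikes between client and OSD.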

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 01.04.2016 at 17:31, Diego Castro wrote:
> Hello Oliver, sorry if I wasn't clear in my first post.
> I agree with you that a network issue isn't desirable, but should it
> crash mounted clients? I mean, shouldn't the client be smart enough to
> retry the connection?
> My point is that public cloud environments don't have the same
> availability as a local setup, so shouldn't we at least avoid freezing
> the clients?
> 
> 
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
> 
> 2016-04-01 12:27 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
> 
>     Hi Diego,
> 
>     ok so this is a new case scenario.
> 
>     Before, you said it was "until I put some load on it".
> 
>     Now you say you can't reproduce it, and you mention that it happened
>     during a (known) network maintenance.
> 
>     So I agree with you: we can assume that your problems were caused by
>     network issues.
> 
>     That's also what your logs imply:
> 
>     "failed lossy con, dropping message"
> 
>     --
>     Mit freundlichen Gruessen / Best regards
> 
>     Oliver Dzombic
>     IP-Interactive
> 
>     mailto:info@xxxxxxxxxxxxxxxxx
> 
>     Anschrift:
> 
>     IP Interactive UG ( haftungsbeschraenkt )
>     Zum Sonnenberg 1-3
>     63571 Gelnhausen
> 
>     HRB 93402 beim Amtsgericht Hanau
>     Geschäftsführung: Oliver Dzombic
> 
>     Steuer Nr.: 35 236 3622 1
>     UST ID: DE274086107
> 
> 
>     On 01.04.2016 at 14:07, Diego Castro wrote:
>     > Hello Oliver, this issue turned out to be very hard to reproduce; I
>     > couldn't make it happen again.
>     > My best guess is something with Azure's network, since last week
>     > (when it happened a lot) there was ongoing maintenance.
>     >
>     > Here are the outputs:
>     >
>     > $ ceph -s
>     >     cluster 25736883-dbf1-4d7a-8796-50e36f9de7a6
>     >      health HEALTH_OK
>     >      monmap e1: 4 mons at
>     > {osmbr0=10.0.3.4:6789/0,osmbr1=10.0.3.6:6789/0,osmbr2=10.0.3.14:6789/0,osmbr3=10.0.3.7:6789/0}
>     >             election epoch 602, quorum 0,1,2,3 osmbr0,osmbr1,osmbr3,osmbr2
>     >      osdmap e1816: 10 osds: 10 up, 10 in
>     >       pgmap v3158931: 128 pgs, 1 pools, 11512 MB data, 3522 objects
>     >             34959 MB used, 10195 GB / 10229 GB avail
>     >                  128 active+clean
>     >   client io 87723 B/s wr, 8 op/s
>     >
>     > $ ceph osd df
>     > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR
>     >  6 1.00000  1.00000  1022G  3224M  1019G 0.31 0.92
>     >  1 1.00000  1.00000  1022G  3489M  1019G 0.33 1.00
>     >  2 1.00000  1.00000  1022G  3945M  1019G 0.38 1.13
>     >  4 1.00000  1.00000  1022G  3304M  1019G 0.32 0.95
>     >  7 1.00000  1.00000  1022G  3427M  1019G 0.33 0.98
>     >  3 1.00000  1.00000  1022G  4361M  1018G 0.42 1.25
>     >  9 1.00000  1.00000  1022G  3650M  1019G 0.35 1.04
>     >  0 1.00000  1.00000  1022G  3210M  1019G 0.31 0.92
>     >  5 1.00000  1.00000  1022G  3577M  1019G 0.34 1.02
>     >  8 1.00000  1.00000  1022G  2765M  1020G 0.26 0.79
>     >               TOTAL 10229G 34957M 10195G 0.33
>     > MIN/MAX VAR: 0.79/1.25  STDDEV: 0.04
>     >
>     >
>     >
>     > $ ceph osd perf
>     > osd fs_commit_latency(ms) fs_apply_latency(ms)
>     >   0                     1                    2
>     >   1                     1                    2
>     >   2                     2                    3
>     >   3                     2                    3
>     >   4                     1                    2
>     >   5                     2                    3
>     >   6                     1                    2
>     >   7                     2                    3
>     >   8                     1                    2
>     >   9                     1                    1
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     > ---
>     > Diego Castro / The CloudFather
>     > GetupCloud.com - Eliminamos a Gravidade
>     >
>     > 2016-03-31 18:00 GMT-03:00 Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>:
>     >
>     >     Hi Diego,
>     >
>     >     lets start with the basics and please give us the output of
>     >
>     >     ceph -s
>     >     ceph osd df
>     >     ceph osd perf
>     >
>     >     ideally before and after you provoke the iowait.
>     >
>     >     Thank you !
>     >
>     >     --
>     >     Mit freundlichen Gruessen / Best regards
>     >
>     >     Oliver Dzombic
>     >     IP-Interactive
>     >
>     >     mailto:info@xxxxxxxxxxxxxxxxx
>     >
>     >     Anschrift:
>     >
>     >     IP Interactive UG ( haftungsbeschraenkt )
>     >     Zum Sonnenberg 1-3
>     >     63571 Gelnhausen
>     >
>     >     HRB 93402 beim Amtsgericht Hanau
>     >     Geschäftsführung: Oliver Dzombic
>     >
>     >     Steuer Nr.: 35 236 3622 1
>     >     UST ID: DE274086107
>     >
>     >
>     >     On 31.03.2016 at 21:38, Diego Castro wrote:
>     >     > Hello, everyone.
>     >     > I have a pretty basic Ceph setup running on top of the Azure
>     >     > cloud (4 mons and 10 OSDs) for RBD images.
>     >     > Everything seems to be working as expected until I put some load
>     >     > on it: sometimes it doesn't complete the process (a MySQL
>     >     > restore, for example) and sometimes it does, without any issues.
>     >     >
>     >     >
>     >     > Client Kernel: 3.10.0-327.10.1.el7.x86_64
>     >     > OSD Kernel: 3.10.0-229.7.2.el7.x86_64
>     >     >
>     >     > Ceph: ceph-0.94.5-0.el7.x86_64
>     >     >
>     >     > On the client side, I have 100% iowait and a lot of "INFO: task
>     >     > blocked for more than 120 seconds" messages.
>     >     > On the OSD side, I have no evidence of faulty disks or read/write
>     >     > latency, but I found the following messages:
>     >     >
>     >     >
>     >     > 2016-03-28 17:04:03.425249 7f7329fc5700  0 bad crc in data 641367213 != exp 3107019767
>     >     > 2016-03-28 17:04:03.440599 7f7329fc5700  0 -- 10.0.3.9:6800/2272 >> 10.0.2.5:0/1998047321 pipe(0x13cc4800 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x13883f40).accept peer addr is really 10.0.2.5:0/1998047321 (socket is 10.0.2.5:34702/0)
>     >     > 2016-03-28 17:04:03.487497 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20046 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] v1753'32512 uv32512 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b539c0
>     >     > 2016-03-28 17:04:03.532302 7f733666f700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20047 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 524288~524288] v1753'32513 uv32513 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x1667bc80
>     >     > 2016-03-28 17:04:03.535143 7f7333e6a700  0 -- 10.0.3.9:6800/2272 submit_message osd_op_reply(20048 rb.0.6040.238e1f29.000000000074 [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~524288] v1753'32514 uv32514 ondisk = 0) v6 remote, 10.0.2.5:0/1998047321, failed lossy con, dropping message 0x12b56e00
>     >     >
>     >     > ---
>     >     > Diego Castro / The CloudFather
>     >     > GetupCloud.com - Eliminamos a Gravidade
>     >     >
>     >     >
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



