Re: - cluster stuck and undersized if at least one osd is down


 



Hello,

On Wed, 30 Nov 2016 13:39:50 +1000 Brad Hubbard wrote:

> 
> 
> On Tue, Nov 29, 2016 at 11:37 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx> wrote:
> > Hi,
> >
> > As far as I understand, if I set pool size 2 there is a chance of losing
> > data when another osd dies while a rebuild is ongoing. However, that would
> > have to happen on a different host, because my crushmap forbids storing
> > replicas on the same physical node.
> 
> I am not talking about size, I am talking specifically about min_size,
> regardless of size.
> 
> > I am not sure what would change if I set min_size 2, because the only
> > thing I would get is that there is no I/O to objects with fewer than 2
> > replicas while a rebuild is ongoing. And in that case
> 
> Which is exactly what you should want if you are concerned about data
> integrity. If you allow I/O to be served by a single OSD via min_size=1
> and you then lose that OSD, you will lose any changes to those PGs,
> resulting in inconsistency and missing objects. You are removing the
> consistency guarantees that Ceph provides. So if you use min_size=1 you
> need to be comfortable with the likelihood that you *will* lose data at
> some stage. Usually this is not what people implementing Ceph want.
> 
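For reference, checking and raising min_size is quick to do; roughly (the
pool name is taken from the dump further down, adjust as needed):

  # show the current replication settings of the pool
  ceph osd pool get data size
  ceph osd pool get data min_size
  # stop serving I/O to a PG once only a single replica is left
  ceph osd pool set data min_size 2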

I'm using size=2 and min_size=1 on our main cluster.
However, that one has OSDs that are backed by RAID10s (so essentially a
replication of 4), thus losing an OSD (in the permanent disk failure
sense) is very, VERY unlikely.

Choosing size=2 for any OSDs that are backed by plain disks is shouting
for Murphy to come and punish you for your insolence.
A min_size of 1 just gives him another chance to smite you.

There used to be a corner case where, with a size of 3 and a min_size of
2, a dual disk failure would result in blocked I/O even though the PGs had
already been remapped and new writes were going to healthy OSDs.
So people set min_size to 1 to work around that.
AFAIK this has been addressed, but confirmation would be nice.
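
If somebody wants to re-test that on a current release, something along
these lines should show whether client I/O really blocks while the PGs are
being remapped (the pgid is just a placeholder):

  ceph health detail          # degraded/undersized PG counts, blocked requests
  ceph pg dump_stuck unclean  # PGs that are stuck in a non-clean state
  ceph pg <pgid> query        # per-PG detail on why it is not active+clean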

Christian

> > my VMs wouldn't be able to read data from the ceph pool. But maybe I got
> > it wrong.
> >
> >
> > W dniu 29.11.2016 o 03:08, Brad Hubbard pisze:
> >
> >>
> >> On Mon, Nov 28, 2016 at 9:54 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx>
> >> wrote:
> >>>
> >>> Hi,
> >>> I recently installed a 3-node ceph cluster, v10.2.3. It has 3 mons and
> >>> 12 osds. I removed the default pool and created the following one:
> >>>
> >>> pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> >>> rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool
> >>> stripe_width 0
> >>
> >> Do you understand the significance of min_size 1?
> >>
> >> Are you OK with the likelihood of data loss that this value introduces?
> >>
> >>> The cluster is healthy when all osds are up; however, if I stop any of
> >>> the osds, it becomes stuck and undersized - it is not rebuilding.
> >>>
> >>>      cluster *****
> >>>       health HEALTH_WARN
> >>>              166 pgs degraded
> >>>              108 pgs stuck unclean
> >>>              166 pgs undersized
> >>>              recovery 67261/827220 objects degraded (8.131%)
> >>>              1/12 in osds are down
> >>>       monmap e3: 3 mons at
> >>> {**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*****.146:6789/0}
> >>>              election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
> >>>       osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
> >>>              flags sortbitwise
> >>>        pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
> >>>              2452 GB used, 42231 GB / 44684 GB avail
> >>>              67261/827220 objects degraded (8.131%)
> >>>                   858 active+clean
> >>>                   166 active+undersized+degraded
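A quick way to see which OSD those undersized PGs are waiting on, more or
less:

  ceph osd tree                             # which OSD is down, and on which host
  ceph pg dump pgs_brief | grep undersized  # affected PGs and their acting sets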
> >>>
> >>> Replica size is 2 and I use the following crushmap:
> >>>
> >>> # begin crush map
> >>> tunable choose_local_tries 0
> >>> tunable choose_local_fallback_tries 0
> >>> tunable choose_total_tries 50
> >>> tunable chooseleaf_descend_once 1
> >>> tunable chooseleaf_vary_r 1
> >>> tunable straw_calc_version 1
> >>>
> >>> # devices
> >>> device 0 osd.0
> >>> device 1 osd.1
> >>> device 2 osd.2
> >>> device 3 osd.3
> >>> device 4 osd.4
> >>> device 5 osd.5
> >>> device 6 osd.6
> >>> device 7 osd.7
> >>> device 8 osd.8
> >>> device 9 osd.9
> >>> device 10 osd.10
> >>> device 11 osd.11
> >>>
> >>> # types
> >>> type 0 osd
> >>> type 1 host
> >>> type 2 chassis
> >>> type 3 rack
> >>> type 4 row
> >>> type 5 pdu
> >>> type 6 pod
> >>> type 7 room
> >>> type 8 datacenter
> >>> type 9 region
> >>> type 10 root
> >>>
> >>> # buckets
> >>> host osd01 {
> >>>          id -2           # do not change unnecessarily
> >>>          # weight 14.546
> >>>          alg straw
> >>>          hash 0  # rjenkins1
> >>>          item osd.0 weight 3.636
> >>>          item osd.1 weight 3.636
> >>>          item osd.2 weight 3.636
> >>>          item osd.3 weight 3.636
> >>> }
> >>> host osd02 {
> >>>          id -3           # do not change unnecessarily
> >>>          # weight 14.546
> >>>          alg straw
> >>>          hash 0  # rjenkins1
> >>>          item osd.4 weight 3.636
> >>>          item osd.5 weight 3.636
> >>>          item osd.6 weight 3.636
> >>>          item osd.7 weight 3.636
> >>> }
> >>> host osd03 {
> >>>          id -4           # do not change unnecessarily
> >>>          # weight 14.546
> >>>          alg straw
> >>>          hash 0  # rjenkins1
> >>>          item osd.8 weight 3.636
> >>>          item osd.9 weight 3.636
> >>>          item osd.10 weight 3.636
> >>>          item osd.11 weight 3.636
> >>> }
> >>> root default {
> >>>          id -1           # do not change unnecessarily
> >>>          # weight 43.637
> >>>          alg straw
> >>>          hash 0  # rjenkins1
> >>>          item osd01 weight 14.546
> >>>          item osd02 weight 14.546
> >>>          item osd03 weight 14.546
> >>> }
> >>>
> >>> # rules
> >>> rule replicated_ruleset {
> >>>          ruleset 0
> >>>          type replicated
> >>>          min_size 1
> >>>          max_size 10
> >>>          step take default
> >>>          step chooseleaf firstn 0 type host
> >>>          step emit
> >>> }
> >>>
> >>> # end crush map
> >>>
> >>> I am not sure what the reason for the undersized state is. All osd
> >>> disks are the same size and the replica size is 2. Also, data is only
> >>> replicated on a per-host basis and I have 3 separate hosts. Maybe the
> >>> number of PGs is incorrect? Is 1024 too big? Or maybe there is some
> >>> misconfiguration in the crushmap?
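
As for pg_num, the rough rule of thumb is (number of OSDs x 100) / replica
size, rounded to a power of two: 12 x 100 / 2 = 600, so 512 or 1024 are
both in the right ballpark, and pg_num by itself does not leave PGs
undersized. To rule the crushmap out, crushtool can simulate the rule
offline, roughly like this (file names are arbitrary):

  # pull the compiled crushmap from the cluster
  ceph osd getcrushmap -o crushmap.bin
  # simulate rule 0 with 2 replicas and inspect the resulting mappings
  crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-statistics
  crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings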
> >>>
> >>>
> >>> Kind regards,
> >>> Piotr Dzionek
> >>>
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@xxxxxxxxxxxxxx
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>
> >>
> >
> > --
> > Piotr Dzionek
> > System Administrator
> >
> > SEQR Poland Sp. z o.o.
> > ul. Łąkowa 29, 90-554 Łódź, Poland
> > Mobile: +48 796555587
> > Mail: piotr.dzionek@xxxxxxxx
> > www.seqr.com | www.seamless.se
> >
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



