Re: cluster stuck and undersized if at least one osd is down

On Wed, Nov 30, 2016 at 1:54 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Wed, 30 Nov 2016 13:39:50 +1000 Brad Hubbard wrote:
>
>>
>>
>> On Tue, Nov 29, 2016 at 11:37 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx> wrote:
>> > Hi,
>> >
>> > As far as I understand, if I set pool size 2 there is a chance to lose data
>> > when another osd dies while a rebuild is ongoing. However, it has to
>> > occur on a different host, because my crushmap forbids storing replicas
>> > on the same physical node.
>>
>> I am not talking about size, I am talking specifically about min_size,
>> regardless of size.
>>
>> > I am not sure what would change if I set min_size
>> > 2, because the only thing I would get is that there are no IOs to objects
>> > with fewer than 2 replicas while a rebuild is ongoing. And in that case
>>
>> Which is exactly what you should want if you are concerned about data
>> integrity. If you allow IO to be served by a single OSD via min_size=1 and
>> you then lose that OSD, you will lose any changes to those pgs, resulting
>> in inconsistency and missing objects. You are removing the consistency
>> guarantees that ceph provides. So if you use min_size=1 you need to be
>> comfortable with the fact that there is a likelihood you *will* lose data at
>> some stage. Usually this is not what people implementing ceph want.
>>
>
> I'm using size=2 and min_size=1 on our main cluster.
> However, that one has OSDs that are backed by RAID10s (so essentially a
> replication of 4), thus losing an OSD (in the permanent disk failure
> sense) is very, VERY unlikely.

But you would accept the data loss if such an unlikely event did occur, right?
You understand the implications of what you are doing and accept the risk;
you are not arguing there is no risk. Your configuration builds a data
guarantee back in at a lower level, so you are factoring that into your risk
evaluation.
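
For anyone else reading along, the current values for a given pool can be
checked with the usual pool queries. A minimal sketch, using the 'data' pool
name from Piotr's dump further down in the thread:

    # check the current replication settings (pool name from Piotr's dump)
    ceph osd pool get data size
    ceph osd pool get data min_size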

>
> Choosing a size=2 for any OSDs that are backed by plain disks is shouting
> for Murphy to come and punish you for your insolence.
> A min_size of 1 just gives him another chance to smite you.

Right, anything is possible, and you have to weigh risk against other
factors, including HA. I just want to make sure people are aware of the risks.
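
If the decision is to move back towards the safer defaults, both settings can
be changed on a live pool. A minimal sketch against Piotr's 'data' pool below
(raising size will trigger backfill, so expect recovery traffic):

    # raise replication to 3 copies and require 2 for IO (pool name from Piotr's dump)
    ceph osd pool set data size 3
    ceph osd pool set data min_size 2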

>
> There used to be a corner case where, with a size of 3 and min_size of 2, a
> dual disk failure would result in blocked I/Os despite things having been
> remapped already and new writes going to healthy OSDs.
> So people set min_size to 1 to cover for that.
> AFAIK, this has been addressed, but confirmation would be nice.

Do you have a tracker?

>
> Christian
>
>> > my vms wouldn't be able to read data from the ceph pool. But maybe I got
>> > it wrong.
>> >
>> >
>> > W dniu 29.11.2016 o 03:08, Brad Hubbard pisze:
>> >
>> >>
>> >> On Mon, Nov 28, 2016 at 9:54 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>> I recently installed a 3 node ceph cluster, v10.2.3. It has 3 mons and 12
>> >>> osds. I removed the default pool and created the following one:
>> >>>
>> >>> pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >>> rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool
>> >>> stripe_width 0
>> >>
>> >> Do you understand the significance of min_size 1?
>> >>
>> >> Are you OK with the likelihood of data loss that this value introduces?
>> >>
>> >>> The cluster is healthy if all osds are up; however, if I stop any of the
>> >>> osds, it becomes stuck and undersized - it does not rebuild.
>> >>>
>> >>>      cluster *****
>> >>>       health HEALTH_WARN
>> >>>              166 pgs degraded
>> >>>              108 pgs stuck unclean
>> >>>              166 pgs undersized
>> >>>              recovery 67261/827220 objects degraded (8.131%)
>> >>>              1/12 in osds are down
>> >>>       monmap e3: 3 mons at
>> >>> {**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*****.146:6789/0}
>> >>>              election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
>> >>>       osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
>> >>>              flags sortbitwise
>> >>>        pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
>> >>>              2452 GB used, 42231 GB / 44684 GB avail
>> >>>              67261/827220 objects degraded (8.131%)
>> >>>                   858 active+clean
>> >>>                   166 active+undersized+degraded
>> >>>
>> >>> Replica size is 2 and and I use the following crushmap:
>> >>>
>> >>> # begin crush map
>> >>> tunable choose_local_tries 0
>> >>> tunable choose_local_fallback_tries 0
>> >>> tunable choose_total_tries 50
>> >>> tunable chooseleaf_descend_once 1
>> >>> tunable chooseleaf_vary_r 1
>> >>> tunable straw_calc_version 1
>> >>>
>> >>> # devices
>> >>> device 0 osd.0
>> >>> device 1 osd.1
>> >>> device 2 osd.2
>> >>> device 3 osd.3
>> >>> device 4 osd.4
>> >>> device 5 osd.5
>> >>> device 6 osd.6
>> >>> device 7 osd.7
>> >>> device 8 osd.8
>> >>> device 9 osd.9
>> >>> device 10 osd.10
>> >>> device 11 osd.11
>> >>>
>> >>> # types
>> >>> type 0 osd
>> >>> type 1 host
>> >>> type 2 chassis
>> >>> type 3 rack
>> >>> type 4 row
>> >>> type 5 pdu
>> >>> type 6 pod
>> >>> type 7 room
>> >>> type 8 datacenter
>> >>> type 9 region
>> >>> type 10 root
>> >>>
>> >>> # buckets
>> >>> host osd01 {
>> >>>          id -2           # do not change unnecessarily
>> >>>          # weight 14.546
>> >>>          alg straw
>> >>>          hash 0  # rjenkins1
>> >>>          item osd.0 weight 3.636
>> >>>          item osd.1 weight 3.636
>> >>>          item osd.2 weight 3.636
>> >>>          item osd.3 weight 3.636
>> >>> }
>> >>> host osd02 {
>> >>>          id -3           # do not change unnecessarily
>> >>>          # weight 14.546
>> >>>          alg straw
>> >>>          hash 0  # rjenkins1
>> >>>          item osd.4 weight 3.636
>> >>>          item osd.5 weight 3.636
>> >>>          item osd.6 weight 3.636
>> >>>          item osd.7 weight 3.636
>> >>> }
>> >>> host osd03 {
>> >>>          id -4           # do not change unnecessarily
>> >>>          # weight 14.546
>> >>>          alg straw
>> >>>          hash 0  # rjenkins1
>> >>>          item osd.8 weight 3.636
>> >>>          item osd.9 weight 3.636
>> >>>          item osd.10 weight 3.636
>> >>>          item osd.11 weight 3.636
>> >>> }
>> >>> root default {
>> >>>          id -1           # do not change unnecessarily
>> >>>          # weight 43.637
>> >>>          alg straw
>> >>>          hash 0  # rjenkins1
>> >>>          item osd01 weight 14.546
>> >>>          item osd02 weight 14.546
>> >>>          item osd03 weight 14.546
>> >>> }
>> >>>
>> >>> # rules
>> >>> rule replicated_ruleset {
>> >>>          ruleset 0
>> >>>          type replicated
>> >>>          min_size 1
>> >>>          max_size 10
>> >>>          step take default
>> >>>          step chooseleaf firstn 0 type host
>> >>>          step emit
>> >>> }
>> >>>
>> >>> # end crush map
>> >>>
>> >>> I am not sure what the reason for the undersized state is. All osd disks
>> >>> are the same size and the replica size is 2. Also, data is only replicated
>> >>> on a per-host basis and I have 3 separate hosts. Maybe the number of pgs
>> >>> is incorrect? Is 1024 too big? Or maybe there is some misconfiguration in
>> >>> the crushmap?
>> >>>
>> >>>
>> >>> Kind regards,
>> >>> Piotr Dzionek
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> ceph-users@xxxxxxxxxxxxxx
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >>
>> >>
>> >
>> > --
>> > Piotr Dzionek
>> > System Administrator
>> >
>> > SEQR Poland Sp. z o.o.
>> > ul. Łąkowa 29, 90-554 Łódź, Poland
>> > Mobile: +48 796555587
>> > Mail: piotr.dzionek@xxxxxxxx
>> > www.seqr.com | www.seamless.se
>> >
>>
>>
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



