Re: - cluster stuck and undersized if at least one osd is down

Piotr Dzionek <piotr.dzionek@xxxxxxxx> · Thu, 1 Dec 2016 18:16:26 +0100

Ok, you convinced me to increase size to 3 and min_size to 2. During my 
time running ceph I only had issues like single disk or host failures - 
nothing exotic, but I think it is better to be safe than sorry.

Kind regards,
Piotr Dzionek

On 30.11.2016 at 12:16, Nick Fisk wrote:
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Piotr Dzionek
Sent: 30 November 2016 11:04
To: Brad Hubbard <bhubbard@xxxxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  - cluster stuck and undersized if at least one osd is down

Hi,

OK, but I still don't see what advantage I would get from blocked I/O.
If I set size=2 and min_size=2 and another disk dies on the other node during a rebuild, I will lose data. I know that I should set
size=3; it is much safer. But I don't see what the advantage of blocked I/O is.
Do you mean a faster rebuild? Or that with no I/O the likelihood of another disk failure drops?

You need to think about exotic scenarios where an OSD may fail for reasons other than a straight disk failure. Probably the
most dangerous times are when OSDs start flapping for whatever reason. You can quickly get into a situation where an object is
updated on a single remaining OSD. If this OSD now goes down and the other copy comes back up, Ceph will mark the object unfound,
as that copy is not the latest version of the object. If for whatever reason the OSD with the latest copy is no longer available,
you now have data loss.
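
To make the sequence above concrete, here is a minimal Python sketch of it. It is a toy model, not Ceph code: the Replica class, write() and flap_scenario() are hypothetical names made up for illustration, and the model assumes a size=2 pool holding a single object.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy stand-in for one OSD's copy of a single object (hypothetical)."""
    name: str
    up: bool = True
    version: int = 0


def write(replicas, min_size):
    """Accept a write only if at least min_size replicas are up."""
    live = [r for r in replicas if r.up]
    if len(live) < min_size:
        return False  # I/O blocks; nothing is written anywhere
    new_version = max(r.version for r in replicas) + 1
    for r in live:
        r.version = new_version
    return True


def flap_scenario(min_size):
    """Replay the flapping sequence on a size=2 pool.

    Returns (write_accepted_while_degraded, object_unfound).
    """
    a, b = Replica("osd.a"), Replica("osd.b")
    replicas = [a, b]
    write(replicas, min_size)             # both replicas up
    b.up = False                          # osd.b flaps down
    accepted = write(replicas, min_size)  # only osd.a could take this write
    a.up = False                          # osd.a goes down
    b.up = True                           # osd.b returns with its old copy
    latest = max(r.version for r in replicas)
    reachable = max(r.version for r in replicas if r.up)
    return accepted, latest > reachable


for min_size in (1, 2):
    accepted, unfound = flap_scenario(min_size)
    print(f"min_size={min_size}: degraded write accepted={accepted}, "
          f"object unfound={unfound}")
```

In this model, min_size=1 accepts the write on the lone replica and the object ends up unfound once that replica disappears, while min_size=2 blocks the write and the returning copy is still current.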

There are several mailing list posts and blogs where people have had multi-day outages and data loss caused by this situation. A
couple of weeks ago I had 2 OSDs die: deleting a large snapshot somehow left them in an inconsistent state after
OSDs all over the cluster had been flapping. If I had been using min_size=1, I know I would have lost objects.

I'm not convinced that size=2, min_size=1 is completely safe on OSDs over RAID either. Whilst you are protected from a disk
failure, flapping or an OSD bug is still just as likely to cause data loss.

On 30.11.2016 at 04:39, Brad Hubbard wrote:
On Tue, Nov 29, 2016 at 11:37 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx> wrote:
Hi,

As far as I understand, if I set pool size 2, there is a chance of
losing data when another OSD dies while a rebuild is ongoing.
However, it has to occur on a different host, because my crushmap
forbids storing replicas on the same physical node.
I am not talking about size, I am talking specifically about min_size,
regardless of size.

I am not sure what would change if I set min_size 2, because the only
thing I would get is that there would be no I/O to objects with fewer than 2
replicas while a rebuild is ongoing. And in that case
Which is exactly what you should want if you are concerned about data
integrity. If you allow I/O to be served by a single OSD via
min_size=1 and you then lose that OSD, you will lose any changes to the
PGs, resulting in inconsistency and missing objects. You are removing
the consistency guarantees that Ceph provides. So if you use
min_size=1 you need to be comfortable with the fact that there is a
likelihood you *will* lose data at some stage. Usually this is not what
people implementing Ceph want.

my VMs wouldn't be able to read data from the Ceph pool. But maybe I got
it wrong.

On 29.11.2016 at 03:08, Brad Hubbard wrote:

On Mon, Nov 28, 2016 at 9:54 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx> wrote:
Hi,
I recently installed a 3-node Ceph cluster, v10.2.3. It has 3 mons
and 12 OSDs. I removed the default pool and created the following one:

pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool stripe_width 0

Do you understand the significance of min_size 1?

Are you OK with the likelihood of data loss that this value introduces?

The cluster is healthy if all OSDs are up; however, if I stop any of the
OSDs, it becomes stuck and undersized - it is not rebuilding.

       cluster *****
        health HEALTH_WARN
               166 pgs degraded
               108 pgs stuck unclean
               166 pgs undersized
               recovery 67261/827220 objects degraded (8.131%)
               1/12 in osds are down
        monmap e3: 3 mons at
{**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*****.146:6789/0}
               election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
        osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
               flags sortbitwise
         pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
               2452 GB used, 42231 GB / 44684 GB avail
               67261/827220 objects degraded (8.131%)
                    858 active+clean
                    166 active+undersized+degraded
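
As a sanity check on the figures in this status output, the short Python sketch below recomputes them. The numbers are copied from the output above; the function and variable names are made up for illustration. The interpretation in the comments (one of two copies missing in each affected PG) is an assumption based on size=2 with a single OSD down.

```python
# Figures copied from the status output above.
degraded_copies, total_copies = 67261, 827220
pgs_clean, pgs_degraded, pg_num = 858, 166, 1024
num_osds, pool_size = 12, 2


def percent(part, whole, digits=3):
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


# Matches the "objects degraded (8.131%)" line.
print(percent(degraded_copies, total_copies))

# All PGs are accounted for by the two listed states.
print(pgs_clean + pgs_degraded == pg_num)

# Share of PGs that are degraded, versus the share expected to hold
# a copy on any one of the 12 OSDs when each PG has 2 copies.
print(percent(pgs_degraded, pg_num, 1), percent(pool_size, num_osds, 1))
```

The degraded share of PGs is close to 2/12, which is roughly what one would expect when a single OSD out of 12 stops in a size=2 pool, and the degraded object percentage is about half of that because each affected PG still has one of its two copies.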

Replica size is 2 and I use the following crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd01 {
           id -2           # do not change unnecessarily
           # weight 14.546
           alg straw
           hash 0  # rjenkins1
           item osd.0 weight 3.636
           item osd.1 weight 3.636
           item osd.2 weight 3.636
           item osd.3 weight 3.636
}
host osd02 {
           id -3           # do not change unnecessarily
           # weight 14.546
           alg straw
           hash 0  # rjenkins1
           item osd.4 weight 3.636
           item osd.5 weight 3.636
           item osd.6 weight 3.636
           item osd.7 weight 3.636
}
host osd03 {
           id -4           # do not change unnecessarily
           # weight 14.546
           alg straw
           hash 0  # rjenkins1
           item osd.8 weight 3.636
           item osd.9 weight 3.636
           item osd.10 weight 3.636
           item osd.11 weight 3.636
}
root default {
           id -1           # do not change unnecessarily
           # weight 43.637
           alg straw
           hash 0  # rjenkins1
           item osd01 weight 14.546
           item osd02 weight 14.546
           item osd03 weight 14.546
}

# rules
rule replicated_ruleset {
           ruleset 0
           type replicated
           min_size 1
           max_size 10
           step take default
           step chooseleaf firstn 0 type host
           step emit
}

# end crush map

I am not sure what the reason for the undersized state is. All OSD
disks are the same size and replica size is 2. Also, data is only
replicated on a per-host basis and I have 3 separate hosts. Maybe
the number of PGs is incorrect? Is 1024 too big? Or maybe there is
some misconfiguration in the crushmap?
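
On the question of whether 1024 is too big, a commonly cited rule of thumb is about 100 PGs per OSD divided by the replica count, rounded up to a power of two. The Python sketch below applies that rule to this cluster. The function names and the default target of 100 are assumptions for illustration, and the result is only a starting point, not a definitive sizing.

```python
def suggested_pg_num(num_osds, pool_size, target_per_osd=100):
    """Rule-of-thumb PG count: OSDs * target / size, rounded up to a power of two."""
    raw = num_osds * target_per_osd / pool_size
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num


def pg_copies_per_osd(pg_num, pool_size, num_osds):
    """Average number of PG copies each OSD ends up holding."""
    return pg_num * pool_size / num_osds


# 12 OSDs with size=2, as in the pool described above.
print(suggested_pg_num(12, 2))
# The same cluster if the pool were size=3.
print(suggested_pg_num(12, 3))
# Average PG copies per OSD with pg_num=1024 and size=2.
print(round(pg_copies_per_osd(1024, 2, 12), 1))
```

Under this rule of thumb, 1024 is what comes out for 12 OSDs at size=2, so the PG count by itself does not look like the cause of the undersized state.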

Kind regards,
Piotr Dzionek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Piotr Dzionek
System Administrator

SEQR Poland Sp. z o.o.
ul. Łąkowa 29, 90-554 Łódź, Poland
Mobile: +48 796555587
Mail: piotr.dzionek@xxxxxxxx
www.seqr.com | www.seamless.se
