Hello,

On Wed, 30 Nov 2016 13:39:50 +1000 Brad Hubbard wrote:

>
> On Tue, Nov 29, 2016 at 11:37 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx> wrote:
> > Hi,
> >
> > As far as I understand, if I set pool size 2 there is a chance to lose
> > data when another osd dies while a rebuild is ongoing. However, it has
> > to occur on a different host, because my crushmap forbids storing
> > replicas on the same physical node.
>
> I am not talking about size, I am talking specifically about min_size,
> regardless of size.
>
> > I am not sure what would change if I set min_size 2, because the only
> > thing I would get is that there is no IO to objects with less than 2
> > replicas while a rebuild is ongoing. And in that case
>
> Which is exactly what you should want if you are concerned about data
> integrity. If you allow IO to be served by a single OSD via min_size=1
> and you then lose that OSD, you will lose any changes to those pgs,
> resulting in inconsistency and missing objects. You are removing the
> consistency guarantees that ceph provides. So if you use min_size=1 you
> need to be comfortable with the fact that there is a likelihood you
> *will* lose data at some stage. Usually this is not what people
> implementing ceph want.
>
I'm using size=2 and min_size=1 on our main cluster. However that one has
OSDs that are backed by RAID10s (so essentially a replication of 4), thus
losing an OSD (in the permanent disk failure sense) is very, VERY unlikely.

Choosing size=2 for any OSDs that are backed by plain disks is shouting
for Murphy to come and punish you for your insolence.
A min_size of 1 just gives him another chance to smite you.

There used to be a corner case where, with a size of 3 and a min_size of 2,
a dual disk failure would result in blocked I/Os despite things being
remapped already and new writes going to healthy OSDs.
So people did set min_size to 1 to cover for that.
AFAIK this has been addressed, but confirmation would be nice.
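For completeness: size and min_size are plain pool settings, so you can
inspect and raise them at runtime. Roughly like this (commands from memory,
double-check them against your 10.2.x documentation; 'data' is the pool
name from your dump below):

  # show the current values
  ceph osd pool get data size
  ceph osd pool get data min_size

  # require two complete copies before serving IO
  ceph osd pool set data min_size 2

With size=2 and min_size=2, a PG that is down to a single copy will block
IO instead of accepting writes, which is exactly the "no IO during rebuild"
behaviour discussed above, and also what protects the surviving copy.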
Christian

> > my vms wouldn't be able to read data from the ceph pool. But maybe I
> > got it wrong.
> >
> >
> > On 29.11.2016 at 03:08, Brad Hubbard wrote:
> >
> >>
> >> On Mon, Nov 28, 2016 at 9:54 PM, Piotr Dzionek <piotr.dzionek@xxxxxxxx>
> >> wrote:
> >>>
> >>> Hi,
> >>> I recently installed a 3-node ceph cluster v.10.2.3. It has 3 mons
> >>> and 12 osds. I removed the default pool and created the following one:
> >>>
> >>> pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> >>> rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool
> >>> stripe_width 0
> >>
> >> Do you understand the significance of min_size 1?
> >>
> >> Are you OK with the likelihood of data loss that this value introduces?
> >>
> >>> Cluster is healthy if all osds are up, however if I stop any of the
> >>> osds it becomes stuck and undersized - it is not rebuilding.
> >>>
> >>>     cluster *****
> >>>      health HEALTH_WARN
> >>>             166 pgs degraded
> >>>             108 pgs stuck unclean
> >>>             166 pgs undersized
> >>>             recovery 67261/827220 objects degraded (8.131%)
> >>>             1/12 in osds are down
> >>>      monmap e3: 3 mons at
> >>> {**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*****.146:6789/0}
> >>>             election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
> >>>      osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
> >>>             flags sortbitwise
> >>>       pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
> >>>             2452 GB used, 42231 GB / 44684 GB avail
> >>>             67261/827220 objects degraded (8.131%)
> >>>                  858 active+clean
> >>>                  166 active+undersized+degraded
> >>>
> >>> Replica size is 2 and I use the following crushmap:
> >>>
> >>> # begin crush map
> >>> tunable choose_local_tries 0
> >>> tunable choose_local_fallback_tries 0
> >>> tunable choose_total_tries 50
> >>> tunable chooseleaf_descend_once 1
> >>> tunable chooseleaf_vary_r 1
> >>> tunable straw_calc_version 1
> >>>
> >>> # devices
> >>> device 0 osd.0
> >>> device 1 osd.1
> >>> device 2 osd.2
> >>> device 3 osd.3
> >>> device 4 osd.4
> >>> device 5 osd.5
> >>> device 6 osd.6
> >>> device 7 osd.7
> >>> device 8 osd.8
> >>> device 9 osd.9
> >>> device 10 osd.10
> >>> device 11 osd.11
> >>>
> >>> # types
> >>> type 0 osd
> >>> type 1 host
> >>> type 2 chassis
> >>> type 3 rack
> >>> type 4 row
> >>> type 5 pdu
> >>> type 6 pod
> >>> type 7 room
> >>> type 8 datacenter
> >>> type 9 region
> >>> type 10 root
> >>>
> >>> # buckets
> >>> host osd01 {
> >>>         id -2           # do not change unnecessarily
> >>>         # weight 14.546
> >>>         alg straw
> >>>         hash 0          # rjenkins1
> >>>         item osd.0 weight 3.636
> >>>         item osd.1 weight 3.636
> >>>         item osd.2 weight 3.636
> >>>         item osd.3 weight 3.636
> >>> }
> >>> host osd02 {
> >>>         id -3           # do not change unnecessarily
> >>>         # weight 14.546
> >>>         alg straw
> >>>         hash 0          # rjenkins1
> >>>         item osd.4 weight 3.636
> >>>         item osd.5 weight 3.636
> >>>         item osd.6 weight 3.636
> >>>         item osd.7 weight 3.636
> >>> }
> >>> host osd03 {
> >>>         id -4           # do not change unnecessarily
> >>>         # weight 14.546
> >>>         alg straw
> >>>         hash 0          # rjenkins1
> >>>         item osd.8 weight 3.636
> >>>         item osd.9 weight 3.636
> >>>         item osd.10 weight 3.636
> >>>         item osd.11 weight 3.636
> >>> }
> >>> root default {
> >>>         id -1           # do not change unnecessarily
> >>>         # weight 43.637
> >>>         alg straw
> >>>         hash 0          # rjenkins1
> >>>         item osd01 weight 14.546
> >>>         item osd02 weight 14.546
> >>>         item osd03 weight 14.546
> >>> }
> >>>
> >>> # rules
> >>> rule replicated_ruleset {
> >>>         ruleset 0
> >>>         type replicated
> >>>         min_size 1
> >>>         max_size 10
> >>>         step take default
> >>>         step chooseleaf firstn 0 type host
> >>>         step emit
> >>> }
> >>>
> >>> # end crush map
> >>>
> >>> I am not sure what the reason for the undersized state is. All osd
> >>> disks are the same size and replica size is 2. Also, data is only
> >>> replicated on a per-host basis and I have 3 separate hosts. Maybe the
> >>> number of pgs is incorrect? Is 1024 too big? Or maybe there is some
> >>> misconfiguration in the crushmap?
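As for pg_num: the usual rule of thumb is (number of OSDs * 100) / size,
rounded to a power of two, so 12 * 100 / 2 = 600, i.e. 512 or 1024 PGs.
1024 on 12 OSDs is a bit on the high side, but an oversized pg_num does
not by itself leave PGs undersized.

If you want to rule out the crushmap, you can test it offline. Roughly
like this (options from memory, adjust the file name as you see fit):

  # grab the compiled crushmap and test rule 0 with 2 replicas
  ceph osd getcrushmap -o crush.bin
  crushtool -i crush.bin --test --rule 0 --num-rep 2 --show-mappings | head
  crushtool -i crush.bin --test --rule 0 --num-rep 2 --show-bad-mappings

If --show-bad-mappings prints nothing, the rule can always place 2 copies
on 2 different hosts and the crushmap itself is fine.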
> >>>
> >>> Kind regards,
> >>> Piotr Dzionek
> >>
> >
> > --
> > Piotr Dzionek
> > System Administrator
> >
> > SEQR Poland Sp. z o.o.
> > ul. Łąkowa 29, 90-554 Łódź, Poland
> > Mobile: +48 796555587
> > Mail: piotr.dzionek@xxxxxxxx
> > www.seqr.com | www.seamless.se
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com