I did configure the following during my initial setup:

osd pool default size = 3

root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool stripe_width 0
pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags hashpspool stripe_width 0
pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags hashpspool stripe_width 0

After adding 3 more OSDs, I see data is being replicated to the new OSDs :) and the near full OSD warning is gone.

recovery some hours ago:
>> recovery 389973/3096070 objects degraded (12.596%)
>> recovery 1258984/3096070 objects misplaced (40.664%)

recovery now:
recovery 8917/3217724 objects degraded (0.277%)
recovery 1120479/3217724 objects misplaced (34.822%)

On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
>
>> more questions and I hope you don't mind:
>>
>> My understanding is that if I have 3 hosts with 5 OSDs each and 1 host
>> goes down, Ceph should not replicate to the OSDs that are down.
>>
> How could it replicate to something that is down?
>
>> When the host comes up, only then will the replication commence, right?
>>
> Depends on your configuration.
>
>> If only 1 OSD out of 5 comes up, then only data meant for that OSD
>> should be copied to it? If so, then why do PGs get full if they
>> were not full before the OSD went down?
>>
> Definitely not.
>
> You need to understand how CRUSH maps, rules and replication work.
>
> By default, pools with Hammer and higher will have a replication size
> of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
> you need at least 3 hosts with those default settings.
>
> So with these defaults Ceph would indeed have done nothing in a 3 node
> cluster if one node had gone down.
> It needs to put replicas on different nodes, but only 2 are available.
>
> However, given what happened to your cluster, it is obvious that your pools
> most likely have a replication size of 2.
> Check with
> ceph osd dump | grep "replicated size"
>
> In that case Ceph will try to recover and restore 2 replicas (original and
> copy), resulting in what you're seeing.
>
> Christian
>
>> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela <ishmaelt3@xxxxxxxxx> wrote:
>> > Thank you again.
>> >
>> > I will add 3 more OSDs today and leave the cluster untouched, maybe over the weekend.
>> >
>> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
>> >>
>> >>> thanks for the response
>> >>>
>> >>> > You really will want to spend more time reading documentation and this ML,
>> >>> > as well as using google to (re-)search things.
>> >>>
>> >>> I did do some reading on the errors but cannot understand why they do
>> >>> not clear even after so long.
>> >>>
>> >>> > In your previous mail you already mentioned a 92% full OSD; that should,
>> >>> > combined with the various "full" warnings, have impressed on you the need
>> >>> > to address this issue.
>> >>>
>> >>> > When your nodes all rebooted, did everything come back up?
>> >>>
>> >>> One host with 5 OSDs was down and came up later.
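
Going back to the replication settings at the top of this mail: a minimal sketch of how the same per-pool settings can be read back directly (the pool name 'vmimages' is taken from my dump above; "ceph osd crush rule dump" with no argument lists all rules, so the failure domain can be confirmed without guessing a rule name):

  # how many copies the pool keeps, and how many must be available to serve I/O
  ceph osd pool get vmimages size
  ceph osd pool get vmimages min_size

  # list the CRUSH rules; the chooseleaf step shows the failure domain (host by default)
  ceph osd crush rule dump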
>> >>>
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> > time?
>> >>>
>> >>> about 7 hours
>> >>>
>> >> OK, so in those 7 hours (with 1/3rd of your cluster down), Ceph tried to
>> >> restore redundancy, but did not have enough space to do so and got itself
>> >> stuck in a corner.
>> >>
>> >> Lesson here is:
>> >> a) have enough space to cover the loss of one node (rack, etc.), or
>> >> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
>> >> can recover a failed node before re-balancing starts.
>> >>
>> >> Of course b) assumes that you have 24/7 monitoring and access to your
>> >> cluster, so that restoring a failed node is likely faster than
>> >> re-balancing the data.
>> >>
>> >>> True
>> >>>
>> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> >>> > full for that.
>> >>> > And until something changes it will be stuck there.
>> >>> > Your best bet is to add more OSDs, since you seem to be short on space
>> >>> > anyway. Or delete unneeded data.
>> >>> > Given your level of experience, I'd advise against playing with weights
>> >>> > and the respective "full" configuration options.
>> >>>
>> >>> I did reweight some OSDs but everything is back to normal. No config
>> >>> changes to the "full" options.
>> >>>
>> >>> I deleted about 900G this morning and prepared 3 OSDs; should I add them now?
>> >>>
>> >> More OSDs will both make things less likely to get full again and give the
>> >> nearfull OSDs a place to move data to.
>> >>
>> >> However, they will also cause more data movement, so if your cluster is
>> >> busy, maybe do that during the night or weekend.
>> >>
>> >>> > Are these numbers and the recovery io below still changing, moving along?
>> >>>
>> >>> original email:
>> >>>
>> >>> > recovery 493335/3099981 objects degraded (15.914%)
>> >>> > recovery 1377464/3099981 objects misplaced (44.435%)
>> >>>
>> >>> current email:
>> >>>
>> >>> recovery 389973/3096070 objects degraded (12.596%)
>> >>> recovery 1258984/3096070 objects misplaced (40.664%)
>> >>>
>> >> So there is progress, it may recover by itself after all.
>> >>
>> >> Looking at your "df" output, only 7 OSDs seem to be nearfull now, is that
>> >> correct?
>> >>
>> >> If so, definitely progress, it's just taking a lot of time to recover.
>> >>
>> >> If the progress should stop before the cluster can get healthy again,
>> >> write another mail with "ceph -s" and so forth for us to peruse.
>> >>
>> >> Christian
>> >>
>> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>> >>>
>> >>> yes
>> >>>
>> >>> > Output from "ceph osd df" and "ceph osd tree" please.
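
Regarding Christian's option (b) above, a quick sketch of where that setting would go; the [mon] section placement and the runtime injectargs form are my assumptions, the option name is the one he gave. The "ceph osd df" and "ceph osd tree" output he asked for follows below.

  # ceph.conf on the monitor nodes, so a whole failed host is not marked
  # "out" and re-balanced automatically:
  #   [mon]
  #   mon osd down out subtree limit = host
  #
  # or, tentatively, push it into the running monitors:
  ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=host'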
>> >>>
>> >>> ID WEIGHT  REWEEIGHT SIZE   USE     AVAIL %USE  VAR  PGS
>> >>>  3 0.90868  1.00000  930G    232G   698G 24.96 0.40 105
>> >>>  5 0.90868  1.00000  930G    139G   791G 14.99 0.24 139
>> >>>  6 0.90868  1.00000  930G  61830M   870G  6.49 0.10 138
>> >>>  0 0.90868  1.00000  930G    304G   625G 32.76 0.53 128
>> >>>  2 0.90868  1.00000  930G  24253M   906G  2.55 0.04 130
>> >>>  1 0.90868  1.00000  930G    793G   137G 85.22 1.37 162
>> >>>  4 0.90868  1.00000  930G    790G   140G 84.91 1.36 160
>> >>>  7 0.90868  1.00000  930G    803G   127G 86.34 1.39 144
>> >>> 10 0.90868  1.00000  930G    792G   138G 85.16 1.37 145
>> >>> 13 0.90868  1.00000  930G    811G   119G 87.17 1.40 163
>> >>> 15 0.90869  1.00000  930G    794G   136G 85.37 1.37 157
>> >>> 16 0.90869  1.00000  930G    757G   172G 81.45 1.31 159
>> >>> 17 0.90868  1.00000  930G    800G   129G 86.06 1.38 144
>> >>> 18 0.90869  1.00000  930G    786G   144G 84.47 1.36 166
>> >>> 19 0.90868  1.00000  930G    793G   137G 85.26 1.37 160
>> >>>               TOTAL 13958G   8683G  5274G 62.21
>> >>> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
>> >>>
>> >>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> >>> -1 13.63019 root default
>> >>> -2  4.54338     host nodeB
>> >>>  3  0.90868         osd.3       up  1.00000          1.00000
>> >>>  5  0.90868         osd.5       up  1.00000          1.00000
>> >>>  6  0.90868         osd.6       up  1.00000          1.00000
>> >>>  0  0.90868         osd.0       up  1.00000          1.00000
>> >>>  2  0.90868         osd.2       up  1.00000          1.00000
>> >>> -3  4.54338     host nodeC
>> >>>  1  0.90868         osd.1       up  1.00000          1.00000
>> >>>  4  0.90868         osd.4       up  1.00000          1.00000
>> >>>  7  0.90868         osd.7       up  1.00000          1.00000
>> >>> 10  0.90868         osd.10      up  1.00000          1.00000
>> >>> 13  0.90868         osd.13      up  1.00000          1.00000
>> >>> -6  4.54343     host nodeD
>> >>> 15  0.90869         osd.15      up  1.00000          1.00000
>> >>> 16  0.90869         osd.16      up  1.00000          1.00000
>> >>> 17  0.90868         osd.17      up  1.00000          1.00000
>> >>> 18  0.90869         osd.18      up  1.00000          1.00000
>> >>> 19  0.90868         osd.19      up  1.00000          1.00000
>> >>>
>> >>> On Thu, Sep 1, 2016 at 10:56 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >>> >
>> >>> > Hello,
>> >>> >
>> >>> > On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:
>> >>> >
>> >>> > > Hi All,
>> >>> > >
>> >>> > > Can someone please decipher these errors for me? After all nodes rebooted in
>> >>> > > my cluster on Monday, the warning has not gone.
>> >>> > >
>> >>> > You really will want to spend more time reading documentation and this ML,
>> >>> > as well as using google to (re-)search things.
>> >>> > Like searching for "backfill_toofull", "near full", etc.
>> >>> >
>> >>> > > Will the warning ever clear?
>> >>> > >
>> >>> > Unlikely.
>> >>> >
>> >>> > In your previous mail you already mentioned a 92% full OSD; that should,
>> >>> > combined with the various "full" warnings, have impressed on you the need
>> >>> > to address this issue.
>> >>> >
>> >>> > When your nodes all rebooted, did everything come back up?
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> > time?
>> >>> > My guess is that some nodes/OSDs were restarted a lot later than others.
>> >>> >
>> >>> > See inline:
>> >>> > >
>> >>> > >     cluster df3f96d8-3889-4baa-8b27-cc2839141425
>> >>> > >      health HEALTH_WARN
>> >>> > >             2 pgs backfill_toofull
>> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> >>> > full for that.
>> >>> > And until something changes it will be stuck there.
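
For anyone else hitting backfill_toofull, a rough sketch of how the affected PGs and OSDs can be identified (all stock commands; the PG id 4.1f is only a placeholder, not one of my real PGs):

  # names the backfill_toofull PGs and the near full OSDs
  ceph health detail

  # cross-check per-OSD utilisation and which PGs are stuck
  ceph osd df
  ceph pg dump_stuck unclean

  # inspect a single reported PG (replace 4.1f with an id from health detail)
  ceph pg 4.1f query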
>> >>> >
>> >>> > Your best bet is to add more OSDs, since you seem to be short on space
>> >>> > anyway. Or delete unneeded data.
>> >>> > Given your level of experience, I'd advise against playing with weights
>> >>> > and the respective "full" configuration options.
>> >>> >
>> >>> > >             532 pgs backfill_wait
>> >>> > >             3 pgs backfilling
>> >>> > >             330 pgs degraded
>> >>> > >             537 pgs stuck unclean
>> >>> > >             330 pgs undersized
>> >>> > >             recovery 493335/3099981 objects degraded (15.914%)
>> >>> > >             recovery 1377464/3099981 objects misplaced (44.435%)
>> >>> > Are these numbers and the recovery io below still changing, moving along?
>> >>> >
>> >>> > >             8 near full osd(s)
>> >>> > 8 out of 15, definitely needs more OSDs.
>> >>> > Output from "ceph osd df" and "ceph osd tree" please.
>> >>> >
>> >>> > >      monmap e7: 3 mons at {Monitors}
>> >>> > >             election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
>> >>> > >      osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs
>> >>> >
>> >>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>> >>> >
>> >>> > Christian
>> >>> >
>> >>> > >             flags sortbitwise
>> >>> > >       pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
>> >>> > >             8242 GB used, 5715 GB / 13958 GB avail
>> >>> > >             493335/3099981 objects degraded (15.914%)
>> >>> > >             1377464/3099981 objects misplaced (44.435%)
>> >>> > >                  327 active+undersized+degraded+remapped+wait_backfill
>> >>> > >                  205 active+remapped+wait_backfill
>> >>> > >                  103 active+clean
>> >>> > >                    3 active+undersized+degraded+remapped+backfilling
>> >>> > >                    2 active+remapped+backfill_toofull
>> >>> > > recovery io 367 MB/s, 96 objects/s
>> >>> > >   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Christian Balzer        Network/Systems Engineer
>> >>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> >>> > http://www.gol.com/
>> >>
>> >>
>> >> --
>> >> Christian Balzer        Network/Systems Engineer
>> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> >> http://www.gol.com/
>> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
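
One last note: the three new OSDs are still backfilling while clients are active, so this is the sort of throttling I am considering for daytime hours; the values are only illustrative and not something recommended in this thread:

  # reduce concurrent backfills and recovery ops per OSD at runtime
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

  # watch the effect on recovery and client io
  ceph -s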