Hi Robert,

thanks for looking at this. The explanation is a different one, though.

Today I added disks to the second server, which was in exactly the same state as the other one reported below. I used this opportunity to do a modified reboot + OSD-adding sequence. To recall the situation: I added disks to a server and needed to reboot to get correct persistent device names assigned (the boot drive will move further down) and then deploy the new OSDs. This time, I wanted to see what happens if I revert the changes from adding the new OSDs and set the system back into the state before reboot+deploy. Here is what happened, using the full ceph command sequence (I omit other stuff):

2019-10-03 12:57 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_OK

- set cluster to maintenance mode:

 6975  2019-10-03 12:57  ceph osd set noout
 6976  2019-10-03 12:57  ceph osd set nobackfill
 6977  2019-10-03 12:57  ceph osd set norebalance

- reboot host
- redeploy OSDs
- wait for peering to finish

2019-10-03 13:08 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_ERR
            noout,nobackfill,norebalance flag(s) set
            17173162/145147628 objects misplaced (11.832%)
            Degraded data redundancy: 5865332/145147628 objects degraded (4.041%), 215 pgs degraded, 217 pgs undersized
            Degraded data redundancy (low space): 99 pgs backfill_toofull

- lots of degraded objects even though the "old" OSDs are all up
- starting to restore the pre-deploy situation by moving the new OSDs to a temporary bucket in the crush map
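The temporary bucket is just a host bucket that is not attached to the data root, so a crush rule rooted at default (which I assume here) never places PG shards on the OSDs parked there. A minimal sketch of how such a bucket can be set up; the name bb-17 is of course specific to my setup:

  ceph osd crush add-bucket bb-17 host   # create a host bucket with no parent, i.e. outside root=default
  ceph osd tree                          # bb-17 shows up as a detached entry; the moves below put the new OSDs under it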
 6983  2019-10-03 13:08  ceph osd unset noout
 6984  2019-10-03 13:08  ceph osd set noup
 6985  2019-10-03 13:08  ceph osd set noin
 6986  2019-10-03 13:09  ceph osd down 108 121 132 61 62 97 223 226
 6987  2019-10-03 13:09  ceph osd out 108 121 132 61 62 97 223 226
 6988  2019-10-03 13:10  ceph osd unset nobackfill
 6989  2019-10-03 13:10  ceph osd unset norebalance
...
 6992  2019-10-03 13:17  ceph osd unset noup
 6993  2019-10-03 13:18  ceph osd unset noin
...
 6997  2019-10-03 13:20  ceph osd set norebalance
 6998  2019-10-03 13:20  ceph osd set nobackfill
 6999  2019-10-03 13:20  ceph osd crush move osd.108 host=bb-17
 7000  2019-10-03 13:20  ceph osd crush move osd.121 host=bb-17
 7001  2019-10-03 13:21  ceph osd crush move osd.132 host=bb-17
 7002  2019-10-03 13:21  ceph osd crush move osd.61 host=bb-17
 7003  2019-10-03 13:21  ceph osd crush move osd.62 host=bb-17
 7004  2019-10-03 13:21  ceph osd crush move osd.97 host=bb-17
 7005  2019-10-03 13:21  ceph osd crush move osd.223 host=bb-17
 7006  2019-10-03 13:21  ceph osd crush move osd.226 host=bb-17
...
 7008  2019-10-03 13:21  ceph osd unset norebalance
 7009  2019-10-03 13:22  ceph osd unset nobackfill

- wait a bit

2019-10-03 13:22 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_OK

- apparently, it is the new OSDs in the crush map that prevent ceph from finding the shards on the old OSDs
- now add the disks by restarting the OSDs; they will move themselves back to their correct permanent crush location

 7013  2019-10-03 13:22  ceph osd set norebalance
 7014  2019-10-03 13:23  ceph osd set nobackfill
 7016  2019-10-03 13:23  ssh ceph-17 docker start osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
 7017  2019-10-03 13:24  ceph osd in 108 121 132 61 62 97 223 226

- wait for peering to finish

 7018  2019-10-03 13:24  ceph osd unset norebalance
 7019  2019-10-03 13:24  ceph osd unset nobackfill

- we end up with rebalancing, as it should be, and no degraded objects:

2019-10-03 13:24 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_ERR
            23045945/145214167 objects misplaced (15.870%)
            Degraded data redundancy (low space): 33 pgs backfill_toofull

The backfill_toofull message is still annoying, but, contrary to the health report, there is no lack of redundancy: all PGs have a complete acting set.

It seems that the order of operations makes the difference. I was under the impression that ceph would first search all (relevant?) OSDs for missing objects before starting a rebuild. This is apparently not the case. Two days ago, I did not help ceph find the missing shards and ended up with a lot of partial data loss (one out of 2 parity shards on many PGs). This time I basically told ceph explicitly where to look, and things went the way I expected them to go without manual intervention. Good to know for next time.
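For next time, the whole procedure condenses to roughly the following sketch (not my literal shell history; the OSD ids, the host ceph-17, the container names and the parking bucket bb-17 are specific to this server):

  # keep data movement off while reshuffling
  ceph osd set norebalance
  ceph osd set nobackfill

  # stop the freshly deployed OSDs again, mark them down+out and park them
  # outside the data root, so peering finds all shards on the old OSDs
  ssh ceph-17 docker stop osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
  ceph osd down 108 121 132 61 62 97 223 226
  ceph osd out 108 121 132 61 62 97 223 226
  ceph osd crush move osd.108 host=bb-17   # ...repeat for each of the new OSDs

  ceph osd unset nobackfill
  ceph osd unset norebalance
  # wait for peering: the cluster should return to HEALTH_OK with only the old OSDs

  # now add the new OSDs for real; on start they move themselves back under
  # their real host bucket (osd_crush_update_on_start defaults to true)
  ceph osd set norebalance
  ceph osd set nobackfill
  ssh ceph-17 docker start osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
  ceph osd in 108 121 132 61 62 97 223 226
  # wait for peering to finish, then let backfill/rebalance run
  ceph osd unset norebalance
  ceph osd unset nobackfill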
Best regards and thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Robert LeBlanc <robert@xxxxxxxxxxxxx>
Sent: 01 October 2019 17:13:41
To: Frank Schilder
Cc: ceph-users
Subject: Re: Objects degraded after adding disks

On Tue, Oct 1, 2019 at 5:25 AM Frank Schilder <frans@xxxxxx> wrote:
>
> I'm running a ceph fs with an 8+2 EC data pool. Disks are on 10 hosts and the failure domain is host. Version is mimic 13.2.2. Today I added a few OSDs to one of the hosts and observed that a lot of PGs became inactive even though 9 out of 10 hosts were up all the time. After getting the 10th host and all disks up, I still ended up with a large number of undersized PGs and degraded objects, which I don't understand, as no OSD was removed.
>
> Here are some details about the steps taken on the host with the new disks, main questions at the end:
>
> - shut down OSDs (systemctl stop docker)
> - reboot host (this is necessary due to OS deployment via warewulf)
>
> Devices got renamed and not all disks came back up (4 OSDs remained down). This is expected; I need to re-deploy the containers to adjust for device name changes. Around this point PGs started peering and some failed, waiting for 1 of the down OSDs. I don't understand why they didn't just remain active with 9 out of 10 disks. Until this moment of some OSDs coming up, all PGs were active. With min_size=9 I would expect all PGs to remain active with no changes to 9 out of the 10 hosts.
>
> - redeploy docker containers
> - all disks/OSDs come up, including the 4 OSDs from above
> - inactive PGs complete peering and become active
> - now I have a lot of degraded objects and undersized PGs even though not a single OSD was removed
>
> I don't understand why I have degraded objects. I should just have misplaced objects:
>
> HEALTH_ERR
> 22995992/145698909 objects misplaced (15.783%)
> Degraded data redundancy: 5213734/145698909 objects degraded (3.578%), 208 pgs degraded, 208 pgs undersized
> Degraded data redundancy (low space): 169 pgs backfill_toofull
>
> Note: The backfill_toofull with low utilization (usage: 38 TiB used, 1.5 PiB / 1.5 PiB avail) is a known issue in ceph (https://tracker.ceph.com/issues/39555)
>
> Also, I should be able to do whatever with 1 out of 10 hosts without losing data access. What could be the problem here?
>
> Questions summary:
>
> Why does peering not succeed in keeping all PGs active with 9 out of 10 OSDs up and in?

I would just double check that min_size=9 for your pool. It should be set to that, but that is the only reason I can think of that you are seeing this problem.

> Why do undersized PGs arise even though all OSDs are up?

I've noticed on my cluster that sometimes when an OSD goes down, the EC pool considers the OSD missing when it comes back online and it needs to resync. Not sure what exactly causes this to happen, but it happens more often than it should.

> Why do degraded objects arise even though no OSD was removed?

If you are writing objects while the PGs are undersized (host/OSDs down), then it will have to sync those writes to the OSDs that were down. This is the number of degraded objects.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx