Hi Robert,

thanks for looking at this. The explanation is a different one, though.

Today I added disks to the second server, which was in exactly the same state as the other one reported below. I used this opportunity to do a modified reboot + OSD-adding sequence. To recall the situation: I added disks to a server and needed to reboot to get correct persistent device names assigned (the boot drive will move further down) and then deploy the new OSDs. This time, I wanted to see what happens if I revert the changes from adding the new OSDs and set the system back into the state before reboot+deploy. Here is what happened, using the full ceph command sequence (I omit other stuff):

2019-10-03 12:57 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_OK

- set cluster to maintenance mode:

 6975  2019-10-03 12:57  ceph osd set noout
 6976  2019-10-03 12:57  ceph osd set nobackfill
 6977  2019-10-03 12:57  ceph osd set norebalance

- reboot host
- redeploy OSDs
- wait for peering to finish

2019-10-03 13:08 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_ERR
            noout,nobackfill,norebalance flag(s) set
            17173162/145147628 objects misplaced (11.832%)
            Degraded data redundancy: 5865332/145147628 objects degraded (4.041%), 215 pgs degraded, 217 pgs undersized
            Degraded data redundancy (low space): 99 pgs backfill_toofull

- lots of degraded objects even though the "old" OSDs are all up
- starting to restore the pre-deploy situation by moving the new OSDs to a temporary bucket in the crush map
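The temporary bucket is just a host bucket that is not attached to the data root, so a crush rule rooted at default (which I assume here) never places PG shards on the OSDs parked there. A minimal sketch of how such a bucket can be set up; the name bb-17 is of course specific to my setup:

  ceph osd crush add-bucket bb-17 host   # create a host bucket with no parent, i.e. outside root=default
  ceph osd tree                          # bb-17 shows up as a detached entry; the moves below put the new OSDs under it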
 6983  2019-10-03 13:08  ceph osd unset noout
 6984  2019-10-03 13:08  ceph osd set noup
 6985  2019-10-03 13:08  ceph osd set noin
 6986  2019-10-03 13:09  ceph osd down 108 121 132 61 62 97 223 226
 6987  2019-10-03 13:09  ceph osd out 108 121 132 61 62 97 223 226
 6988  2019-10-03 13:10  ceph osd unset nobackfill
 6989  2019-10-03 13:10  ceph osd unset norebalance
...
 6992  2019-10-03 13:17  ceph osd unset noup
 6993  2019-10-03 13:18  ceph osd unset noin
...
 6997  2019-10-03 13:20  ceph osd set norebalance
 6998  2019-10-03 13:20  ceph osd set nobackfill
 6999  2019-10-03 13:20  ceph osd crush move osd.108 host=bb-17
 7000  2019-10-03 13:20  ceph osd crush move osd.121 host=bb-17
 7001  2019-10-03 13:21  ceph osd crush move osd.132 host=bb-17
 7002  2019-10-03 13:21  ceph osd crush move osd.61 host=bb-17
 7003  2019-10-03 13:21  ceph osd crush move osd.62 host=bb-17
 7004  2019-10-03 13:21  ceph osd crush move osd.97 host=bb-17
 7005  2019-10-03 13:21  ceph osd crush move osd.223 host=bb-17
 7006  2019-10-03 13:21  ceph osd crush move osd.226 host=bb-17
...
 7008  2019-10-03 13:21  ceph osd unset norebalance
 7009  2019-10-03 13:22  ceph osd unset nobackfill

- wait a bit

2019-10-03 13:22 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_OK

- apparently, it is the new OSDs in the crush map that prevent ceph from finding the shards on the old OSDs
- now add the disks by restarting the OSDs; they will move themselves back to their correct permanent crush location

 7013  2019-10-03 13:22  ceph osd set norebalance
 7014  2019-10-03 13:23  ceph osd set nobackfill
 7016  2019-10-03 13:23  ssh ceph-17 docker start osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
 7017  2019-10-03 13:24  ceph osd in 108 121 132 61 62 97 223 226

- wait for peering to finish

 7018  2019-10-03 13:24  ceph osd unset norebalance
 7019  2019-10-03 13:24  ceph osd unset nobackfill

- we end up with rebalancing, as it should be, and no degraded objects:

2019-10-03 13:24 ceph status:
  cluster:
    id:     XXX
    health: HEALTH_ERR
            23045945/145214167 objects misplaced (15.870%)
            Degraded data redundancy (low space): 33 pgs backfill_toofull

The backfill_toofull message is still annoying, but, contrary to the health report, there is no lack of redundancy: all PGs have a complete acting set.

It seems that the order of operations makes the difference. I was under the impression that ceph would first search all (relevant?) OSDs for missing objects before starting a rebuild. This is apparently not the case. Two days ago, I did not help ceph find the missing shards and ended up with a lot of partial data loss (one out of 2 parity shards on many PGs). This time I basically told ceph explicitly where to look, and things went the way I expected them to go without manual intervention. Good to know for next time.
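For next time, the whole procedure condenses to roughly the following sketch (not my literal shell history; the OSD ids, the host ceph-17, the container names and the parking bucket bb-17 are specific to this server):

  # keep data movement off while reshuffling
  ceph osd set norebalance
  ceph osd set nobackfill

  # stop the freshly deployed OSDs again, mark them down+out and park them
  # outside the data root, so peering finds all shards on the old OSDs
  ssh ceph-17 docker stop osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
  ceph osd down 108 121 132 61 62 97 223 226
  ceph osd out 108 121 132 61 62 97 223 226
  ceph osd crush move osd.108 host=bb-17   # ...repeat for each of the new OSDs

  ceph osd unset nobackfill
  ceph osd unset norebalance
  # wait for peering: the cluster should return to HEALTH_OK with only the old OSDs

  # now add the new OSDs for real; on start they move themselves back under
  # their real host bucket (osd_crush_update_on_start defaults to true)
  ceph osd set norebalance
  ceph osd set nobackfill
  ssh ceph-17 docker start osd-sde osd-sdf osd-sdg osd-sdh osd-sdi osd-sdj osd-sdk osd-sdl
  ceph osd in 108 121 132 61 62 97 223 226
  # wait for peering to finish, then let backfill/rebalance run
  ceph osd unset norebalance
  ceph osd unset nobackfill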
Best regards and thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Robert LeBlanc <robert@xxxxxxxxxxxxx>
Sent: 01 October 2019 17:13:41
To: Frank Schilder
Cc: ceph-users
Subject: Re: Objects degraded after adding disks

On Tue, Oct 1, 2019 at 5:25 AM Frank Schilder <frans@xxxxxx> wrote:
>
> I'm running a ceph fs with an 8+2 EC data pool. Disks are on 10 hosts and the failure domain is host. Version is mimic 13.2.2. Today I added a few OSDs to one of the hosts and observed that a lot of PGs became inactive even though 9 out of 10 hosts were up all the time. After getting the 10th host and all disks up, I still ended up with a large number of undersized PGs and degraded objects, which I don't understand, as no OSD was removed.
>
> Here are some details about the steps taken on the host with the new disks, main questions at the end:
>
> - shut down OSDs (systemctl stop docker)
> - reboot host (this is necessary due to OS deployment via warewulf)
>
> Devices got renamed and not all disks came back up (4 OSDs remained down). This is expected; I need to re-deploy the containers to adjust for device name changes. Around this point PGs started peering and some failed, waiting for 1 of the down OSDs. I don't understand why they didn't just remain active with 9 out of 10 disks. Until this moment of some OSDs coming up, all PGs were active. With min_size=9 I would expect all PGs to remain active with no changes to 9 out of the 10 hosts.
>
> - redeploy docker containers
> - all disks/OSDs come up, including the 4 OSDs from above
> - inactive PGs complete peering and become active
> - now I have a lot of degraded objects and undersized PGs even though not a single OSD was removed
>
> I don't understand why I have degraded objects. I should just have misplaced objects:
>
> HEALTH_ERR
> 22995992/145698909 objects misplaced (15.783%)
> Degraded data redundancy: 5213734/145698909 objects degraded (3.578%), 208 pgs degraded, 208 pgs undersized
> Degraded data redundancy (low space): 169 pgs backfill_toofull
>
> Note: The backfill_toofull with low utilization (usage: 38 TiB used, 1.5 PiB / 1.5 PiB avail) is a known issue in ceph (https://tracker.ceph.com/issues/39555)
>
> Also, I should be able to do whatever with 1 out of 10 hosts without losing data access. What could be the problem here?
>
> Questions summary:
>
> Why does peering not succeed in keeping all PGs active with 9 out of 10 OSDs up and in?

I would just double check that min_size=9 for your pool. It should be set to that, but that is the only reason I can think of that you are seeing this problem.

> Why do undersized PGs arise even though all OSDs are up?

I've noticed on my cluster that sometimes when an OSD goes down, the EC pool considers the OSD missing when it comes back online and it needs to resync. Not sure what exactly causes this to happen, but it happens more often than it should.

> Why do degraded objects arise even though no OSD was removed?

If you are writing objects while the PGs are undersized (host/OSDs down), then it will have to sync those writes to the OSDs that were down. This is the number of degraded objects.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx