Huge rebalance after rebooting OSD host (Mimic)

	Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts
running CentOS 7 with Ceph 13.2.5 Mimic. So I set the noout flag and ran
"yum -y update" on the first OSD host. This host has 8 BlueStore OSDs
with data on HDDs and the databases on LVs of two SSDs (each SSD carries
4 LVs for OSD metadata).
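
	For completeness, the sequence I used was roughly the following
(an outline from memory, not a verbatim transcript):

    # on a monitor node, before touching the OSD host
    ceph osd set noout
    # on the OSD host itself
    yum -y update
    reboot
    # after all OSDs had rejoined, clear the flag again
    ceph osd unset noout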

	Everything went OK, so I rebooted the host. After the OSD host
came back online, the cluster went from HEALTH_WARN (noout flag set)
to HEALTH_ERR and started to rebalance itself, with almost 60 % of the
objects reported as misplaced and some of them degraded. And, of course,
backfill_toofull:

  cluster:
    health: HEALTH_ERR
            2300616/3975384 objects misplaced (57.872%)
            Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 146 pgs degraded, 122 pgs undersized
            Degraded data redundancy (low space): 44 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum stratus1,stratus2,stratus3
    mgr: stratus3(active), standbys: stratus1, stratus2
    osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
    rgw: 1 daemon active
 
  data:
    pools:   9 pools, 3360 pgs
    objects: 1.33 M objects, 5.0 TiB
    usage:   25 TiB used, 465 TiB / 490 TiB avail
    pgs:     74263/3975384 objects degraded (1.868%)
             2300616/3975384 objects misplaced (57.872%)
             1739 active+remapped+backfill_wait
             1329 active+clean
             102  active+recovery_wait+remapped
             76   active+undersized+degraded+remapped+backfill_wait
             31   active+remapped+backfill_wait+backfill_toofull
             30   active+recovery_wait+undersized+degraded+remapped
             21   active+recovery_wait+degraded+remapped
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             6    active+recovery_wait+degraded
             4    active+remapped+backfill_toofull
             3    active+recovery_wait+undersized+degraded
             3    active+remapped+backfilling
             2    active+recovery_wait
             2    active+recovering+undersized
             1    active+clean+remapped
             1    active+undersized+degraded+remapped+backfill_toofull
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+remapped
 
  io:
    client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
    recovery: 142 MiB/s, 93 objects/s
 
(note that I cleared the noout flag afterwards). What is wrong here?
Why did the cluster decide to rebalance itself?
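
	If it helps with the diagnosis, I can post the output of commands
along these lines (just the checks I intend to run, e.g. to see whether
the CRUSH weights or device classes of the rebooted host changed):

    # CRUSH tree with weights and device classes
    ceph osd tree
    # per-OSD utilization, to see which OSDs are near full
    ceph osd df tree
    # details on the degraded and backfill_toofull PGs
    ceph health detail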

I am keeping the rest of the OSD hosts unrebooted for now.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups