Re: Huge rebalance after rebooting OSD host (Mimic)


Are you sure your OSDs are up and reachable? (Run "ceph osd tree" on 
another node.)
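
A minimal sketch of that check, to be run on a monitor or admin node other than the rebooted host. The helper name count_down_osds is hypothetical, and it assumes the default plain-text layout of "ceph osd tree", in which each OSD row has a field such as "osd.12" followed directly by its status ("up" or "down"):

```shell
# Count OSDs reported "down" in `ceph osd tree` text output read from stdin.
# Assumption: each OSD row contains "osd.<id>" immediately followed by its status.
count_down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") n++
    }
    END { print n + 0 }'
}

# Usage (commented out so the sketch has no side effects when sourced):
#   ceph osd tree | count_down_osds   # number of OSDs marked down
#   ceph osd stat                     # summary of OSDs up / in
#   ceph health detail                # which PGs and OSDs the warnings refer to
```

A non-zero count right after the reboot would suggest that some OSDs on that host did not come back or cannot reach the rest of the cluster.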



-----Original Message-----
From: Jan Kasprzak [mailto:kas@xxxxxxxxxx] 
Sent: Wednesday, 15 May 2019 14:46
To: ceph-users@xxxxxxxx
Subject:  Huge rebalance after rebooting OSD host (Mimic)

	Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts with CentOS 
7, Ceph 13.2.5 Mimic. So I set a noout flag and ran "yum -y update" on 
the first OSD host. This host has 8 bluestore OSDs with data on HDDs and 
database on LVs of two SSDs (each SSD has 4 LVs for OSD metadata).

	Everything went OK, so I rebooted this host. After the OSD host 
came back online, the cluster went from HEALTH_WARN (noout flag set) to 
HEALTH_ERR and started to rebalance itself, reporting almost 60% of 
objects misplaced and some of them degraded. And, of course, some PGs 
are backfill_toofull:

  cluster:
    health: HEALTH_ERR
            2300616/3975384 objects misplaced (57.872%)
            Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 146 pgs degraded, 122 pgs undersized
            Degraded data redundancy (low space): 44 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum stratus1,stratus2,stratus3
    mgr: stratus3(active), standbys: stratus1, stratus2
    osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
    rgw: 1 daemon active
 
  data:
    pools:   9 pools, 3360 pgs
    objects: 1.33 M objects, 5.0 TiB
    usage:   25 TiB used, 465 TiB / 490 TiB avail
    pgs:     74263/3975384 objects degraded (1.868%)
             2300616/3975384 objects misplaced (57.872%)
             1739 active+remapped+backfill_wait
             1329 active+clean
             102  active+recovery_wait+remapped
             76   active+undersized+degraded+remapped+backfill_wait
             31   active+remapped+backfill_wait+backfill_toofull
             30   active+recovery_wait+undersized+degraded+remapped
             21   active+recovery_wait+degraded+remapped
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             6    active+recovery_wait+degraded
             4    active+remapped+backfill_toofull
             3    active+recovery_wait+undersized+degraded
             3    active+remapped+backfilling
             2    active+recovery_wait
             2    active+recovering+undersized
             1    active+clean+remapped
             1    active+undersized+degraded+remapped+backfill_toofull
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+remapped
 
  io:
    client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
    recovery: 142 MiB/s, 93 objects/s
 
(Note that I cleared the noout flag afterwards.) What is wrong with it?
Why did the cluster decide to rebalance itself?

I am keeping the rest of the OSD hosts unrebooted for now.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




