Hi,

currently ceph -s is not reporting any unknown PGs.
The following flags are set: nobackfill, norebalance, norecover.
There are no PGs stuck in peering either, and there's very little traffic on the ceph network.

This is today's output:

root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            insufficient standby MDS daemons available
            nobackfill,norebalance,norecover flag(s) set
            83 nearfull osd(s)
            1 pool(s) nearfull
            Degraded data redundancy: 360047/153249771 objects degraded (0.235%), 78 pgs degraded, 81 pgs undersized
            Degraded data redundancy (low space): 265 pgs backfill_toofull
            3 pools have too many placement groups

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 109m)
    mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
    mds: pve_cephfs:0/1, 1 failed
    osd: 368 osds: 368 up, 367 in; 398 remapped pgs
         flags nobackfill,norebalance,norecover

  data:
    pools:   5 pools, 8868 pgs
    objects: 51.08M objects, 195 TiB
    usage:   590 TiB used, 562 TiB / 1.1 PiB avail
    pgs:     360047/153249771 objects degraded (0.235%)
             1998603/153249771 objects misplaced (1.304%)
             8469 active+clean
             124  active+remapped+backfill_toofull
             83   active+remapped+backfill_wait+backfill_toofull
             77   active+remapped+backfill_wait
             45   active+undersized+degraded+remapped+backfill_toofull
             33   active+remapped+backfilling
             13   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             11   active+undersized+degraded+remapped+backfilling
             4    active+undersized+degraded+remapped+backfill_wait
             4    active+recovery_wait+undersized+degraded+remapped
             3    active+recovering+undersized+remapped
             1    active+recovering+undersized+degraded+remapped
             1    active+recovering

  io:
    client:   5.6 KiB/s wr, 0 op/s rd, 0 op/s wr

  progress:
    Rebalancing after osd.9 marked out
      [====================..........]
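For reference: as long as the nobackfill, norebalance and norecover flags are set, Ceph pauses backfill and recovery, so little movement in the degraded/misplaced counters would be expected while they stay set. A minimal sketch of checking and clearing them with the standard CLI, only to be run once data movement should actually resume:

    # show the currently set cluster-wide flags
    ceph osd dump | grep flags

    # re-enable recovery, backfill and rebalancing (the reverse of "ceph osd set ...")
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset norebalance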
Two days before, the output was:

root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            insufficient standby MDS daemons available
            nobackfill,norebalance,norecover flag(s) set
            2 backfillfull osd(s)
            86 nearfull osd(s)
            1 pool(s) backfillfull
            Reduced data availability: 75 pgs inactive, 74 pgs peering
            Degraded data redundancy: 364117/154942251 objects degraded (0.235%), 76 pgs degraded, 76 pgs undersized
            Degraded data redundancy (low space): 309 pgs backfill_toofull
            3 pools have too many placement groups
            105 slow requests are blocked > 32 sec
            91 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 9h)
    mgr: ld5505(active, since 9h), standbys: ld5506, ld5507
    mds: pve_cephfs:0/1, 1 failed
    osd: 368 osds: 368 up, 367 in; 400 remapped pgs
         flags nobackfill,norebalance,norecover

  data:
    pools:   5 pools, 8868 pgs
    objects: 51.65M objects, 197 TiB
    usage:   596 TiB used, 554 TiB / 1.1 PiB avail
    pgs:     0.011% pgs unknown
             0.834% pgs not active
             364117/154942251 objects degraded (0.235%)
             2003579/154942251 objects misplaced (1.293%)
             8395 active+clean
             209  active+remapped+backfill_toofull
             69   active+undersized+degraded+remapped+backfill_toofull
             49   active+remapped+backfill_wait
             37   peering
             37   remapped+peering
             31   active+remapped+backfill_wait+backfill_toofull
             17   active+clean+scrubbing+deep
             14   active+clean+scrubbing
             4    active+undersized+degraded
             2    active+undersized+degraded+remapped+backfill_wait
             1    unknown
             1    active+clean+remapped
             1    active+remapped+backfilling
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   256 MiB/s wr, 0 op/s rd, 99 op/s wr

  progress:
    Rebalancing after osd.9 marked out
      [==............................]

As you can see, there's very little progress in "Rebalancing after osd.9 marked out"; the number of objects degraded dropped to 360047/153249771 and the number of objects misplaced to 1998603/153249771.
I think this is very little progress for 2 days with no activity on the cluster (over the weekend).
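For reference, Paul's two suggestions in the thread quoted below would look roughly like this. The PG id, OSD ids and export path are placeholders, and the export/import part only helps if an OSD data directory still holding the PG exists (which Paul doubts for the wiped disks):

    # query a PG that is stuck peering (pick a real id from "ceph health detail"
    # or "ceph pg dump_stuck")
    ceph pg 1.2f3 query

    # export a PG from a stopped OSD and import it into another stopped OSD
    systemctl stop ceph-osd@123
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 \
        --pgid 1.2f3 --op export --file /root/pg-1.2f3.export
    systemctl stop ceph-osd@124
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-124 \
        --pgid 1.2f3 --op import --file /root/pg-1.2f3.export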
On 21.09.2019 at 20:39, Paul Emmerich wrote:
> On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
>> Hello,
>>
>> I have re-created the OSDs using these disks.
>> Can I still export the affected PGs manually?
> No, the data probably cannot be recovered in this case :(
> (It might still be somewhere on the disk if it hasn't been overwritten
> yet, but it's virtually impossible to recover it: the metadata has
> almost certainly long been overwritten.)
>
> But that's only for the 17 PGs showing up as unknown. The ones stuck in
> peering can probably be revived, but without the latest writes. Can you
> run "ceph pg X.YX query" on one of the PGs stuck in peering?
> It might tell you what's wrong and how to proceed.
>
> But even 17 PGs will probably affect almost all of your VMs/containers...
>
> Paul
>
>> Regards
>> Thomas
>>
>>
>> On 20.09.19 at 21:15, Paul Emmerich wrote:
>>> On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> I cannot get rid of
>>>> pgs unknown
>>>> because there were 3 disks that couldn't be started.
>>>> Therefore I destroyed the relevant OSDs and re-created them for the
>>>> relevant disks.
>>> And you had it configured to run with replica 3? Well, I guess the
>>> down PGs were located on the three disks that you wiped.
>>>
>>> Do you still have the disks? Use ceph-objectstore-tool to export the
>>> affected PGs manually and inject them into another OSD.
>>>
>>>
>>> Paul
>>>
>>>> Then I added the 3 OSDs to the crushmap.
>>>>
>>>> Regards
>>>> Thomas
>>>>
>>>> On 20.09.2019 at 08:19, Ashley Merrick wrote:
>>>>> You need to fix this first:
>>>>>
>>>>> pgs: 0.056% pgs unknown
>>>>>      0.553% pgs not active
>>>>>
>>>>> The backfilling will cause slow I/O, but having PGs unknown and not
>>>>> active will cause the I/O blocking you're seeing with the VM booting.
>>>>>
>>>>> It seems you have 4 OSDs down; if you get them back online you should
>>>>> be able to get all the PGs online.
>>>>>
>>>>>
>>>>> ---- On Fri, 20 Sep 2019 14:14:01 +0800 Thomas <74cmonty@xxxxxxxxx> wrote ----
>>>>>
>>>>> Hi,
>>>>>
>>>>> here I describe 1 of the 2 major issues I'm currently facing in my
>>>>> 8-node ceph cluster (2x MDS, 6x OSD).
>>>>>
>>>>> The issue is that I cannot start any virtual machine (KVM) or container
>>>>> (LXC); the boot process just hangs after a few seconds.
>>>>> All these KVMs and LXCs have in common that their virtual disks reside
>>>>> in the same pool: hdd
>>>>>
>>>>> This pool hdd is relatively small compared to the largest pool, hdb_backup:
>>>>> root@ld3955:~# rados df
>>>>> POOL_NAME               USED  OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
>>>>> backup                   0 B        0       0          0                   0        0         0          0      0 B          0      0 B         0 B          0 B
>>>>> hdb_backup           589 TiB 51262212       0  153786636                   0        0    124895   12266095  4.3 TiB  247132863  463 TiB         0 B          0 B
>>>>> hdd                  3.2 TiB   281884    6568     845652                   0        0      1658  275277357   16 TiB  208213922   10 TiB         0 B          0 B
>>>>> pve_cephfs_data      955 GiB    91832       0     275496                   0        0      3038       2103 1021 MiB     102170  318 GiB         0 B          0 B
>>>>> pve_cephfs_metadata  486 MiB       62       0        186                   0        0         7        860  1.4 GiB      12393  166 MiB         0 B          0 B
>>>>>
>>>>> total_objects    51635990
>>>>> total_used       597 TiB
>>>>> total_avail      522 TiB
>>>>> total_space      1.1 PiB
>>>>>
>>>>> This is the current health status of the ceph cluster:
>>>>>   cluster:
>>>>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>>>>     health: HEALTH_ERR
>>>>>             1 filesystem is degraded
>>>>>             1 MDSs report slow metadata IOs
>>>>>             1 backfillfull osd(s)
>>>>>             87 nearfull osd(s)
>>>>>             1 pool(s) backfillfull
>>>>>             Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
>>>>>             Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
>>>>>             Degraded data redundancy (low space): 322 pgs backfill_toofull
>>>>>             1 subtrees have overcommitted pool target_size_bytes
>>>>>             1 subtrees have overcommitted pool target_size_ratio
>>>>>             1 pools have too many placement groups
>>>>>             21 slow requests are blocked > 32 sec
>>>>>
>>>>>   services:
>>>>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
>>>>>     mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
>>>>>     mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
>>>>>     osd: 360 osds: 356 up, 356 in; 382 remapped pgs
>>>>>
>>>>>   data:
>>>>>     pools:   5 pools, 8868 pgs
>>>>>     objects: 51.64M objects, 197 TiB
>>>>>     usage:   597 TiB used, 522 TiB / 1.1 PiB avail
>>>>>     pgs:     0.056% pgs unknown
>>>>>              0.553% pgs not active
>>>>>              129598/154907946 objects degraded (0.084%)
>>>>>              2211119/154907946 objects misplaced (1.427%)
>>>>>              8458 active+clean
>>>>>              298  active+remapped+backfill_toofull
>>>>>              29   remapped+peering
>>>>>              24   active+undersized+degraded+remapped+backfill_toofull
>>>>>              22   active+remapped+backfill_wait
>>>>>              17   peering
>>>>>              5    unknown
>>>>>              5    active+recovery_wait+undersized+degraded+remapped
>>>>>              3    active+undersized+degraded+remapped+backfill_wait
>>>>>              2    activating+remapped
>>>>>              1    active+clean+remapped
>>>>>              1    stale+peering
>>>>>              1    active+remapped+backfilling
>>>>>              1    active+recovering+undersized+remapped
>>>>>              1    active+recovery_wait+degraded
>>>>>
>>>>>   io:
>>>>>     client:   9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
>>>>>
>>>>> I believe the cluster is busy with rebalancing pool hdb_backup.
>>>>> I set the balancer mode to upmap recently, after the 589 TB of data was written.
>>>>> root@ld3955:~# ceph balancer status
>>>>> {
>>>>>     "active": true,
>>>>>     "plans": [],
>>>>>     "mode": "upmap"
>>>>> }
>>>>>
>>>>> In order to resolve the issue with pool hdd I started some investigation.
>>>>> The first step was to install the drivers provided by Mellanox for the NIC.
>>>>> Then I configured some kernel parameters recommended by Mellanox
>>>>> <https://community.mellanox.com/s/article/linux-sysctl-tuning>.
>>>>>
>>>>> However, this didn't fix the issue.
>>>>> In my opinion I must get rid of all "slow requests are blocked".
>>>>>
>>>>> When I check the output of ceph health detail, every OSD listed under
>>>>> REQUEST_SLOW points to an OSD that belongs to pool hdd.
>>>>> This means none of the disks belonging to pool hdb_backup shows
>>>>> comparable behaviour.
>>>>>
>>>>> Then I checked the running processes on the different OSD nodes; I use
>>>>> the tool "glances" for this.
>>>>> Here I can see individual processes that have been running for hours and
>>>>> are consuming a lot of CPU, e.g.:
>>>>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33  58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
>>>>> 34.2 0.2 4.31G 1.20G  971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
>>>>>
>>>>> Similar processes are running on 4 OSD nodes.
>>>>> All these processes have in common that the relevant OSD belongs to pool hdd.
>>>>>
>>>>> Furthermore, glances gives me this alert:
>>>>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd, ceph-osd, ceph-osd
>>>>>
>>>>> What can / should I do now?
>>>>> Kill the long-running processes?
>>>>> Stop the relevant OSDs?
>>>>>
>>>>> Please advise.
>>>>>
>>>>> THX
>>>>> Thomas
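For completeness, the slow requests from the quoted status can be narrowed down per OSD via the admin socket on the node that hosts the OSD. A rough sketch (osd.37 below is just the example id from the glances output above, not a recommendation):

    # which OSDs currently have slow/blocked requests
    ceph health detail

    # on the host running osd.37: show in-flight and recently completed ops
    ceph daemon osd.37 dump_ops_in_flight
    ceph daemon osd.37 dump_historic_ops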
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx