None of the KVM / LXC instances is starting. All KVM / LXC instances use RBD.
The same pool hdd also provides the CephFS service, but this is only used for
storing KVM / LXC instance backups, ISOs and other files.

On 23.09.2019 at 08:55, Ashley Merrick wrote:
> Have you been able to start the VMs now?
>
> Are you using RBD or are the VMs hosted on CephFS?
>
>
> ---- On Mon, 23 Sep 2019 14:16:47 +0800 *Thomas Schneider <74cmonty@xxxxxxxxx>* wrote ----
>
> Hi,
>
> currently ceph -s is not reporting any unknown PGs.
> The following flags are set: nobackfill, norebalance, norecover
> There are no PGs stuck in peering either.
> And there's very little traffic on the ceph network.
>
> This is the output of today:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem has a failed mds daemon
>             1 filesystem is offline
>             insufficient standby MDS daemons available
>             nobackfill,norebalance,norecover flag(s) set
>             83 nearfull osd(s)
>             1 pool(s) nearfull
>             Degraded data redundancy: 360047/153249771 objects degraded (0.235%), 78 pgs degraded, 81 pgs undersized
>             Degraded data redundancy (low space): 265 pgs backfill_toofull
>             3 pools have too many placement groups
>
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 109m)
>     mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 368 osds: 368 up, 367 in; 398 remapped pgs
>          flags nobackfill,norebalance,norecover
>
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.08M objects, 195 TiB
>     usage:   590 TiB used, 562 TiB / 1.1 PiB avail
>     pgs:     360047/153249771 objects degraded (0.235%)
>              1998603/153249771 objects misplaced (1.304%)
>              8469 active+clean
>              124  active+remapped+backfill_toofull
>              83   active+remapped+backfill_wait+backfill_toofull
>              77   active+remapped+backfill_wait
>              45   active+undersized+degraded+remapped+backfill_toofull
>              33   active+remapped+backfilling
>              13   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>              11   active+undersized+degraded+remapped+backfilling
>              4    active+undersized+degraded+remapped+backfill_wait
>              4    active+recovery_wait+undersized+degraded+remapped
>              3    active+recovering+undersized+remapped
>              1    active+recovering+undersized+degraded+remapped
>              1    active+recovering
>
>   io:
>     client: 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
>
>   progress:
>     Rebalancing after osd.9 marked out
>       [====================..........]
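For reference, whether the hanging KVM/LXC instances are really blocked by Ceph can be narrowed down by listing the problem PGs per pool. A minimal sketch, assuming the pool names hdd and hdb_backup from the rados df output further down in this thread; run it on any admin/monitor node:

    # degraded PGs that belong to the RBD pool used by the instances
    ceph pg ls-by-pool hdd degraded
    # inactive PGs block client I/O entirely, so list them cluster-wide
    ceph pg dump_stuck inactive
    # recovery/backfill summary for the pool
    ceph osd pool stats hdd

If none of the degraded or inactive PGs map to pool hdd, the hanging boots are more likely caused by the slow requests on the hdd OSDs than by unavailable data.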
>
> Two days earlier the output was:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem has a failed mds daemon
>             1 filesystem is offline
>             insufficient standby MDS daemons available
>             nobackfill,norebalance,norecover flag(s) set
>             2 backfillfull osd(s)
>             86 nearfull osd(s)
>             1 pool(s) backfillfull
>             Reduced data availability: 75 pgs inactive, 74 pgs peering
>             Degraded data redundancy: 364117/154942251 objects degraded (0.235%), 76 pgs degraded, 76 pgs undersized
>             Degraded data redundancy (low space): 309 pgs backfill_toofull
>             3 pools have too many placement groups
>             105 slow requests are blocked > 32 sec
>             91 stuck requests are blocked > 4096 sec
>
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 9h)
>     mgr: ld5505(active, since 9h), standbys: ld5506, ld5507
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 368 osds: 368 up, 367 in; 400 remapped pgs
>          flags nobackfill,norebalance,norecover
>
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.65M objects, 197 TiB
>     usage:   596 TiB used, 554 TiB / 1.1 PiB avail
>     pgs:     0.011% pgs unknown
>              0.834% pgs not active
>              364117/154942251 objects degraded (0.235%)
>              2003579/154942251 objects misplaced (1.293%)
>              8395 active+clean
>              209  active+remapped+backfill_toofull
>              69   active+undersized+degraded+remapped+backfill_toofull
>              49   active+remapped+backfill_wait
>              37   peering
>              37   remapped+peering
>              31   active+remapped+backfill_wait+backfill_toofull
>              17   active+clean+scrubbing+deep
>              14   active+clean+scrubbing
>              4    active+undersized+degraded
>              2    active+undersized+degraded+remapped+backfill_wait
>              1    unknown
>              1    active+clean+remapped
>              1    active+remapped+backfilling
>              1    active+undersized+degraded+remapped+backfilling
>
>   io:
>     client: 256 MiB/s wr, 0 op/s rd, 99 op/s wr
>
>   progress:
>     Rebalancing after osd.9 marked out
>       [==............................]
>
>
> As you can see, there is very little progress in "Rebalancing after osd.9
> marked out"; the number of objects degraded dropped to 360047/153249771
> and the number of objects misplaced to 1998603/153249771.
>
> That is very little progress for two days with no activity on the cluster
> (over the weekend).
>
>
>
> On 21.09.2019 at 20:39, Paul Emmerich wrote:
> > On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
> >> Hello,
> >>
> >> I have re-created the OSDs using these disks.
> >> Can I still export the affected PGs manually?
> > No, the data probably cannot be recovered in this case :(
> > (It might still be somewhere on the disk if it hasn't been overwritten
> > yet, but it's virtually impossible to recover it: the metadata has
> > almost certainly long been overwritten)
> >
> > But that's only for the 17 PGs showing up as unknown. The ones stuck in
> > peering can probably be revived, but without the latest writes. Can you
> > run "ceph pg X.YX query" on one of the PGs stuck in peering?
> > It might tell you what's wrong and how to proceed.
> >
> > But even 17 PGs will probably affect almost all of your VMs/containers...
> >
> > Paul
> >
> >> Regards
> >> Thomas
> >>
> >>
> >> On 20.09.19 at 21:15, Paul Emmerich wrote:
> >>> On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> I cannot get rid of
> >>>>     pgs unknown
> >>>> because there were 3 disks that couldn't be started.
> >>>> Therefore I destroyed the relevant OSDs and re-created them for the
> >>>> relevant disks.
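As a concrete version of the check Paul suggests above: list the PGs that are stuck inactive/peering and query one of them. The PG id below (1.2f3) is only a placeholder; substitute a real id from the first command:

    # PGs that are not active (includes peering and unknown PGs)
    ceph pg dump_stuck inactive
    # query one stuck PG; 1.2f3 is a made-up example id
    ceph pg 1.2f3 query

In the query output, the recovery_state section usually explains the blocker; entries such as peering_blocked_by or down_osds_we_would_probe point at the OSDs the PG is waiting for.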
> >>> and you had it configured to run with replica 3? Well, I guess the
> >>> down PGs were located on these three disks that you wiped.
> >>>
> >>> Do you still have the disks? Use ceph-objectstore-tool to export the
> >>> affected PGs manually and inject them into another OSD.
> >>>
> >>>
> >>> Paul
> >>>
> >>>> Then I added the 3 OSDs to the crushmap.
> >>>>
> >>>> Regards
> >>>> Thomas
> >>>>
> >>>> On 20.09.2019 at 08:19, Ashley Merrick wrote:
> >>>>> You need to fix this first:
> >>>>>
> >>>>>     pgs: 0.056% pgs unknown
> >>>>>          0.553% pgs not active
> >>>>>
> >>>>> The backfilling will cause slow I/O, but having PGs unknown and not
> >>>>> active will cause I/O blocking, which is what you're seeing with the
> >>>>> VMs not booting.
> >>>>>
> >>>>> It seems you have 4 OSDs down; if you get them back online you should
> >>>>> be able to get all the PGs online.
> >>>>>
> >>>>>
> >>>>> ---- On Fri, 20 Sep 2019 14:14:01 +0800 *Thomas <74cmonty@xxxxxxxxx>* wrote ----
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> here I describe one of the two major issues I'm currently facing in my
> >>>>> 8-node ceph cluster (2x MDS, 6x OSD).
> >>>>>
> >>>>> The issue is that I cannot start any KVM virtual machine or LXC
> >>>>> container; the boot process just hangs after a few seconds.
> >>>>> All these KVMs and LXCs have in common that their virtual disks reside
> >>>>> in the same pool: hdd
> >>>>>
> >>>>> This pool hdd is relatively small compared to the largest pool:
> >>>>> hdb_backup
> >>>>> root@ld3955:~# rados df
> >>>>> POOL_NAME              USED  OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS      WR  USED COMPR  UNDER COMPR
> >>>>> backup                  0 B        0       0          0                   0        0         0          0      0 B          0     0 B         0 B          0 B
> >>>>> hdb_backup          589 TiB 51262212       0  153786636                   0        0    124895   12266095  4.3 TiB  247132863 463 TiB         0 B          0 B
> >>>>> hdd                 3.2 TiB   281884    6568     845652                   0        0      1658  275277357   16 TiB  208213922  10 TiB         0 B          0 B
> >>>>> pve_cephfs_data     955 GiB    91832       0     275496                   0        0      3038       2103 1021 MiB     102170 318 GiB         0 B          0 B
> >>>>> pve_cephfs_metadata 486 MiB       62       0        186                   0        0         7        860  1.4 GiB      12393 166 MiB         0 B          0 B
> >>>>>
> >>>>> total_objects    51635990
> >>>>> total_used       597 TiB
> >>>>> total_avail      522 TiB
> >>>>> total_space      1.1 PiB
> >>>>>
> >>>>> This is the current health status of the ceph cluster:
> >>>>>   cluster:
> >>>>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
> >>>>>     health: HEALTH_ERR
> >>>>>             1 filesystem is degraded
> >>>>>             1 MDSs report slow metadata IOs
> >>>>>             1 backfillfull osd(s)
> >>>>>             87 nearfull osd(s)
> >>>>>             1 pool(s) backfillfull
> >>>>>             Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
> >>>>>             Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
> >>>>>             Degraded data redundancy (low space): 322 pgs backfill_toofull
> >>>>>             1 subtrees have overcommitted pool target_size_bytes
> >>>>>             1 subtrees have overcommitted pool target_size_ratio
> >>>>>             1 pools have too many placement groups
> >>>>>             21 slow requests are blocked > 32 sec
> >>>>>
> >>>>>   services:
> >>>>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
> >>>>>     mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
> >>>>>     mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
> >>>>>     osd: 360 osds: 356 up, 356 in; 382 remapped pgs
> >>>>>
> >>>>>   data:
> >>>>>     pools:   5 pools, 8868 pgs
> >>>>>     objects: 51.64M objects, 197 TiB
> >>>>>     usage:   597 TiB used, 522 TiB / 1.1 PiB avail
> >>>>>     pgs:     0.056% pgs unknown
> >>>>>              0.553% pgs not active
> >>>>>              129598/154907946 objects degraded (0.084%)
> >>>>>              2211119/154907946 objects misplaced (1.427%)
> >>>>>              8458 active+clean
> >>>>>              298  active+remapped+backfill_toofull
> >>>>>              29   remapped+peering
> >>>>>              24   active+undersized+degraded+remapped+backfill_toofull
> >>>>>              22   active+remapped+backfill_wait
> >>>>>              17   peering
> >>>>>              5    unknown
> >>>>>              5    active+recovery_wait+undersized+degraded+remapped
> >>>>>              3    active+undersized+degraded+remapped+backfill_wait
> >>>>>              2    activating+remapped
> >>>>>              1    active+clean+remapped
> >>>>>              1    stale+peering
> >>>>>              1    active+remapped+backfilling
> >>>>>              1    active+recovering+undersized+remapped
> >>>>>              1    active+recovery_wait+degraded
> >>>>>
> >>>>>   io:
> >>>>>     client: 9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
> >>>>>
> >>>>> I believe the cluster is busy with rebalancing pool hdb_backup.
> >>>>> I set the balancer mode to upmap recently, after the 589 TiB of data
> >>>>> had been written.
> >>>>> root@ld3955:~# ceph balancer status
> >>>>> {
> >>>>>     "active": true,
> >>>>>     "plans": [],
> >>>>>     "mode": "upmap"
> >>>>> }
> >>>>>
> >>>>>
> >>>>> In order to resolve the issue with pool hdd I started some
> >>>>> investigation.
> >>>>> The first step was to install the drivers for the NIC provided by
> >>>>> Mellanox.
> >>>>> Then I configured some kernel parameters recommended by Mellanox
> >>>>> <https://community.mellanox.com/s/article/linux-sysctl-tuning>.
> >>>>>
> >>>>> However, this didn't fix the issue.
> >>>>> In my opinion I must get rid of all the "slow requests are blocked"
> >>>>> messages.
> >>>>>
> >>>>> When I check the output of ceph health detail, every OSD listed under
> >>>>> REQUEST_SLOW belongs to pool hdd.
> >>>>> This means that none of the disks belonging to pool hdb_backup shows
> >>>>> comparable behaviour.
> >>>>>
> >>>>> Then I checked the running processes on the different OSD nodes; I use
> >>>>> the tool "glances" here.
> >>>>> Here I can see individual processes that have been running for hours
> >>>>> and are consuming a lot of CPU, e.g.
> >>>>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33 58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
> >>>>> 34.2 0.2 4.31G 1.20G 971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
> >>>>>
> >>>>> Similar processes are running on 4 OSD nodes.
> >>>>> All these processes have in common that the relevant OSD belongs to
> >>>>> pool hdd.
> >>>>>
> >>>>> Furthermore, glances gives me this alert:
> >>>>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd, ceph-osd, ceph-osd
> >>>>>
> >>>>> What can / should I do now?
> >>>>> Kill the long-running processes?
> >>>>> Stop the relevant OSDs?
> >>>>>
> >>>>> Please advise.
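Rather than killing the busy ceph-osd processes, it is usually more informative to look at what their slow requests are waiting on first. A sketch, using osd.37 from the glances output above; the ceph daemon commands have to be run on the node that hosts that OSD:

    # operations currently blocked / in flight on this OSD
    ceph daemon osd.37 dump_blocked_ops
    ceph daemon osd.37 dump_ops_in_flight
    # cluster-wide view of which OSDs the slow requests point at
    ceph health detail

Stopping or killing an OSD daemon only helps if it is genuinely hung; otherwise it just adds more degraded PGs to an already nearfull cluster.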
> >>>>>
> >>>>> THX
> >>>>> Thomas

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx