On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
>
> Hello,
>
> I have re-created the OSDs using these disks.
> Can I still export the affected PGs manually?

No, the data probably cannot be recovered in this case :(
(It might still be somewhere on the disk if it hasn't been overwritten
yet, but it's virtually impossible to recover it: the metadata has
almost certainly long been overwritten.)

But that's only for 17 PGs showing up as unknown. The ones stuck in
peering can probably be revived, but without the latest writes.

Can you run "ceph pg X.YX query" on one of the PGs stuck in peering?
It might tell you what's wrong and how to proceed.

But even 17 PGs will probably affect almost all of your VMs/containers...

Paul

>
> Regards
> Thomas
>
>
> On 20.09.19 at 21:15, Paul Emmerich wrote:
> > On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
> >> Hi,
> >>
> >> I cannot get rid of
> >> pgs unknown
> >> because there were 3 disks that couldn't be started.
> >> Therefore I destroyed the relevant OSDs and re-created them for the
> >> relevant disks.
> > And you had it configured to run with replica 3? Well, I guess the
> > down PGs were located on these three disks that you wiped.
> >
> > Do you still have the disks? Use ceph-objectstore-tool to export the
> > affected PGs manually and inject them into another OSD.
> >
> >
> > Paul
> >
> >> Then I added the 3 OSDs to the crushmap.
> >>
> >> Regards
> >> Thomas
> >>
> >> On 20.09.2019 at 08:19, Ashley Merrick wrote:
> >>> You need to fix this first.
> >>>
> >>>     pgs: 0.056% pgs unknown
> >>>          0.553% pgs not active
> >>>
> >>> The backfilling will cause slow I/O, but having PGs unknown and not
> >>> active will cause the I/O blocking which you're seeing with the VMs
> >>> booting.
> >>>
> >>> Seems you have 4 OSDs down; if you get them back online you should be
> >>> able to get all the PGs online.
> >>>
> >>>
> >>> ---- On Fri, 20 Sep 2019 14:14:01 +0800 Thomas <74cmonty@xxxxxxxxx>
> >>> wrote ----
> >>>
> >>> Hi,
> >>>
> >>> here I describe 1 of the 2 major issues I'm currently facing in my
> >>> 8-node ceph cluster (2x MDS, 6x OSD).
> >>>
> >>> The issue is that I cannot start any virtual machine (KVM) or
> >>> container (LXC); the boot process just hangs after a few seconds.
> >>> All these KVMs and LXCs have in common that their virtual disks
> >>> reside in the same pool: hdd
> >>>
> >>> This pool hdd is relatively small compared to the largest pool:
> >>> hdb_backup
> >>>
> >>> root@ld3955:~# rados df
> >>> POOL_NAME               USED   OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
> >>> backup                   0 B         0       0          0                   0        0         0          0      0 B          0      0 B         0 B          0 B
> >>> hdb_backup           589 TiB  51262212       0  153786636                   0        0    124895   12266095  4.3 TiB  247132863  463 TiB         0 B          0 B
> >>> hdd                  3.2 TiB    281884    6568     845652                   0        0      1658  275277357   16 TiB  208213922   10 TiB         0 B          0 B
> >>> pve_cephfs_data      955 GiB     91832       0     275496                   0        0      3038       2103 1021 MiB     102170  318 GiB         0 B          0 B
> >>> pve_cephfs_metadata  486 MiB        62       0        186                   0        0         7        860  1.4 GiB      12393  166 MiB         0 B          0 B
> >>>
> >>> total_objects    51635990
> >>> total_used       597 TiB
> >>> total_avail      522 TiB
> >>> total_space      1.1 PiB
> >>>
> >>> This is the current health status of the ceph cluster:
> >>>   cluster:
> >>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
> >>>     health: HEALTH_ERR
> >>>             1 filesystem is degraded
> >>>             1 MDSs report slow metadata IOs
> >>>             1 backfillfull osd(s)
> >>>             87 nearfull osd(s)
> >>>             1 pool(s) backfillfull
> >>>             Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
> >>>             Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
> >>>             Degraded data redundancy (low space): 322 pgs backfill_toofull
> >>>             1 subtrees have overcommitted pool target_size_bytes
> >>>             1 subtrees have overcommitted pool target_size_ratio
> >>>             1 pools have too many placement groups
> >>>             21 slow requests are blocked > 32 sec
> >>>
> >>>   services:
> >>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
> >>>     mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
> >>>     mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
> >>>     osd: 360 osds: 356 up, 356 in; 382 remapped pgs
> >>>
> >>>   data:
> >>>     pools:   5 pools, 8868 pgs
> >>>     objects: 51.64M objects, 197 TiB
> >>>     usage:   597 TiB used, 522 TiB / 1.1 PiB avail
> >>>     pgs:     0.056% pgs unknown
> >>>              0.553% pgs not active
> >>>              129598/154907946 objects degraded (0.084%)
> >>>              2211119/154907946 objects misplaced (1.427%)
> >>>              8458 active+clean
> >>>              298  active+remapped+backfill_toofull
> >>>              29   remapped+peering
> >>>              24   active+undersized+degraded+remapped+backfill_toofull
> >>>              22   active+remapped+backfill_wait
> >>>              17   peering
> >>>              5    unknown
> >>>              5    active+recovery_wait+undersized+degraded+remapped
> >>>              3    active+undersized+degraded+remapped+backfill_wait
> >>>              2    activating+remapped
> >>>              1    active+clean+remapped
> >>>              1    stale+peering
> >>>              1    active+remapped+backfilling
> >>>              1    active+recovering+undersized+remapped
> >>>              1    active+recovery_wait+degraded
> >>>
> >>>   io:
> >>>     client:   9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
> >>>
> >>> I believe the cluster is busy with rebalancing pool hdb_backup.
> >>> I set the balancer mode to upmap recently, after the 589 TiB of data
> >>> had been written.
> >>> root@ld3955:~# ceph balancer status
> >>> {
> >>>     "active": true,
> >>>     "plans": [],
> >>>     "mode": "upmap"
> >>> }
> >>>
> >>> In order to resolve the issue with pool hdd I started some
> >>> investigation.
> >>> The first step was to install the drivers for the NIC provided by
> >>> Mellanox.
> >>> Then I configured some kernel parameters recommended by Mellanox
> >>> (https://community.mellanox.com/s/article/linux-sysctl-tuning).
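For context: the Mellanox article linked above covers generic network
stack tuning (larger socket buffers, a bigger device backlog and similar
knobs). A sketch of settings in that spirit is shown below; the concrete
values are illustrative placeholders, not necessarily the exact numbers
from the article or the ones applied on this cluster:

    # /etc/sysctl.d/99-network-tuning.conf  (illustrative values only)
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 4194304
    net.core.wmem_default = 4194304
    net.core.netdev_max_backlog = 250000
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216

They can be applied without a reboot via "sysctl --system", which reads
every file under /etc/sysctl.d/, so they also survive the next reboot.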
> >>>
> >>> However this didn't fix the issue.
> >>> In my opinion I must get rid of all "slow requests are blocked".
> >>>
> >>> When I check the output of ceph health detail, any OSD listed under
> >>> REQUEST_SLOW belongs to pool hdd.
> >>> This means none of the disks belonging to pool hdb_backup is showing
> >>> comparable behaviour.
> >>>
> >>> Then I checked the running processes on the different OSD nodes; I
> >>> use the tool "glances" here.
> >>> Here I can see individual processes that have been running for hours
> >>> and consuming a lot of CPU, e.g.
> >>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33  58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
> >>> 34.2 0.2 4.31G 1.20G  971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
> >>>
> >>> Similar processes are running on 4 OSD nodes.
> >>> All these processes have in common that the relevant OSD belongs to
> >>> pool hdd.
> >>>
> >>> Furthermore glances gives me this alert:
> >>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd,
> >>> ceph-osd, ceph-osd
> >>>
> >>> What can / should I do now?
> >>> Kill the long-running processes?
> >>> Stop the relevant OSDs?
> >>>
> >>> Please advise.
> >>>
> >>> THX
> >>> Thomas
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
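For reference, a rough sketch of the two steps suggested at the top of
this thread: inspecting a PG stuck in peering with "ceph pg ... query",
and the ceph-objectstore-tool export/import that Paul mentions in the
quoted mail. The PG ID, OSD IDs and export path below are placeholders,
not values from this cluster; the tool must only be run against stopped
OSDs, and an import only helps if the old disk still holds an intact
copy of the PG:

    # pick a PG that is stuck inactive/peering and look at its recovery_state
    ceph pg dump_stuck inactive
    ceph pg 1.2f query | less        # 1.2f is a placeholder PG ID

    # export the PG from the old, stopped OSD that still has a copy
    systemctl stop ceph-osd@11       # 11 = placeholder ID of the source OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
        --pgid 1.2f --op export --file /root/pg-1.2f.export

    # import it into another (also stopped) healthy OSD, then start it again
    systemctl stop ceph-osd@12       # 12 = placeholder ID of the target OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 1.2f --op import --file /root/pg-1.2f.export
    systemctl start ceph-osd@12

Once the target OSD is back up, the cluster can peer the recovered PG;
keep the export file around until the PG reports active+clean again.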