None of the KVM / LXC instances is starting. All KVM / LXC instances use RBD.
The same pool hdd also provides the CephFS service, but this is only used for
storing KVM / LXC instance backups, ISOs and other files.

On 23.09.2019 at 08:55, Ashley Merrick wrote:
> Have you been able to start the VMs now?
>
> Are you using RBD or are the VMs hosted on CephFS?
>
>
> ---- On Mon, 23 Sep 2019 14:16:47 +0800 *Thomas Schneider <74cmonty@xxxxxxxxx>* wrote ----
>
> Hi,
>
> currently ceph -s is not reporting any unknown PGs.
> The following flags are set: nobackfill, norebalance, norecover
> There are no PGs stuck in peering either.
> And there's very little traffic on the ceph network.
>
> This is the output of today:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem has a failed mds daemon
>             1 filesystem is offline
>             insufficient standby MDS daemons available
>             nobackfill,norebalance,norecover flag(s) set
>             83 nearfull osd(s)
>             1 pool(s) nearfull
>             Degraded data redundancy: 360047/153249771 objects degraded (0.235%), 78 pgs degraded, 81 pgs undersized
>             Degraded data redundancy (low space): 265 pgs backfill_toofull
>             3 pools have too many placement groups
>
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 109m)
>     mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 368 osds: 368 up, 367 in; 398 remapped pgs
>          flags nobackfill,norebalance,norecover
>
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.08M objects, 195 TiB
>     usage:   590 TiB used, 562 TiB / 1.1 PiB avail
>     pgs:     360047/153249771 objects degraded (0.235%)
>              1998603/153249771 objects misplaced (1.304%)
>              8469 active+clean
>              124  active+remapped+backfill_toofull
>              83   active+remapped+backfill_wait+backfill_toofull
>              77   active+remapped+backfill_wait
>              45   active+undersized+degraded+remapped+backfill_toofull
>              33   active+remapped+backfilling
>              13   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>              11   active+undersized+degraded+remapped+backfilling
>              4    active+undersized+degraded+remapped+backfill_wait
>              4    active+recovery_wait+undersized+degraded+remapped
>              3    active+recovering+undersized+remapped
>              1    active+recovering+undersized+degraded+remapped
>              1    active+recovering
>
>   io:
>     client: 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
>
>   progress:
>     Rebalancing after osd.9 marked out
>       [====================..........]
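For reference, whether the hanging KVM/LXC instances are really blocked by Ceph can be narrowed down by listing the problem PGs per pool. A minimal sketch, assuming the pool names hdd and hdb_backup from the rados df output further down in this thread; run it on any admin/monitor node:

    # degraded PGs that belong to the RBD pool used by the instances
    ceph pg ls-by-pool hdd degraded
    # inactive PGs block client I/O entirely, so list them cluster-wide
    ceph pg dump_stuck inactive
    # recovery/backfill summary for the pool
    ceph osd pool stats hdd

If none of the degraded or inactive PGs map to pool hdd, the hanging boots are more likely caused by the slow requests on the hdd OSDs than by unavailable data.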
>
> Two days earlier the output was:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem has a failed mds daemon
>             1 filesystem is offline
>             insufficient standby MDS daemons available
>             nobackfill,norebalance,norecover flag(s) set
>             2 backfillfull osd(s)
>             86 nearfull osd(s)
>             1 pool(s) backfillfull
>             Reduced data availability: 75 pgs inactive, 74 pgs peering
>             Degraded data redundancy: 364117/154942251 objects degraded (0.235%), 76 pgs degraded, 76 pgs undersized
>             Degraded data redundancy (low space): 309 pgs backfill_toofull
>             3 pools have too many placement groups
>             105 slow requests are blocked > 32 sec
>             91 stuck requests are blocked > 4096 sec
>
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 9h)
>     mgr: ld5505(active, since 9h), standbys: ld5506, ld5507
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 368 osds: 368 up, 367 in; 400 remapped pgs
>          flags nobackfill,norebalance,norecover
>
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.65M objects, 197 TiB
>     usage:   596 TiB used, 554 TiB / 1.1 PiB avail
>     pgs:     0.011% pgs unknown
>              0.834% pgs not active
>              364117/154942251 objects degraded (0.235%)
>              2003579/154942251 objects misplaced (1.293%)
>              8395 active+clean
>              209  active+remapped+backfill_toofull
>              69   active+undersized+degraded+remapped+backfill_toofull
>              49   active+remapped+backfill_wait
>              37   peering
>              37   remapped+peering
>              31   active+remapped+backfill_wait+backfill_toofull
>              17   active+clean+scrubbing+deep
>              14   active+clean+scrubbing
>              4    active+undersized+degraded
>              2    active+undersized+degraded+remapped+backfill_wait
>              1    unknown
>              1    active+clean+remapped
>              1    active+remapped+backfilling
>              1    active+undersized+degraded+remapped+backfilling
>
>   io:
>     client: 256 MiB/s wr, 0 op/s rd, 99 op/s wr
>
>   progress:
>     Rebalancing after osd.9 marked out
>       [==............................]
>
>
> As you can see, there is very little progress in "Rebalancing after osd.9
> marked out"; the number of objects degraded dropped to 360047/153249771
> and the number of objects misplaced to 1998603/153249771.
>
> That is very little progress for two days with no activity on the cluster
> (over the weekend).
>
>
>
> On 21.09.2019 at 20:39, Paul Emmerich wrote:
> > On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
> >> Hello,
> >>
> >> I have re-created the OSDs using these disks.
> >> Can I still export the affected PGs manually?
> > No, the data probably cannot be recovered in this case :(
> > (It might still be somewhere on the disk if it hasn't been overwritten
> > yet, but it's virtually impossible to recover it: the metadata has
> > almost certainly long been overwritten)
> >
> > But that's only for the 17 PGs showing up as unknown. The ones stuck in
> > peering can probably be revived, but without the latest writes. Can you
> > run "ceph pg X.YX query" on one of the PGs stuck in peering?
> > It might tell you what's wrong and how to proceed.
> >
> > But even 17 PGs will probably affect almost all of your VMs/containers...
> >
> > Paul
> >
> >> Regards
> >> Thomas
> >>
> >>
> >> On 20.09.19 at 21:15, Paul Emmerich wrote:
> >>> On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> I cannot get rid of
> >>>>     pgs unknown
> >>>> because there were 3 disks that couldn't be started.
> >>>> Therefore I destroyed the relevant OSDs and re-created them for the
> >>>> relevant disks.
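As a concrete version of the check Paul suggests above: list the PGs that are stuck inactive/peering and query one of them. The PG id below (1.2f3) is only a placeholder; substitute a real id from the first command:

    # PGs that are not active (includes peering and unknown PGs)
    ceph pg dump_stuck inactive
    # query one stuck PG; 1.2f3 is a made-up example id
    ceph pg 1.2f3 query

In the query output, the recovery_state section usually explains the blocker; entries such as peering_blocked_by or down_osds_we_would_probe point at the OSDs the PG is waiting for.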
> >>> and you had it configured to run with replica 3? Well, I guess the
> >>> down PGs were located on these three disks that you wiped.
> >>>
> >>> Do you still have the disks? Use ceph-objectstore-tool to export the
> >>> affected PGs manually and inject them into another OSD.
> >>>
> >>>
> >>> Paul
> >>>
> >>>> Then I added the 3 OSDs to the crushmap.
> >>>>
> >>>> Regards
> >>>> Thomas
> >>>>
> >>>> On 20.09.2019 at 08:19, Ashley Merrick wrote:
> >>>>> You need to fix this first:
> >>>>>
> >>>>>     pgs: 0.056% pgs unknown
> >>>>>          0.553% pgs not active
> >>>>>
> >>>>> The backfilling will cause slow I/O, but having PGs unknown and not
> >>>>> active will cause I/O blocking, which is what you're seeing with the
> >>>>> VMs not booting.
> >>>>>
> >>>>> It seems you have 4 OSDs down; if you get them back online you should
> >>>>> be able to get all the PGs online.
> >>>>>
> >>>>>
> >>>>> ---- On Fri, 20 Sep 2019 14:14:01 +0800 *Thomas <74cmonty@xxxxxxxxx>* wrote ----
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> here I describe one of the two major issues I'm currently facing in my
> >>>>> 8-node ceph cluster (2x MDS, 6x OSD).
> >>>>>
> >>>>> The issue is that I cannot start any KVM virtual machine or LXC
> >>>>> container; the boot process just hangs after a few seconds.
> >>>>> All these KVMs and LXCs have in common that their virtual disks reside
> >>>>> in the same pool: hdd
> >>>>>
> >>>>> This pool hdd is relatively small compared to the largest pool:
> >>>>> hdb_backup
> >>>>> root@ld3955:~# rados df
> >>>>> POOL_NAME              USED  OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS      WR  USED COMPR  UNDER COMPR
> >>>>> backup                  0 B        0       0          0                   0        0         0          0      0 B          0     0 B         0 B          0 B
> >>>>> hdb_backup          589 TiB 51262212       0  153786636                   0        0    124895   12266095  4.3 TiB  247132863 463 TiB         0 B          0 B
> >>>>> hdd                 3.2 TiB   281884    6568     845652                   0        0      1658  275277357   16 TiB  208213922  10 TiB         0 B          0 B
> >>>>> pve_cephfs_data     955 GiB    91832       0     275496                   0        0      3038       2103 1021 MiB     102170 318 GiB         0 B          0 B
> >>>>> pve_cephfs_metadata 486 MiB       62       0        186                   0        0         7        860  1.4 GiB      12393 166 MiB         0 B          0 B
> >>>>>
> >>>>> total_objects    51635990
> >>>>> total_used       597 TiB
> >>>>> total_avail      522 TiB
> >>>>> total_space      1.1 PiB
> >>>>>
> >>>>> This is the current health status of the ceph cluster:
> >>>>>   cluster:
> >>>>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
> >>>>>     health: HEALTH_ERR
> >>>>>             1 filesystem is degraded
> >>>>>             1 MDSs report slow metadata IOs
> >>>>>             1 backfillfull osd(s)
> >>>>>             87 nearfull osd(s)
> >>>>>             1 pool(s) backfillfull
> >>>>>             Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
> >>>>>             Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
> >>>>>             Degraded data redundancy (low space): 322 pgs backfill_toofull
> >>>>>             1 subtrees have overcommitted pool target_size_bytes
> >>>>>             1 subtrees have overcommitted pool target_size_ratio
> >>>>>             1 pools have too many placement groups
> >>>>>             21 slow requests are blocked > 32 sec
> >>>>>
> >>>>>   services:
> >>>>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
> >>>>>     mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
> >>>>>     mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
> >>>>>     osd: 360 osds: 356 up, 356 in; 382 remapped pgs
> >>>>>
> >>>>>   data:
> >>>>>     pools:   5 pools, 8868 pgs
> >>>>>     objects: 51.64M objects, 197 TiB
> >>>>>     usage:   597 TiB used, 522 TiB / 1.1 PiB avail
> >>>>>     pgs:     0.056% pgs unknown
> >>>>>              0.553% pgs not active
> >>>>>              129598/154907946 objects degraded (0.084%)
> >>>>>              2211119/154907946 objects misplaced (1.427%)
> >>>>>              8458 active+clean
> >>>>>              298  active+remapped+backfill_toofull
> >>>>>              29   remapped+peering
> >>>>>              24   active+undersized+degraded+remapped+backfill_toofull
> >>>>>              22   active+remapped+backfill_wait
> >>>>>              17   peering
> >>>>>              5    unknown
> >>>>>              5    active+recovery_wait+undersized+degraded+remapped
> >>>>>              3    active+undersized+degraded+remapped+backfill_wait
> >>>>>              2    activating+remapped
> >>>>>              1    active+clean+remapped
> >>>>>              1    stale+peering
> >>>>>              1    active+remapped+backfilling
> >>>>>              1    active+recovering+undersized+remapped
> >>>>>              1    active+recovery_wait+degraded
> >>>>>
> >>>>>   io:
> >>>>>     client: 9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
> >>>>>
> >>>>> I believe the cluster is busy with rebalancing pool hdb_backup.
> >>>>> I set the balancer mode to upmap recently, after the 589 TiB of data
> >>>>> had been written.
> >>>>> root@ld3955:~# ceph balancer status
> >>>>> {
> >>>>>     "active": true,
> >>>>>     "plans": [],
> >>>>>     "mode": "upmap"
> >>>>> }
> >>>>>
> >>>>>
> >>>>> In order to resolve the issue with pool hdd I started some
> >>>>> investigation.
> >>>>> The first step was to install the drivers for the NIC provided by
> >>>>> Mellanox.
> >>>>> Then I configured some kernel parameters recommended by Mellanox
> >>>>> <https://community.mellanox.com/s/article/linux-sysctl-tuning>.
> >>>>>
> >>>>> However, this didn't fix the issue.
> >>>>> In my opinion I must get rid of all the "slow requests are blocked"
> >>>>> messages.
> >>>>>
> >>>>> When I check the output of ceph health detail, every OSD listed under
> >>>>> REQUEST_SLOW belongs to pool hdd.
> >>>>> This means that none of the disks belonging to pool hdb_backup shows
> >>>>> comparable behaviour.
> >>>>>
> >>>>> Then I checked the running processes on the different OSD nodes; I use
> >>>>> the tool "glances" here.
> >>>>> Here I can see individual processes that have been running for hours
> >>>>> and are consuming a lot of CPU, e.g.
> >>>>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33 58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
> >>>>> 34.2 0.2 4.31G 1.20G 971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
> >>>>>
> >>>>> Similar processes are running on 4 OSD nodes.
> >>>>> All these processes have in common that the relevant OSD belongs to
> >>>>> pool hdd.
> >>>>>
> >>>>> Furthermore, glances gives me this alert:
> >>>>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd, ceph-osd, ceph-osd
> >>>>>
> >>>>> What can / should I do now?
> >>>>> Kill the long-running processes?
> >>>>> Stop the relevant OSDs?
> >>>>>
> >>>>> Please advise.
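Rather than killing the busy ceph-osd processes, it is usually more informative to look at what their slow requests are waiting on first. A sketch, using osd.37 from the glances output above; the ceph daemon commands have to be run on the node that hosts that OSD:

    # operations currently blocked / in flight on this OSD
    ceph daemon osd.37 dump_blocked_ops
    ceph daemon osd.37 dump_ops_in_flight
    # cluster-wide view of which OSDs the slow requests point at
    ceph health detail

Stopping or killing an OSD daemon only helps if it is genuinely hung; otherwise it just adds more degraded PGs to an already nearfull cluster.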
> >>>>>
> >>>>> THX
> >>>>> Thomas

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx