Have you been able to start the VMs now?
Are you using RBD, or are the VMs hosted on CephFS?
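If you are not sure, something along these lines should show it (pool and
filesystem names are taken from your earlier mails, adjust as needed):

ceph osd pool ls detail
rbd ls -p hdd
ceph fs ls

RBD-backed VM disks show up as images in the pool, while CephFS-backed disks
live on the pve_cephfs data pool.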
---- On Mon, 23 Sep 2019 14:16:47 +0800 Thomas Schneider <74cmonty@xxxxxxxxx> wrote ----
Hi,
Currently ceph -s is not reporting any unknown PGs.
The following flags are set: nobackfill, norebalance, norecover.
There are no PGs stuck in peering either, and there is very little traffic
on the Ceph network.
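For reference, stuck or inactive PGs can be double-checked with commands
along these lines (output will obviously vary):

ceph health detail
ceph pg dump_stuck inactive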
This is the output of today:
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem has a failed mds daemon
1 filesystem is offline
insufficient standby MDS daemons available
nobackfill,norebalance,norecover flag(s) set
83 nearfull osd(s)
1 pool(s) nearfull
Degraded data redundancy: 360047/153249771 objects degraded (0.235%), 78 pgs degraded, 81 pgs undersized
Degraded data redundancy (low space): 265 pgs backfill_toofull
3 pools have too many placement groups
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 109m)
mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
mds: pve_cephfs:0/1, 1 failed
osd: 368 osds: 368 up, 367 in; 398 remapped pgs
flags nobackfill,norebalance,norecover
data:
pools: 5 pools, 8868 pgs
objects: 51.08M objects, 195 TiB
usage: 590 TiB used, 562 TiB / 1.1 PiB avail
pgs: 360047/153249771 objects degraded (0.235%)
1998603/153249771 objects misplaced (1.304%)
8469 active+clean
124 active+remapped+backfill_toofull
83 active+remapped+backfill_wait+backfill_toofull
77 active+remapped+backfill_wait
45 active+undersized+degraded+remapped+backfill_toofull
33 active+remapped+backfilling
13 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
11 active+undersized+degraded+remapped+backfilling
4 active+undersized+degraded+remapped+backfill_wait
4 active+recovery_wait+undersized+degraded+remapped
3 active+recovering+undersized+remapped
1 active+recovering+undersized+degraded+remapped
1 active+recovering
io:
client: 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
progress:
Rebalancing after osd.9 marked out
[====================..........]
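(Side note on the offline filesystem: with "mds: pve_cephfs:0/1, 1 failed"
and no standbys, CephFS will stay down until an MDS daemon is running again.
A rough sketch, assuming the MDS lives on ld3955 as in earlier output and the
cluster uses systemd units:

systemctl status ceph-mds@ld3955
systemctl restart ceph-mds@ld3955
ceph fs status pve_cephfs

Once a standby registers, it should take over rank 0 automatically.)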
Two days earlier, the output was:
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem has a failed mds daemon
1 filesystem is offline
insufficient standby MDS daemons available
nobackfill,norebalance,norecover flag(s) set
2 backfillfull osd(s)
86 nearfull osd(s)
1 pool(s) backfillfull
Reduced data availability: 75 pgs inactive, 74 pgs peering
Degraded data redundancy: 364117/154942251 objects degraded (0.235%), 76 pgs degraded, 76 pgs undersized
Degraded data redundancy (low space): 309 pgs backfill_toofull
3 pools have too many placement groups
105 slow requests are blocked > 32 sec
91 stuck requests are blocked > 4096 sec
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 9h)
mgr: ld5505(active, since 9h), standbys: ld5506, ld5507
mds: pve_cephfs:0/1, 1 failed
osd: 368 osds: 368 up, 367 in; 400 remapped pgs
flags nobackfill,norebalance,norecover
data:
pools: 5 pools, 8868 pgs
objects: 51.65M objects, 197 TiB
usage: 596 TiB used, 554 TiB / 1.1 PiB avail
pgs: 0.011% pgs unknown
0.834% pgs not active
364117/154942251 objects degraded (0.235%)
2003579/154942251 objects misplaced (1.293%)
8395 active+clean
209 active+remapped+backfill_toofull
69 active+undersized+degraded+remapped+backfill_toofull
49 active+remapped+backfill_wait
37 peering
37 remapped+peering
31 active+remapped+backfill_wait+backfill_toofull
17 active+clean+scrubbing+deep
14 active+clean+scrubbing
4 active+undersized+degraded
2 active+undersized+degraded+remapped+backfill_wait
1 unknown
1 active+clean+remapped
1 active+remapped+backfilling
1 active+undersized+degraded+remapped+backfilling
io:
client: 256 MiB/s wr, 0 op/s rd, 99 op/s wr
progress:
Rebalancing after osd.9 marked out
[==............................]
As you can see, there is very little progress in "Rebalancing after osd.9
marked out": the number of degraded objects dropped only to 360047/153249771,
and the number of misplaced objects to 1998603/153249771.
That is very little movement over two days with no client activity (over
the weekend) on the cluster.
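One thing worth noting: with nobackfill, norebalance and norecover set,
recovery and backfill are deliberately paused, so little movement is expected
until the flags are cleared. A minimal sketch for when backfill should
actually be allowed to resume:

ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nobackfill

The backfill_toofull PGs will also stay put while so many OSDs are nearfull;
freeing or adding capacity, or carefully and temporarily raising the ratios
via ceph osd set-backfillfull-ratio / ceph osd set-nearfull-ratio, is usually
needed as well.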
On 21.09.2019 at 20:39, Paul Emmerich wrote:
> On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
>> Hello,
>>
>> I have re-created the OSDs using these disks.
>> Can I still export the affected PGs manually?
> No, the data probably cannot be recovered in this case :(
> (It might still be somewhere on the disk if it hasn't been overwritten
> yet, but it's virtually impossible to recover it: the metadata has
> almost certainly long been overwritten)
>
> But that's only for 17 PGs showing up as unknown. The ones stuck in
> peering can probably be revived but without the latest writes. Can you
> run "ceph pg X.YX query" on one of the PGs stuck in peering?
> It might tell you what's wrong and how to proceed.
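> As a concrete (hypothetical) example, with a PG id taken from the health
> output:
>
> ceph pg 1.2f3 query | jq '.recovery_state'
>
> The "blocked_by" and "down_osds_we_would_probe" entries in there usually
> name the OSDs the PG is waiting for (jq is optional, it just trims the
> output).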
>
> But even 17 PGs will probably affect almost all of your VMs/containers...
>
> Paul
>
>> Regards
>> Thomas
>>
>>
>> On 20.09.19 at 21:15, Paul Emmerich wrote:
>>> On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> I cannot get rid of
>>>> pgs unknown
>>>> because there were 3 disks that couldn't be started.
>>>> Therefore I destroyed the relevant OSDs and re-created them on those
>>>> disks.
>>> and you had it configured to run with replica 3? Well, I guess the
>>> down PGs were located on these three disks that you wiped.
>>>
>>> Do you still have the disks? Use ceph-objectstore-tool to export the
>>> affected PGs manually and inject them into another OSD.
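>>> A rough sketch of that procedure (OSD ids, PG id and paths are
>>> placeholders; the source and target OSD daemons must be stopped while
>>> ceph-objectstore-tool runs):
>>>
>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<old-id> \
>>>     --pgid <pg-id> --op export --file /tmp/<pg-id>.export
>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target-id> \
>>>     --op import --file /tmp/<pg-id>.export
>>>
>>> After restarting the target OSD, the cluster should pick the copy up and
>>> peer it.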
>>>
>>>
>>> Paul
>>>
>>>> Then I added the 3 OSDs to crushmap.
>>>>
>>>> Regards
>>>> Thomas
>>>>
>>>> On 20.09.2019 at 08:19, Ashley Merrick wrote:
>>>>> You need to fix this first.
>>>>>
>>>>> pgs: 0.056% pgs unknown
>>>>> 0.553% pgs not active
>>>>>
>>>>> The backfilling will cause slow I/O, but having PGs unknown and not
>>>>> active will cause I/O blocking, which is what you're seeing with the VMs booting.
>>>>>
>>>>> It seems you have 4 OSDs down; if you get them back online you should be
>>>>> able to get all the PGs online.
>>>>>
>>>>>
---- On Fri, 20 Sep 2019 14:14:01 +0800 Thomas <74cmonty@xxxxxxxxx> wrote ----
>>>>>
>>>>> Hi,
>>>>>
>>>>> here I describe one of the two major issues I'm currently facing in my
>>>>> 8-node Ceph cluster (2x MDS, 6x OSD).
>>>>>
>>>>> The issue is that I cannot start any virtual machine (KVM) or container
>>>>> (LXC); the boot process just hangs after a few seconds.
>>>>> All these KVMs and LXCs have in common that their virtual disks reside
>>>>> in the same pool: hdd.
>>>>>
>>>>> This pool hdd is relatively small compared to the largest pool:
>>>>> hdb_backup
>>>>> root@ld3955:~# rados df
>>>>> POOL_NAME            USED     OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD        WR_OPS     WR       USED COMPR  UNDER COMPR
>>>>> backup               0 B      0         0       0          0                   0        0         0          0 B       0          0 B      0 B         0 B
>>>>> hdb_backup           589 TiB  51262212  0       153786636  0                   0        124895    12266095   4.3 TiB   247132863  463 TiB  0 B         0 B
>>>>> hdd                  3.2 TiB  281884    6568    845652     0                   0        1658      275277357  16 TiB    208213922  10 TiB   0 B         0 B
>>>>> pve_cephfs_data      955 GiB  91832     0       275496     0                   0        3038      2103       1021 MiB  102170     318 GiB  0 B         0 B
>>>>> pve_cephfs_metadata  486 MiB  62        0       186        0                   0        7         860        1.4 GiB   12393      166 MiB  0 B         0 B
>>>>>
>>>>> total_objects 51635990
>>>>> total_used 597 TiB
>>>>> total_avail 522 TiB
>>>>> total_space 1.1 PiB
>>>>>
>>>>> This is the current health status of the ceph cluster:
>>>>> cluster:
>>>>> id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>>>> health: HEALTH_ERR
>>>>> 1 filesystem is degraded
>>>>> 1 MDSs report slow metadata IOs
>>>>> 1 backfillfull osd(s)
>>>>> 87 nearfull osd(s)
>>>>> 1 pool(s) backfillfull
>>>>> Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
>>>>> Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
>>>>> Degraded data redundancy (low space): 322 pgs backfill_toofull
>>>>> 1 subtrees have overcommitted pool target_size_bytes
>>>>> 1 subtrees have overcommitted pool target_size_ratio
>>>>> 1 pools have too many placement groups
>>>>> 21 slow requests are blocked > 32 sec
>>>>>
>>>>> services:
>>>>> mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
>>>>> mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
>>>>> mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
>>>>> osd: 360 osds: 356 up, 356 in; 382 remapped pgs
>>>>>
>>>>> data:
>>>>> pools: 5 pools, 8868 pgs
>>>>> objects: 51.64M objects, 197 TiB
>>>>> usage: 597 TiB used, 522 TiB / 1.1 PiB avail
>>>>> pgs: 0.056% pgs unknown
>>>>> 0.553% pgs not active
>>>>> 129598/154907946 objects degraded (0.084%)
>>>>> 2211119/154907946 objects misplaced (1.427%)
>>>>> 8458 active+clean
>>>>> 298 active+remapped+backfill_toofull
>>>>> 29 remapped+peering
>>>>> 24 active+undersized+degraded+remapped+backfill_toofull
>>>>> 22 active+remapped+backfill_wait
>>>>> 17 peering
>>>>> 5 unknown
>>>>> 5 active+recovery_wait+undersized+degraded+remapped
>>>>> 3 active+undersized+degraded+remapped+backfill_wait
>>>>> 2 activating+remapped
>>>>> 1 active+clean+remapped
>>>>> 1 stale+peering
>>>>> 1 active+remapped+backfilling
>>>>> 1 active+recovering+undersized+remapped
>>>>> 1 active+recovery_wait+degraded
>>>>>
>>>>> io:
>>>>> client: 9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
>>>>>
>>>>> I believe the cluster is busy rebalancing pool hdb_backup.
>>>>> I set the balancer mode to upmap recently, after the 589 TiB of data had
>>>>> been written.
>>>>> root@ld3955:~# ceph balancer status
>>>>> {
>>>>> "active": true,
>>>>> "plans": [],
>>>>> "mode": "upmap"
>>>>> }
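>>>>> (For reference, how uneven the distribution currently looks from the
>>>>> balancer's point of view can be checked with e.g.:
>>>>>
>>>>> ceph balancer eval
>>>>> ceph osd df tree
>>>>>
>>>>> Note that upmap only optimizes placement; the resulting moves still go
>>>>> through normal backfill.)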
>>>>>
>>>>>
>>>>> In order to resolve the issue with pool hdd I started some investigation.
>>>>> The first step was to install the drivers for the NIC provided by Mellanox.
>>>>> Then I configured some kernel parameters recommended by Mellanox
>>>>> <https://community.mellanox.com/s/article/linux-sysctl-tuning>.
>>>>>
>>>>> However, this didn't fix the issue.
>>>>> In my opinion I must get rid of all the "slow requests are blocked" warnings.
>>>>>
>>>>> When I check the output of ceph health detail, every OSD listed under
>>>>> REQUEST_SLOW belongs to pool hdd.
>>>>> This means none of the disks belonging to pool hdb_backup is showing
>>>>> comparable behaviour.
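>>>>> (One way to double-check that mapping, with osd.37 just as an example
>>>>> id:
>>>>>
>>>>> ceph health detail | grep -i slow
>>>>> ceph pg ls-by-osd osd.37
>>>>>
>>>>> The number before the dot in each PG id is the pool id, which
>>>>> "ceph osd pool ls detail" maps back to a pool name.)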
>>>>>
>>>>> Then I checked the running processes on the different OSD nodes; I use
>>>>> the tool "glances" here.
>>>>> Here I can see individual ceph-osd processes that have been running for
>>>>> hours and are consuming a lot of CPU, e.g.
>>>>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33 58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
>>>>> 34.2 0.2 4.31G 1.20G 971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
>>>>>
>>>>> Similar processes are running on 4 OSD nodes.
>>>>> All processes have in common that the relevant OSD belongs to pool hdd.
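>>>>> (What such an OSD is actually busy with can be inspected via its admin
>>>>> socket on the node, e.g. for osd.37 from the process list above:
>>>>>
>>>>> ceph daemon osd.37 dump_ops_in_flight
>>>>> ceph daemon osd.37 dump_historic_ops
>>>>> )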
>>>>>
>>>>> Furthermore, glances gives me this alert:
>>>>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd, ceph-osd, ceph-osd
>>>>>
>>>>> What can / should I do now?
>>>>> Kill the long-running processes?
>>>>> Stop the relevant OSDs?
>>>>>
>>>>> Please advise.
>>>>>
>>>>> THX
>>>>> Thomas
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx