Hello,
I run a Ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
lost two disks, so two OSDs (67, 90) are down. The two disks sit in two
different hosts. A third OSD on a third host reports slow ops. Ceph is
repairing at the moment.
The affected pools include, for example, these two:
pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor 0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 128 target_size_ratio 0.0001 application rbd
pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd
At the moment the Proxmox cluster that uses storage from this separate
Ceph cluster hangs. The pools holding the data are erasure coded with
the following profile:
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
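
For completeness, this is how I would double-check min_size on the two
pools shown above (I hope I have the commands right):

  ceph osd pool get pxa-rbd min_size   # replicated pool, size 3
  ceph osd pool get pxa-ec min_size    # EC pool, k+m = 6 shards per PG

If I read the pool listing correctly, pxa-ec already shows min_size 5,
so a PG of that pool could only tolerate one missing shard.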
What I do not understand is why access from the virtualization layer
seems to block. Could the min_size of the pools be causing this
behaviour? How can I find out whether that is true, or what else is
causing the blocking I am seeing?
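
To narrow this down I was planning to look at the inactive/incomplete
PG directly, roughly like this (the PG id 36.xx is only a placeholder;
I would take the real id from the health output):

  ceph health detail    # names the incomplete PG and the slow-op OSDs
  ceph pg 36.xx query   # placeholder id; shows the peering state and
                        # which OSDs the PG is still waiting for

Would that show me whether the PG is blocked because it has fallen
below min_size?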
This is the current status:
  health: HEALTH_WARN
          Reduced data availability: 1 pg inactive, 1 pg incomplete
          Degraded data redundancy: 42384/130014984 objects degraded (0.033%), 4 pgs degraded, 5 pgs undersized
          15 daemons have recently crashed
          150 slow ops, oldest one blocked for 15901 sec, daemons [osd.60,osd.67] have slow ops.

  services:
    mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
    mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
    mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
    osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs

  task status:
    scrub status:
        mds.ceph6: idle

  data:
    pools:   15 pools, 2632 pgs
    objects: 21.70M objects, 80 TiB
    usage:   139 TiB used, 378 TiB / 517 TiB avail
    pgs:     0.038% pgs not active
             42384/130014984 objects degraded (0.033%)
             2623 active+clean
                3 active+undersized+degraded+remapped+backfilling
                3 active+clean+scrubbing+deep
                1 active+undersized+degraded+remapped+backfill_wait
                1 active+undersized+remapped+backfill_wait
                1 remapped+incomplete

  io:
    client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
    recovery: 51 MiB/s, 12 objects/s
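
Given the one remapped+incomplete PG and the slow ops on osd.60/osd.67,
I also thought of checking the following, mostly to see what those 150
slow ops are actually waiting for (osd.67 is one of the dead ones, so
its admin socket is probably gone):

  ceph pg dump_stuck inactive             # the pool-id prefix of the PG id tells
                                          # me whether it belongs to pxa-ec (pool 36)
  ceph pg ls-by-pool pxa-ec incomplete    # incomplete PGs in the EC pool
  ceph daemon osd.60 dump_ops_in_flight   # run on the host where osd.60 lives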
Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312