Hello,
I run a Ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
lost two disks, so two OSDs (67, 90) are down. The two disks sit in two
different hosts. A third OSD on a third host reports slow ops. Ceph is
repairing at the moment.
The affected pools include, for example, these two:
pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor 0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 128 target_size_ratio 0.0001 application rbd
pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd
At the moment the Proxmox cluster that uses storage from this separate
Ceph cluster hangs. The pools holding the data are erasure coded with
the following profile:
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
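
For completeness, this is how I would double-check min_size on the two
pools shown above (I hope I have the commands right):

  ceph osd pool get pxa-rbd min_size   # replicated pool, size 3
  ceph osd pool get pxa-ec min_size    # EC pool, k+m = 6 shards per PG

If I read the pool listing correctly, pxa-ec already shows min_size 5,
so a PG of that pool could only tolerate one missing shard.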
What I do not understand is why access from the virtualization layer
seems to block. Could the min_size of the pools be causing this
behaviour? How can I find out whether that is true, or what else is
causing the blocking I am seeing?
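
To narrow this down I was planning to look at the inactive/incomplete
PG directly, roughly like this (the PG id 36.xx is only a placeholder;
I would take the real id from the health output):

  ceph health detail    # names the incomplete PG and the slow-op OSDs
  ceph pg 36.xx query   # placeholder id; shows the peering state and
                        # which OSDs the PG is still waiting for

Would that show me whether the PG is blocked because it has fallen
below min_size?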
This is the current status:
  health: HEALTH_WARN
          Reduced data availability: 1 pg inactive, 1 pg incomplete
          Degraded data redundancy: 42384/130014984 objects degraded (0.033%), 4 pgs degraded, 5 pgs undersized
          15 daemons have recently crashed
          150 slow ops, oldest one blocked for 15901 sec, daemons [osd.60,osd.67] have slow ops.

  services:
    mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
    mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
    mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
    osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs

  task status:
    scrub status:
        mds.ceph6: idle

  data:
    pools:   15 pools, 2632 pgs
    objects: 21.70M objects, 80 TiB
    usage:   139 TiB used, 378 TiB / 517 TiB avail
    pgs:     0.038% pgs not active
             42384/130014984 objects degraded (0.033%)
             2623 active+clean
                3 active+undersized+degraded+remapped+backfilling
                3 active+clean+scrubbing+deep
                1 active+undersized+degraded+remapped+backfill_wait
                1 active+undersized+remapped+backfill_wait
                1 remapped+incomplete

  io:
    client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
    recovery: 51 MiB/s, 12 objects/s
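
Given the one remapped+incomplete PG and the slow ops on osd.60/osd.67,
I also thought of checking the following, mostly to see what those 150
slow ops are actually waiting for (osd.67 is one of the dead ones, so
its admin socket is probably gone):

  ceph pg dump_stuck inactive             # the pool-id prefix of the PG id tells
                                          # me whether it belongs to pxa-ec (pool 36)
  ceph pg ls-by-pool pxa-ec incomplete    # incomplete PGs in the EC pool
  ceph daemon osd.60 dump_ops_in_flight   # run on the host where osd.60 lives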
Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312