ceph Nautilus lost two disks overnight, everything hangs

Hello,

I run a Ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we lost two disks, so two OSDs (67, 90) are down. The two disks are on two different hosts. A third OSD on a third host reports slow ops. Ceph is repairing at the moment.

The affected pools include, for example, these two:
pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor 0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 128 target_size_ratio 0.0001 application rbd

pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd
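
For completeness, the two pool lines above look like the output of ceph osd pool ls detail; the min_size values of the affected pools can also be queried directly (plain Ceph CLI, pool names taken from above):

# full pool listing
ceph osd pool ls detail

# confirm min_size per pool
ceph osd pool get pxa-rbd min_size
ceph osd pool get pxa-ec min_size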

At the moment the Proxmox cluster that uses storage from this separate Ceph cluster hangs. The pools holding data are erasure coded with the following profile:

crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
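
(For reference, the profile assigned to pxa-ec can be looked up like this; <profile-name> is just a placeholder for whatever the first command returns:)

# which EC profile the pool uses, then dump that profile
ceph osd pool get pxa-ec erasure_code_profile
ceph osd erasure-code-profile get <profile-name>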

What I do not understand is why access from the virtualization side seems to block. Could the min_size of the pools be causing this behaviour? How can I find out whether this is true, or what else is causing the blocking behaviour I see?
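
My guess, which I would like to verify: with k=4, m=2 and min_size=5, a PG that had shards on both failed OSDs is left with only 4 shards, which is below min_size, so I/O to that PG would block until recovery brings back a fifth shard. If I understand the tooling correctly, this can be checked roughly as follows (<pgid> is a placeholder for the incomplete PG):

# which PG is inactive/incomplete, and which OSDs it maps to
ceph health detail
ceph pg dump_stuck inactive

# ask the stuck PG itself about its peering state and missing shards
ceph pg <pgid> query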

This is the current status:
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive, 1 pg incomplete
            Degraded data redundancy: 42384/130014984 objects degraded (0.033%), 4 pgs degraded, 5 pgs undersized
            15 daemons have recently crashed
            150 slow ops, oldest one blocked for 15901 sec, daemons [osd.60,osd.67] have slow ops.

  services:
    mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
    mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
    mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
    osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs

  task status:
    scrub status:
        mds.ceph6: idle

  data:
    pools:   15 pools, 2632 pgs
    objects: 21.70M objects, 80 TiB
    usage:   139 TiB used, 378 TiB / 517 TiB avail
    pgs:     0.038% pgs not active
             42384/130014984 objects degraded (0.033%)
             2623 active+clean
             3    active+undersized+degraded+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfill_wait
             1    active+undersized+remapped+backfill_wait
             1    remapped+incomplete

  io:
    client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
    recovery: 51 MiB/s, 12 objects/s
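
To dig into the slow ops and the recent daemon crashes I am looking at the following (the ceph daemon commands have to be run on the host that carries the respective OSD; <crash-id> is a placeholder for an entry from the crash list):

# requests currently stuck on the OSDs named in the health warning
ceph daemon osd.60 dump_ops_in_flight
ceph daemon osd.67 dump_ops_in_flight

# list and inspect the recently crashed daemons
ceph crash ls
ceph crash info <crash-id>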

Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


