Hi, this is odd. The problem with recovery when sufficiently many but fewer than min_size shards are present should have been resolved by osd_allow_recovery_below_min_size=true. It is really dangerous to reduce min_size below k+1 and, in fact, doing so should never be necessary for recovery. Can you check whether this option is present and set to true? If it is not working as intended, a tracker ticket might be in order.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Rainer Krienke <krienke@xxxxxxxxxxxxxx>
Sent: 30 March 2021 13:05:56
To: Eugen Block; ceph-users@xxxxxxx
Subject: Re: ceph Nautilus lost two disk over night everything hangs

Hello,

yes, your assumptions are correct: pxa-rbd is the metadata pool for pxa-ec, which uses an erasure coding 4+2 profile.

In the last hours ceph repaired most of the damage. One inactive PG remained, and 'ceph health detail' then told me:

---------
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 15 daemons have recently crashed; 150 slow ops, oldest one blocked for 26716 sec, daemons [osd.60,osd.67] have slow ops.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 36.15b is remapped+incomplete, acting [60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size from 5 may help; search ceph.com/docs for 'incomplete')
RECENT_CRASH 15 daemons have recently crashed
    osd.90 crashed on host ceph6 at 2021-03-29 21:14:10.442314Z
    osd.67 crashed on host ceph5 at 2021-03-30 02:21:23.944205Z
    osd.67 crashed on host ceph5 at 2021-03-30 01:39:14.452610Z
    osd.90 crashed on host ceph6 at 2021-03-29 21:14:24.222223Z
    osd.67 crashed on host ceph5 at 2021-03-30 02:35:43.373845Z
    osd.67 crashed on host ceph5 at 2021-03-30 01:19:58.762393Z
    osd.67 crashed on host ceph5 at 2021-03-30 02:09:42.297941Z
    osd.67 crashed on host ceph5 at 2021-03-30 02:28:29.981528Z
    osd.67 crashed on host ceph5 at 2021-03-30 01:50:05.374278Z
    osd.90 crashed on host ceph6 at 2021-03-29 21:13:51.896849Z
    osd.67 crashed on host ceph5 at 2021-03-30 02:00:22.593745Z
    osd.67 crashed on host ceph5 at 2021-03-30 01:29:39.170134Z
    osd.90 crashed on host ceph6 at 2021-03-29 21:14:38.114768Z
    osd.67 crashed on host ceph5 at 2021-03-30 00:54:06.629808Z
    osd.67 crashed on host ceph5 at 2021-03-30 01:10:21.824447Z
---------

All OSDs except 67 and 90 are up. I followed the hint in health detail and lowered min_size from 5 to 4 for pxa-ec. Since then ceph is repairing again, and in the meantime some VMs in the attached proxmox cluster are working again. So I hope that after the repair all PGs will be up, so that I can restart all VMs again.

Thanks
Rainer

On 30.03.21 at 11:41, Eugen Block wrote:
> Hi,
>
> from what you've sent, my conclusion about the stalled I/O would indeed
> be the min_size of the EC pool.
> There's only one PG reported as incomplete; I assume that is the EC
> pool, not the replicated pxa-rbd, right? Both pools are for rbd, so I'm
> guessing the rbd headers are in pxa-rbd while the data is stored in
> pxa-ec, could you confirm that?
>
> You could add 'ceph health detail' output to your question to see which
> PG is incomplete.
> I assume that both down OSDs are in the acting set of the inactive PG,
> and since the pool's min_size is 5, the I/O pauses. If you can't wait for
> recovery to finish and can't bring up at least one of those OSDs, you
> could set the min_size of pxa-ec to 4, but if you do, be aware that one
> more disk failure could mean data loss!
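(For reference, a minimal sketch of the min_size change being discussed here, assuming the pool name pxa-ec and the 4+2 profile from this thread, so k+1 = 5; double-check against your own cluster before running anything:

    ceph osd pool get pxa-ec min_size      # confirm the current value (5 for k=4, m=2)
    ceph osd pool set pxa-ec min_size 4    # temporary: lets PGs serve I/O with only k shards
    # ... after recovery has finished and all PGs are active+clean again:
    ceph osd pool set pxa-ec min_size 5    # restore k+1

While min_size sits at k, any PG serving I/O with only four shards has no redundancy margin left, which is why the advice is to revert as soon as recovery completes.)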
> So think carefully about it (maybe you could instead speed up recovery?)
> and don't forget to increase min_size back to 5 when the recovery has
> finished, that's very important!
>
> Regards,
> Eugen
>
>
> Quoting Rainer Krienke <krienke@xxxxxxxxxxxxxx>:
>
>> Hello,
>>
>> I run a ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
>> lost two disks, so two OSDs (67, 90) are down. The two disks are on two
>> different hosts. A third OSD on a third host reports slow ops. Ceph
>> is repairing at the moment.
>>
>> Pools affected are e.g. these:
>>
>> pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor
>> 0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0
>> pg_num_min 128 target_size_ratio 0.0001 application rbd
>>
>> pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash
>> rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor
>> 0/172580/172578 flags hashpspool,ec_overwrites,selfmanaged_snaps
>> stripe_width 16384 pg_num_min 512 target_size_ratio 0.15 application rbd
>>
>> At the moment the proxmox cluster using storage from the separate
>> ceph cluster hangs. The pools with data are erasure coded with the
>> following profile:
>>
>> crush-device-class=
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=4
>> m=2
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> What I do not understand is why access from the virtualization side
>> seems to block. Could the min_size of the pools be causing this
>> behaviour? How can I find out whether this is the case, or what else is
>> causing the blocking behaviour I see?
>>
>> This is the current status:
>>
>>   health: HEALTH_WARN
>>           Reduced data availability: 1 pg inactive, 1 pg incomplete
>>           Degraded data redundancy: 42384/130014984 objects degraded
>>           (0.033%), 4 pgs degraded, 5 pgs undersized
>>           15 daemons have recently crashed
>>           150 slow ops, oldest one blocked for 15901 sec, daemons
>>           [osd.60,osd.67] have slow ops.
>>
>>   services:
>>     mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
>>     mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
>>     mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
>>     osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs
>>
>>   task status:
>>     scrub status:
>>       mds.ceph6: idle
>>
>>   data:
>>     pools:   15 pools, 2632 pgs
>>     objects: 21.70M objects, 80 TiB
>>     usage:   139 TiB used, 378 TiB / 517 TiB avail
>>     pgs:     0.038% pgs not active
>>              42384/130014984 objects degraded (0.033%)
>>              2623 active+clean
>>              3    active+undersized+degraded+remapped+backfilling
>>              3    active+clean+scrubbing+deep
>>              1    active+undersized+degraded+remapped+backfill_wait
>>              1    active+undersized+remapped+backfill_wait
>>              1    remapped+incomplete
>>
>>   io:
>>     client:   2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
>>     recovery: 51 MiB/s, 12 objects/s
>>
>> Thanks a lot
>> Rainer
>> --
>> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
>> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
>> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
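A rough sketch of how to check the option Frank mentions at the top of this thread, and how to dig into why the single PG stays incomplete, on a Nautilus cluster. This is a sketch, not a verified procedure: the OSD id (60) and PG id (36.15b) are taken from the output above, and <profile-name> is a placeholder for whatever EC profile pxa-ec actually uses:

    # is osd_allow_recovery_below_min_size known to this build, and is it true?
    ceph config get osd osd_allow_recovery_below_min_size

    # ask a running OSD directly (run on the host where that daemon lives)
    ceph daemon osd.60 config get osd_allow_recovery_below_min_size

    # confirm k and m of the profile backing pxa-ec
    ceph osd pool get pxa-ec erasure_code_profile
    ceph osd erasure-code-profile get <profile-name>

    # see which shards pg 36.15b is missing and what is blocking peering
    ceph pg 36.15b query

If the option is present and true, yet the remapped+incomplete PG still refuses to recover with four of six shards available, that would support opening the tracker ticket suggested above rather than lowering min_size below k+1.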