Re: ceph Nautilus lost two disk over night everything hangs

Sorry about the flood of messages.

I forgot to mention this. Looking at the other replies, the fact that the PG in question remained at 4 out of 6 OSDs until you reduced min_size might indicate that peering was blocked for some reason and only completed after the reduction. If this was the order of events, it seems like an important detail.

It is true that recovery has to wait until the PG has the missing OSDs assigned. If this assignment is somehow blocked by min_size > k, then the flag osd_allow_recovery_below_min_size by itself will have no effect.
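
If this happens again, it might be worth checking where peering is stuck before touching min_size; a minimal check, with <pgid> and <pool> as placeholders for your setup, could be:

# ceph pg <pgid> query | grep -E 'state|blocked'
# ceph osd pool get <pool> min_size

The query output shows the PG's current state and, in the recovery_state section, whether and by what peering is blocked.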

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 30 March 2021 14:53:18
To: Rainer Krienke; Eugen Block; ceph-users@xxxxxxx
Subject: Re: Re: ceph Nautilus lost two disk over night everything hangs

Dear Rainer,

Hmm, maybe the option is ignored or not implemented properly. With this option set to true, the effect should be the same as reducing min_size *except* that new writes will not go to non-redundant storage. When min_size is reduced, a critically degraded PG will accept new writes; this is the danger of data loss mentioned before, and it is avoided if only recovery ops are allowed on such PGs.
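
If lowering min_size really turns out to be the only way to unblock recovery, I would do it only temporarily and restore it as soon as the PGs are active+clean again, e.g. for a 4+2 EC pool (pool name is a placeholder):

# ceph osd pool set <pool> min_size 4    # only while recovery is stuck
# ceph osd pool set <pool> min_size 5    # restore k+1 immediately afterwards

That keeps the window in which new writes can land on non-redundant storage as short as possible.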

Can you open a tracker ticket about your observation that reducing min_size was necessary and helped despite osd_allow_recovery_below_min_size=true?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Rainer Krienke <krienke@xxxxxxxxxxxxxx>
Sent: 30 March 2021 13:30:00
To: Frank Schilder; Eugen Block; ceph-users@xxxxxxx
Subject: Re: Re: ceph Nautilus lost two disk over night everything hangs

Hello Frank,

the option is actually set. On one of my monitors:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep osd_allow_recovery_below_min_size
     "osd_allow_recovery_below_min_size": "true",

Thank you very much
Rainer

Am 30.03.21 um 13:20 schrieb Frank Schilder:
> Hi, this is odd. The problem with recovery when sufficiently many, but fewer than min_size, shards are present should have been resolved with osd_allow_recovery_below_min_size=true. It is really dangerous to reduce min_size below k+1 and, in fact, it should never be necessary for recovery. Can you check if this option is present and set to true? If it is not working as intended, a tracker ticket might be in order.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>

--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49 261 287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49 261 287 1001312
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



