A down OSD does not recover or backfill, so faster recovery or backfill will not resolve the down OSDs.
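
To see which OSDs are actually marked down before touching any recovery settings, something like the following may help. It is only a minimal sketch, assuming the JSON layout printed by `ceph osd tree --format json` (a top-level `nodes` list whose OSD entries carry `type` and `status` fields). The sample data, OSD IDs, host name and the helper name `down_osds` are made up for illustration.

```python
import json

# Illustrative sample shaped like `ceph osd tree --format json` output.
# The IDs and the host name are hypothetical, not taken from the cluster above.
SAMPLE = """
{"nodes": [
  {"id": -1, "name": "default", "type": "root", "children": [-2]},
  {"id": -2, "name": "node1", "type": "host", "children": [0, 1]},
  {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
  {"id": 1, "name": "osd.1", "type": "osd", "status": "down"}
], "stray": []}
"""


def down_osds(tree):
    """Return the names of all OSD entries whose status is 'down'."""
    return [
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    ]


if __name__ == "__main__":
    print(down_osds(json.loads(SAMPLE)))
```

On a live cluster the sample string would be replaced by the real output of `ceph osd tree --format json`. An empty list would mean no OSD is marked down.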
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Dec 9, 2019 at 1:42 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
Hi,
I think I can speed up the recovery / backfill.
What is the recommended setting for
osd_max_backfills
osd_recovery_max_active
?
THX
On 09.12.2019 at 13:36, Paul Emmerich wrote:
> This message is expected.
>
> But your current situation is a great example of why having a separate
> cluster network is a bad idea in most situations.
> The first thing I'd do in this scenario is to get rid of the cluster
> network and see if that helps.
>
>
> Paul
>
> On Mon, Dec 9, 2019 at 11:22 AM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>
> Hi,
> I had a failure on 2 of 7 OSD nodes.
> This caused a server reboot, and unfortunately the cluster network
> failed to come up.
>
> This resulted in many OSDs being marked down.
>
> I decided to stop all services (OSD, MGR, MON) and to start them
> sequentially.
>
> Now I have multiple OSDs marked as down although their services are
> running.
> None of these down OSDs is on the 2 nodes that had the failure.
>
> In the OSD logs I can see multiple entries like this:
> 2019-12-09 11:13:10.378 7f9a372fb700 1 osd.374 pg_epoch: 493189
> pg[11.1992( v 457986'92619 (303558'88266,457986'92619]
> local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724
> les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1
> lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown
> NOTIFY
> mbc={}] state<Start>: transitioning to Stray
>
> I tried restarting the affected OSDs without success; the relevant
> OSDs are still marked as down.
>
> Is there a procedure to overcome this issue, i.e. to get all OSDs up?
>
> THX
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>