Re: OSD state<Start>: transitioning to Stray

According to ceph -s, the cluster is in recovery, backfill, etc.:

  data:
    pools:   7 pools, 19656 pgs
    objects: 65.02M objects, 248 TiB
    usage:   761 TiB used, 580 TiB / 1.3 PiB avail
    pgs:     16.173% pgs unknown
             0.493% pgs not active
             890328/195069177 objects degraded (0.456%)
             828080/195069177 objects misplaced (0.425%)
             15733 active+clean
             3179  unknown
             215   active+undersized+degraded+remapped+backfilling
             152   active+undersized+degraded+remapped+backfill_wait
             135   active+remapped+backfill_wait
             107   active+remapped+backfilling
             65    down
             31    undersized+degraded+peered
             18    active+recovering
             7     active+recovery_wait
             6     active+recovery_wait+degraded
             4     active+recovering+degraded
             1     active+recovery_wait+remapped
             1     peering
             1     active+remapped+backfill_toofull
             1     active+undersized+degraded+remapped+backfill_wait+backfill_toofull

  io:
    client:   607 B/s rd, 134 MiB/s wr, 0 op/s rd, 34 op/s wr
    recovery: 1.9 GiB/s, 511 objects/s
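
To narrow down which OSDs and PGs are actually affected, a few read-only
commands help; a minimal sketch (the PG ID is the example from the log
excerpt quoted further below):

  # list only the OSDs the cluster currently considers down
  ceph osd tree down

  # show stuck PGs and the OSDs they map to
  ceph health detail
  ceph pg dump_stuck inactive

  # inspect a single problematic PG in detail
  ceph pg 11.1992 query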




On 09.12.2019 at 13:44, Paul Emmerich wrote:
> An OSD that is down does not recover or backfill. Faster recovery or
> backfill will not resolve down OSDs.
>
>
> Paul
>
> -- 
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Mon, Dec 9, 2019 at 1:42 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>
>     Hi,
>
>     I think I can speed up the recovery / backfill.
>
>     What is the recommended setting for
>     osd_max_backfills
>     osd_recovery_max_active
>     ?
>
>     THX
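
For reference, both settings can be changed at runtime; a minimal sketch,
assuming a Mimic/Nautilus-era cluster (the defaults are osd_max_backfills=1
and osd_recovery_max_active=3; the values below are illustrative, not
recommendations):

  # persist via the monitors' central config database
  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 8

  # or inject into the running OSDs without persisting
  ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'

As noted above, though, this does nothing for OSDs that are down.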
>
>     On 09.12.2019 at 13:36, Paul Emmerich wrote:
>     > This message is expected.
>     >
>     > But your current situation is a great example of why having a
>     > separate cluster network is a bad idea in most situations.
>     > First thing I'd do in this scenario is to get rid of the cluster
>     > network and see if that helps.
>     >
>     >
>     > Paul
>     >
>     > --
>     > Paul Emmerich
>     >
>     > Looking for help with your Ceph cluster? Contact us at
>     > https://croit.io
>     >
>     > croit GmbH
>     > Freseniusstr. 31h
>     > 81247 München
>     > www.croit.io
>     > Tel: +49 89 1896585 90
>     >
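
To illustrate the quoted advice about dropping the cluster network, a sketch
of the change (the subnets are placeholders; apply to ceph.conf on every
node, then restart the OSDs node by node):

  [global]
  public network  = 192.168.0.0/24
  # cluster network = 192.168.1.0/24   <- removed / commented out

  # per node, afterwards:
  systemctl restart ceph-osd.target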
>     >
>     > On Mon, Dec 9, 2019 at 11:22 AM Thomas Schneider
>     > <74cmonty@xxxxxxxxx> wrote:
>     >
>     >     Hi,
>     >     I had a failure on 2 of 7 OSD nodes.
>     >     This caused a server reboot, and unfortunately the cluster
>     >     network failed to come up.
>     >
>     >     This resulted in many OSDs being down.
>     >
>     >     I decided to stop all services (OSD, MGR, MON) and to start them
>     >     sequentially.
>     >
>     >     Now I have multiple OSDs marked as down although the service is
>     >     running.
>     >     None of these down OSDs is located on the 2 failed nodes.
>     >
>     >     In the OSD logs I can see multiple entries like this:
>     >     2019-12-09 11:13:10.378 7f9a372fb700  1 osd.374 pg_epoch: 493189
>     >     pg[11.1992( v 457986'92619 (303558'88266,457986'92619]
>     >     local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724
>     >     les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1
>     >     lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown
>     >     NOTIFY mbc={}] state<Start>: transitioning to Stray
>     >
>     >     I tried to restart the impacted OSDs without success; the
>     >     relevant OSDs are still marked as down.
>     >
>     >     Is there a procedure to overcome this issue, i.e. to get all
>     >     OSDs up?
>     >
>     >     THX
>     >
>
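
On the quoted question of getting the down-but-running OSDs back up: an OSD
whose process runs but stays marked down usually cannot complete heartbeats
with its peers, which matches the cluster-network failure described above.
A diagnostic sketch (osd.374 is the example ID from the quoted log; the
hostname is a placeholder):

  # which OSDs does the cluster consider down?
  ceph osd tree down

  # does the daemon itself respond? (run on the OSD's host)
  ceph daemon osd.374 status

  # verify OSD-to-OSD reachability on all configured networks
  ping <other-osd-host>

  # restart the daemon and watch its log for heartbeat errors
  systemctl restart ceph-osd@374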
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



