Re: [External Email] Re: Nautilus: Taking out OSDs that are 'Failure Pending' [EXT]

Dave,

Actually, my failure domain is OSD, because so far I have only 9 OSD nodes
but am using EC 8+2.  The drives are all still functioning, except that one
has failed multiple times in the last few days, each time requiring a node
power-cycle to recover.  I will certainly mark that one out immediately.

The other two pending failures are behaving more politely, so I am assuming
that the cluster could copy their data elsewhere as part of the rebalance.
I'm also concerned about the rebalance process moving data onto these
drives with pending failures.

Since I'm using EC 8+2, perhaps it is safe to mark two out simultaneously?
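
To make the question concrete, here is a rough sketch of the check I have in
mind.  It assumes I already have each PG's acting set as a list of OSD ids;
the PG ids, OSD ids, and function names below are made up for illustration
and are not real cluster output or a Ceph API.  With m=2 coding shards, a PG
can lose at most 2 shards, so the sketch counts how many shards of any one
PG sit on the OSDs I want to mark out.

```python
def shards_affected(pg_acting, osds_out):
    """Return {pgid: count} for PGs with at least one shard on osds_out."""
    out = set(osds_out)
    hits = {}
    for pgid, acting in pg_acting.items():
        count = sum(1 for osd in acting if osd in out)
        if count:
            hits[pgid] = count
    return hits


def within_ec_tolerance(pg_acting, osds_out, m):
    """True if no PG has more than m shards on the OSDs being marked out."""
    return all(n <= m for n in shards_affected(pg_acting, osds_out).values())


# Hypothetical acting sets for three PGs in an 8+2 pool (10 shards each).
pg_acting = {
    "7.1a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "7.1b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    "7.1c": [3, 12, 20, 21, 22, 23, 24, 25, 26, 27],
}

print(shards_affected(pg_acting, [3, 12]))
print(within_ec_tolerance(pg_acting, [3, 12], m=2))
print(within_ec_tolerance(pg_acting, [3, 4, 5], m=2))
```

This only counts shards.  It says nothing about whether the drives survive
long enough for the backfill to finish, which is the part I'm unsure about.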

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx

On Fri, Aug 4, 2023 at 10:16 AM Dave Holland <dh3@xxxxxxxxxxxx> wrote:

> On Fri, Aug 04, 2023 at 09:44:57AM -0400, Dave Hall wrote:
> > My inclination is to mark these 3 OSDs 'OUT' before they crash
> > completely, but I want to confirm my understanding of Ceph's response
> > to this. Mainly, given my EC pools (or replicated pools for that
> > matter), if I mark all 3 OSDs out all at once will I risk data loss?
>
> It depends on your crush map and failure domain layout. In the
> unlikeliest and unluckiest case, all those 3 OSDs are in different
> failure domains, and some data has 1 replica on each of those OSDs. In
> that situation, if you take them out simultaneously, you would lose
> data. If you're unsure, then do them one at a time and wait for the
> rebalance/backfill to complete before doing the next.
>
> We arrange our OSDs so that the failure domain is the rack; losing an
> entire rack is safe (and we've had that happen) so we know it's safe
> to pull any number of OSDs in the same rack and we won't lose data.
>
> Dave
> --
> **   Dave Holland   ** Systems Support -- Informatics Systems Group **
> ** dh3@xxxxxxxxxxxx **    Wellcome Sanger Institute, Hinxton, UK    **