Re: Nautilus: Taking out OSDs that are 'Failure Pending'

Dave Holland <dh3@xxxxxxxxxxxx> · Fri, 4 Aug 2023 15:15:28 +0100

On Fri, Aug 04, 2023 at 09:44:57AM -0400, Dave Hall wrote:
> My inclination is to mark these 3 OSDs 'OUT' before they crash completely,
> but I want to confirm my understanding of Ceph's response to this.  Mainly,
> given my EC pools (or replicated pools for that matter), if I mark all 3
> OSD out all at once will I risk data loss?

It depends on your CRUSH map and failure domain layout. In the
unluckiest case, all 3 of those OSDs are in different failure domains
and some placement group has one replica on each of them. In that
situation, taking all 3 out simultaneously would lose data. If you're
unsure, take them out one at a time and wait for the
rebalance/backfill to complete before doing the next.
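
To make the worst case concrete, here is a rough Python sketch of the
overlap check. It assumes you have already extracted each placement
group's acting set into a dict; the PG IDs, OSD numbers, and the
function name `pgs_at_risk` are made up for illustration and are not
part of any Ceph tool.

```python
from typing import Dict, Iterable, List


def pgs_at_risk(
    pg_acting: Dict[str, List[int]],
    osds_out: Iterable[int],
    max_loss: int,
) -> Dict[str, int]:
    """Return PGs that would lose more than `max_loss` copies or shards.

    max_loss is how many copies a PG can lose and still be recoverable:
    size - 1 for a replicated pool, m for a k+m EC pool.
    """
    out = set(osds_out)
    at_risk = {}
    for pgid, acting in pg_acting.items():
        lost = sum(1 for osd in acting if osd in out)
        if lost > max_loss:
            at_risk[pgid] = lost
    return at_risk


# Hypothetical acting sets for a size=3 replicated pool.
pg_acting = {
    "1.0": [3, 7, 12],
    "1.1": [3, 8, 15],
    "1.2": [4, 7, 9],
}

# Removing OSDs 3, 7 and 12 together leaves PG 1.0 with no copies.
print(pgs_at_risk(pg_acting, [3, 7, 12], max_loss=2))
```

An empty result means no PG in the sample has all of its copies on the
OSDs you plan to remove. It says nothing about PGs that are not in the
dict.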

We arrange our OSDs so that the failure domain is the rack; losing an
entire rack is safe (and we've had that happen) so we know it's safe
to pull any number of OSDs in the same rack and we won't lose data.
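
The "same rack" reasoning can be sketched as a quick check. This
assumes you have an OSD-to-rack mapping to hand (for example, read off
your CRUSH tree); the mapping and the name `same_failure_domain` below
are hypothetical.

```python
from typing import Dict, Iterable


def same_failure_domain(osd_to_domain: Dict[int, str], osds: Iterable[int]) -> bool:
    """True if every listed OSD sits in a single failure domain."""
    domains = {osd_to_domain[osd] for osd in osds}
    return len(domains) <= 1


# Hypothetical layout: OSD id -> rack name.
osd_to_rack = {3: "rack-a", 7: "rack-a", 12: "rack-a", 15: "rack-b"}

print(same_failure_domain(osd_to_rack, [3, 7, 12]))   # all in rack-a
print(same_failure_domain(osd_to_rack, [3, 7, 15]))   # spans two racks
```

This only helps if the pool's CRUSH rule really does place at most one
copy per rack, which is worth confirming first.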

Dave
-- 
**   Dave Holland   ** Systems Support -- Informatics Systems Group **
** dh3@xxxxxxxxxxxx **    Wellcome Sanger Institute, Hinxton, UK    **

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx