Hi, thanks for your answers!
On Mon, Jan 1 2024 at 17:00:59 -0500, Anthony D'Atri
<aad@xxxxxxxxxxxxxx> wrote:
Hi and thanks for your answers!
So my understanding from this is: make sure that the "admin" node
has a fast CPU
You don’t strictly need an admin node as such. Only worry about
clock rate if you’re doing CephFS.
So an admin node is not required?
I can run all the daemons on three "fileservers"?
As we will use CephFS, we want a good clock rate, but fileservers
running OSDs want multiple cores, right?
Or did I misunderstand?
max = 3 and min = 2.
For your pools, size=3, min_size=2. These are defaults, don’t
worry about them.
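For reference, this is roughly how you'd check and set these on an
existing pool (a minimal sketch; "mypool" is a placeholder pool name):

    # show the current replication settings for a pool
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size
    # set them explicitly (these match the defaults)
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2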
You can expand this cluster by just adding one data node; there is
no need to expand with another 3 nodes, right?
With defaults, yes.
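If you end up deploying with cephadm, adding a data node later looks
roughly like this (a sketch; the hostname and IP are placeholders,
and it assumes the orchestrator manages your OSDs):

    # enroll the new host in the cluster
    ceph orch host add data-node-4 10.0.0.14
    # have the orchestrator create OSDs on its empty drives
    ceph orch apply osd --all-available-devices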
With three data nodes, suppose things start to break. If one disk
yes
or a few disks break
Depends on which disks.
you will still have two copies of your objects, and the cluster
will be in degraded mode until the disks are replaced?
Ceph will heal itself to restore redundancy if it can. With three
nodes, if you lose one node it can't heal; it'll be in degraded
mode, but data will be available.
If you lose just one drive, unless the cluster is very full,
redundancy will be restored using the surviving drives on that node.
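You can watch the healing happen with the standard status commands:

    # overall health, including degraded/recovering PG counts
    ceph -s
    # per-PG detail on what is degraded and why
    ceph health detail
    # which OSDs are down, and on which host they live
    ceph osd tree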
There is no healing or restructuring of objects while a couple of
disks are broken?
Depends on which disks. If you lose one on each of 3 nodes at the
same time, that’s a problem. If you lose one on each of 3 nodes a
week apart, probably not.
If one of the data nodes breaks, there are still 2 copies of the
objects, and the cluster will run in degraded mode until the server
is replaced and data is re-replicated?
Yes.
If two data nodes break, the cluster will fail, but there is still
one copy of the objects, so if the two nodes are replaced, the
cluster and all objects will be there after replication. No lost
data?
Correct, *if* nothing happens to the survivors. But unless you take
manual steps, data will be unavailable.
Most of the time if a node fails you can replace a DIMM etc. and
bring it back.
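While the node is out for that kind of repair, it's common to tell
Ceph not to rebalance in the meantime:

    # keep the down OSDs from being marked out, which would trigger
    # a full re-replication of their data onto other nodes
    ceph osd set noout
    # ...replace the DIMM, reboot the node...
    ceph osd unset noout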
Many thanks!!
Regards
Marcus
On Fri, Dec 22 2023 at 19:12:19 -0500, Anthony D'Atri
<aad@xxxxxxxxxxxxxx> wrote:
You can do that for a PoC, but that's a bad idea for any
production workload. You'd want at least three nodes with OSDs
to use the default RF=3 replication. You can do RF=2, but at the
peril of your mortal data.
I'm not sure I agree - I think size=2, min_size=2 is no worse than
RAID1 for data security.
size=2, min_size=2 *is* RAID1. Except that you become unavailable
if a single drive is unavailable.
That isn't even the main risk as I understand it. Of course a double
failure is going to be a problem with size=2, or traditional RAID1,
and I think anybody choosing this configuration accepts this risk.
We see people often enough who don't know that. I've seen double
failures. YMMV.
As I understand it, the reason min_size=1 is a trap has nothing to
do with double failures per se.
It's one of the concerns.
The issue is that Ceph OSDs are somewhat prone to flapping during
recovery (OOM, etc.). So even if the disk is fine, an OSD can go
down for a short time. If you have size=2, min_size=1 configured,
then when this happens the PG will become degraded and will continue
operating on the other OSD, and the flapping OSD becomes stale. Then
when it comes back up it recovers. The problem is that if the other
OSD has a permanent failure (disk crash/etc.) while the first OSD is
flapping, now you have no good OSDs, because when the flapping OSD
comes back up it is stale, and its PGs have no peer.
Indeed, arguably that's an overlapping failure. I've seen this too,
and have a pg query to demonstrate it.
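For anyone who wants to look for this state on their own cluster,
something like the following will show it (a sketch; the PG id 1.2f
is a placeholder):

    # list PGs stuck in the stale state
    ceph pg dump_stuck stale
    # inspect one of them; the recovery_state section shows which
    # OSDs it is waiting on
    ceph pg 1.2f query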
I suspect there are ways to re-activate it, though this will result
in potential data inconsistency, since writes were allowed to the
cluster and will then get rolled back.
Yep.
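The usual last-resort levers look roughly like this (a sketch, not a
recipe; osd.7 and PG 1.2f are placeholders, and these commands can
discard data, so read the docs before running anything like them):

    # declare a permanently dead OSD lost so peering can proceed
    ceph osd lost 7 --yes-i-really-mean-it
    # if objects remain unfound, revert to an older copy (or delete)
    ceph pg 1.2f mark_unfound_lost revert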
With only two OSDs I'm guessing that would be the main impact (well,
depending on journaling behavior/etc.), but if you have more OSDs
than that, then you could have situations where one file is getting
rolled back, and some other file isn't, and so on.
But you'd have a voting majority.
With min_size=2 you're fairly safe from flapping, because there will
always be two replicas that have the most recent version of every
PG, and so you can still tolerate a permanent failure of one of
them.
Exactly.
size=2, min_size=2 doesn't suffer this failure mode, because anytime
there is flapping the PG goes inactive and no writes can be made, so
when the other OSD comes back up there is nothing to recover. Of
course this results in blocked I/O and downtime, which is obviously
undesirable, but it is likely a more recoverable state than
inconsistent writes.
Agreed, that's the difference between availability and durability.
Depends on what's important to you.
Apologies if I've gotten any of that wrong, but my understanding is
that it is these sorts of failure modes that cause min_size=1 to be
a trap. This isn't the sort of thing that typically happens in a
RAID1 config, or at least not something admins think about.
It's both.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx