Re: Theory about min_size and its implications

On Fri, 3 Mar 2023 at 01:07, <stefan.pinter@xxxxxxxxxxxxxxxx> wrote:
> it is unclear to us what min_size means beyond what it does. I hope someone can clear this up :)
> someone pointed out "split brain" but I am unsure about this.
>
> I think what happens in the worst case is this:
> only 1 PG is available, the client writes changes to this PG, then the disk of this 1 PG dies as well - so I guess we would need to restore the data from the 2 offline PGs in the room that is down, and we would have lots of trouble with restoring and also with data inconsistency, right?

Do not assume the last PG copy needs to die in a horrible fire, killing
several DC operators with it. It only takes a REALLY small outage, a
fluke restart, a short network glitch, for it to be away while still
holding the last acknowledged writes. That makes any 1-2-3-4-5 other
replicas of the PG useless, since the cluster knows that this one
missing OSD has the latest writes; no other copy can replace it, and it
can't be "repaired" from the others.
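
To make that concrete, here is a toy model of what min_size actually
gates. This is plain Python for illustration only, not Ceph code, and
all the names in it are made up; the point it demonstrates is that with
min_size=1 an acknowledged write can end up existing on exactly one OSD.

    # Toy model of a replicated pool, NOT Ceph internals.
    # min_size gates whether the PG accepts IO at all; a write that is
    # accepted goes to every replica that is currently up before the ack.
    from dataclasses import dataclass

    @dataclass
    class Osd:
        name: str
        version: int = 0      # newest object version this OSD holds
        up: bool = True

    def write(replicas, min_size, new_version):
        """Return True if the write was acknowledged to the client."""
        up = [o for o in replicas if o.up]
        if len(up) < min_size:
            return False      # PG inactive: clients block, data stays consistent
        for o in up:
            o.version = new_version
        return True           # ack: every *up* replica has the data

    replicas = [Osd("osd.0"), Osd("osd.1"), Osd("osd.2")]
    replicas[1].up = replicas[2].up = False     # e.g. the other room is down

    print(write(replicas, min_size=2, new_version=1))   # False: IO pauses
    print(write(replicas, min_size=1, new_version=1))   # True: one copy, acked

    replicas[0].up = False    # now a short blip on that last OSD...
    print([o.version for o in replicas if o.up])
    # []  -> the only acknowledged copy is unavailable, and osd.1/osd.2
    #        cannot stand in for it because they never saw the write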

Perhaps that PG was sitting on a bad spot on the platter, but no one
noticed because it wasn't the primary replica of the PG. When the OSD
comes back and has to replicate all of its (new or old) data to the
other replicas, that is when you find the faulty part of the magnetics.

This is very much like what we all did long ago, putting some 10 large
drives in a big RAID-5. You get a silent error somewhere on drive 5,
but the "patrol reads" don't reach it for a long while, because it
takes time to read all sectors on all disks. Then drive 3 fails, and
suddenly ALL 9 remaining drives need to read ALL their sectors in
order to rebuild onto the replacement (or hot-spare) drive. Now you
find the undetected problem on drive 5, can't rebuild everything
reliably onto the spare, and someone is going to lose data, a little
or a lot.

We hoped that "two drive failures at the same time, on the same day,
is very unlikely", which is true, except we did not get them on the
same day; we just noticed them both on that bad day. This feeds the
rumour that "same-brand drives fail in clusters", which in some sense
is still true, but the above scenario gives it extra validation it
doesn't deserve.
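
To get a feel for how likely that "second" failure during a rebuild
really is, here is a rough back-of-the-envelope calculation; the drive
size and the URE rate below are assumed spec-sheet style numbers, not
measurements:

    # Chance of hitting at least one unrecoverable read error (URE)
    # while rebuilding, i.e. while reading every sector of every
    # surviving drive in the set.
    drive_tb    = 10        # assumed drive size in TB
    n_surviving = 9         # drives that must be read in full to rebuild
    ure_rate    = 1e-15     # assumed spec: about 1 URE per 1e15 bits read

    bits_to_read    = n_surviving * drive_tb * 1e12 * 8
    p_clean_rebuild = (1.0 - ure_rate) ** bits_to_read

    print(f"bits to read during rebuild : {bits_to_read:.2e}")
    print(f"P(rebuild sees no URE)      : {p_clean_rebuild:.1%}")
    print(f"P(rebuild trips on a URE)   : {1 - p_clean_rebuild:.1%}")
    # With these assumptions it is roughly a coin toss, which is why the
    # latent error on drive 5 so often surfaces exactly during the rebuild.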

So after repeating those kinds of mistakes for a bit too long, storage
admins decided that losing client data is not worth the "savings" you
think you make by cheaping out on SATA disks, as if those were the
main cost driver in your organisation and not the data or the
employees' time and tears. This is why ceph defaults to 3 copies,
preferably on 3+ different hosts, with min_size 2. If we are down to
one copy, we spend 100% of the capacity on making another copy fast,
instead of keeping on taking IO in a really, really bad situation.
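
If you want to check what your pools actually run with, the usual
commands look like this (the pool name "mypool" is just a placeholder):

    ceph osd pool get mypool size        # expect: size: 3
    ceph osd pool get mypool min_size    # expect: min_size: 2
    # and if someone "temporarily" lowered it during an outage:
    ceph osd pool set mypool min_size 2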

There are settings for stretched clusters, which roughly have you keep
2+2 copies, two on each site, and so forth, but you really should
think through what happens on each side when your DC-DC link falls
over: which side wins, and why. Perhaps it is better to run separate
clusters with some sync mechanism and let each side carry on as a
standalone island, perhaps some other solution is better, but from the
collective wisdom of many ceph admins we see that no one recommends
min_size 1 as a permanent way of "solving" anything. Even replica=2 is
frowned upon, because any surprise there puts you at size=1
immediately.
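
For reference, the 2+2 placement itself is usually expressed with a
CRUSH rule along these lines. This is only a sketch and assumes your
CRUSH map has the two sites as "datacenter" buckets under "default";
stretch mode proper also involves the monitors (a tiebreaker site),
which a placement rule alone does not give you.

    rule stretch_rule {
            id 5
            type replicated
            step take default
            step choose firstn 2 type datacenter
            step chooseleaf firstn 2 type host
            step emit
    }
    # used together with size=4 and min_size=2 on the pool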


--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


