On Fri, 3 Mar 2023 at 01:07, <stefan.pinter@xxxxxxxxxxxxxxxx> wrote:
> it is unclear for us what min_size means besides what it does. i hope
> someone can clear this up :)
> someone pointed out "split brain" but I am unsure about this.
>
> i think what happens in the worst case is this:
> only 1 PG is available, client writes changes to this PG, the disk of
> this 1 PG dies as well - so i guess we'd need to restore the data from
> the 2 offline PGs in the room that is down and we would have lots of
> trouble with restoring and also with data inconsistency, right?

Do not assume the last PG needs to die in a horrible fire, killing several DC operators with it. It only takes a really small outage, a fluke restart or a short network glitch, for it to be gone while still holding the last acknowledged write, making any 1-2-3-4-5 other replicas of that PG useless: the cluster knows that this one missing OSD has the latest writes, so no other copy can stand in for it, and it can't be "repaired" from the others. Perhaps that PG sat on a bad spot on the platter and nobody noticed, since it wasn't the primary replica, but when the OSD comes back and has to replicate all of its (new or old) data to the other replicas, you find that faulty part of the magnetics.

This is very much like what we all did long ago, putting some 10 large drives in a big RAID-5. You get a silent error somewhere on drive 5, but the "patrol reads" don't reach it for a long while, because it takes time to read all sectors on all disks. Then drive 3 fails, and suddenly ALL 9 remaining drives need to read ALL their sectors in order to rebuild onto the replacement (or hot-spare) drive. Now you find the unnoticed problem on drive 5, can't rebuild everything reliably to the spare, and someone is going to lose data, small or big. We hoped that "two drive failures at the same time, on the same day" was very unlikely, which is true, except we did not get them on the same day; we just noticed them both on that bad day. This feeds the rumour that "same-brand drives fail in clusters", which in some sense is still true, but the above scenario gives it extra validation it doesn't deserve.

So after repeating those kinds of mistakes for a bit too long, storage admins decided that losing client data is not worth the "savings" you think you make when you cheap out on SATA disks, as if that were the main cost driver in your organisation and not the data or the employees' time and tears. This is why Ceph defaults to 3 copies, preferably on 3+ different hosts, with min_size 2. If we are down to one copy, we spend 100% of the capacity on making another copy fast, not on taking on more IO in a really, really bad situation.

There are settings for stretched clusters, which more or less have you keep 2+2 copies, two on each site, and so forth, but you really should consider what happens on each side when your DC-DC link falls over: which side wins, and why. Perhaps it is better to have separate clusters with some sync mechanism and let each side carry on as a standalone island, perhaps some other solution is better, but from the wisdom of many Ceph admins we see that no one recommends min_size 1 as a permanent way of "solving" anything. Even replica=2 is frowned upon, because any surprise there leaves you at a single copy immediately.

--
May the most significant bit of your life be positive.
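To make the size/min_size part concrete, this is roughly what it looks like on the CLI for a replicated pool; "mypool" is just a placeholder name:

    ceph osd pool ls detail              # lists size and min_size for every pool
    ceph osd pool get mypool size        # number of copies kept, default 3
    ceph osd pool get mypool min_size    # copies that must be up before IO is allowed, default 2
    ceph osd pool set mypool min_size 2  # fine; leaving this at 1 permanently is what the thread warns against

Dropping min_size to 1 is sometimes done briefly to get a stuck cluster writing again, but as described above it means the cluster will acknowledge writes that exist on a single OSD only.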
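For the stretched 2+2 layout mentioned above, one way people express the placement side of it is a CRUSH rule that picks two datacenters and two hosts in each, paired with size=4 on the pool. This is only a sketch and assumes your CRUSH map actually has two buckets of type "datacenter" under the default root:

    rule stretch_rule {
        id 2
        type replicated
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }

The rule only controls where copies land; it does not answer the "which side wins when the DC-DC link drops" question. That still comes down to where your monitors (and any tiebreaker) live, which is exactly the part worth thinking through before choosing one stretched cluster over two independent ones with a sync mechanism.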