Hi,
We do run min_size=2, size=3, with OSDs spread across multiple racks, and we require the 3 replicas to be in 3 different racks. Our reasoning is that two or more machines failing at the same instant for reasons other than a shared switch/power failure is unlikely enough that we'll happily live with it, and it has so far served us well.
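To illustrate the constraint, here's a toy Python check with a made-up rack map - not our actual CRUSH rule, just the property we're after:

# Toy model of the placement constraint: every PG's three replicas
# must land in three different racks. The rack map below is invented.
osd_to_rack = {
    "osd.0": "rack-a", "osd.1": "rack-a",
    "osd.2": "rack-b", "osd.3": "rack-b",
    "osd.4": "rack-c", "osd.5": "rack-c",
}

def spans_three_racks(acting_set):
    """True if the replicas of a PG sit in 3 distinct racks."""
    racks = {osd_to_rack[osd] for osd in acting_set}
    return len(racks) == len(acting_set) == 3

print(spans_three_racks(["osd.0", "osd.2", "osd.4"]))  # True  - one OSD per rack
print(spans_three_racks(["osd.0", "osd.1", "osd.2"]))  # False - two replicas share rack-a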
-KJ
On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
For the most part - I'm assuming min_size=2, size=3. In the min_size=3 and size=3 case this changes. size is how many replicas of an object to maintain; min_size is how many writes need to succeed before the primary can ack the operation to the client. A larger min_size most likely means higher latency for writes.

On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carheden@xxxxxxxx> wrote:

On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>> c. Reads can continue from the single online OSD even in pgs that
>> happened to have two of 3 osds offline.
>>
>
> Hypothetically (This is partially informed guessing on my part):
> If the survivor happens to be the acting primary and it were up-to-date at
> the time,
> it can in theory serve reads. (Only the primary serves reads).
It makes no sense that only the primary could serve reads. That would
mean that even if only a single OSD failed, all PGs for which that OSD
was primary would be unreadable.

Acting [1, 2, 3] - primary is 1, only 1 serves reads. If 1 fails, 2 is now the new primary. It'll probably check with 3 to determine whether there were any writes it is itself unaware of - and peer if there were. Promotion should be near instantaneous (well, you'd in all likelihood be able to measure it).

There must be an algorithm to appoint a new primary. So in a 2 OSD
failure scenario, a new primary should be appointed after the first
failure, no? Would the final remaining OSD not appoint itself as
primary after the 2nd failure?
Assuming min_size=2, size=3 - if 2 OSDs fail at the same instant, you have no guarantee that the survivor has all writes. Assuming min_size=3 and size=3 - then yes, you're good: the surviving OSD can safely be promoted. You're severely degraded, but it can safely be promoted. If you genuinely worry about concurrent failures of 2 machines, run with min_size=3; the price you pay is slightly increased mean/median latency for writes.

This makes sense in the context of Ceph's synchronous writes too. A
write isn't complete until all 3 OSDs in the PG have the data,
correct? So shouldn't any one of them be able to act as primary at any
time?

See the distinction between size and min_size.

I don't see how that would change even if 2 of 3 OSDs fail at exactly
the same time.
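For what it's worth, here is a toy Python model of the size/min_size distinction discussed above. It is purely illustrative - it follows the description earlier in the thread (min_size writes must succeed before the primary acks the client) rather than Ceph's actual implementation, and the ToyPG class and OSD ids are made up:

# Toy model, not Ceph code: "size" replicas, "min_size" applied writes
# required before the client is acked, and primary promotion on failure.
class ToyPG:
    def __init__(self, acting, min_size):
        self.acting = list(acting)   # e.g. [1, 2, 3]; acting[0] is the primary
        self.min_size = min_size
        self.acked = {osd: set() for osd in acting}  # writes each OSD has applied

    def write(self, name):
        # Pretend only min_size replicas have applied the write at the
        # moment the client is acked; the rest may still be lagging.
        for osd in self.acting[:self.min_size]:
            self.acked[osd].add(name)

    def fail(self, osd):
        self.acting.remove(osd)  # survivors re-peer; acting[0] is the new primary

    def primary(self):
        return self.acting[0] if self.acting else None

# size=3, min_size=2: two simultaneous failures can take out every OSD
# that had actually applied an acked write.
pg = ToyPG(acting=[1, 2, 3], min_size=2)
pg.write("w1")          # acked once OSDs 1 and 2 have it; 3 may lag
pg.fail(1); pg.fail(2)  # both failures hit the OSDs holding "w1"
print(pg.primary(), pg.acked[3])   # 3 set()   -> survivor promoted, missing "w1"

# size=3, min_size=3: the ack implies all three replicas have the write,
# so any single survivor can safely serve it.
pg = ToyPG(acting=[1, 2, 3], min_size=3)
pg.write("w1")
pg.fail(1); pg.fail(2)
print(pg.primary(), pg.acked[3])   # 3 {'w1'}  -> survivor has every acked write

The point being: with min_size=2 a write can be acknowledged to the client even though the eventual sole survivor never saw it, while with min_size=3 the acknowledgement itself guarantees every replica has it.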
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com