Re: I/O hangs with 2 node failure even if one node isn't involved in I/O


 



Hi,


I should clarify. When you worry about concurrent OSD failures, the source is more likely to be shared infrastructure - e.g. network, rack, or power. You'd spread your OSDs across those failure domains and tell CRUSH to put each replica in a separate one. For example: you have 3 or more racks, each with its own TOR switch and hopefully its own power circuit, and you tell CRUSH to spread your 3 replicas so that they land in separate racks.
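
For illustration only - a replicated CRUSH rule along these lines would do that, assuming your CRUSH map already has rack buckets under the default root (names here are made up; the min_size/max_size inside the rule are just the replica-count range the rule applies to, not the pool's min_size):

    rule replicated_by_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        # pick distinct racks, then one OSD (leaf) under each
        step chooseleaf firstn 0 type rack
        step emit
    }

"firstn 0" means "as many as the pool's size asks for", so with size=3 you get one replica in each of 3 racks.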

We do run min_size=2, size=3, but with OSDs spread across multiple racks and the 3 replicas required to be in 3 different racks. Our reasoning: two or more machines failing at the same instant for a reason other than switch/power is unlikely enough that we'll happily live with it, and so far that has served us well.
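
If you want to sanity-check that kind of layout, "ceph osd tree" shows whether hosts and OSDs actually sit under rack buckets, and you point a pool at the rack-spreading rule with something like this (rule id made up; the property is called crush_ruleset on Jewel-era releases):

    ceph osd tree
    ceph osd pool set <pool> crush_ruleset 1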

-KJ

On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:

For the most part I'm assuming min_size=2, size=3. With min_size=3
and size=3 this changes.

size is how many replicas of an object to maintain; min_size is how many
writes need to succeed before the primary can ack the operation to the client.

A larger min_size most likely means higher latency for writes.
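
Both are per-pool settings; you can check what a given pool is using with:

    ceph osd pool get <pool> size
    ceph osd pool get <pool> min_size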

On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:

>> c. Reads can continue from the single online OSD even in pgs that
>> happened to have two of 3 osds offline.
>>
>
> Hypothetically (This is partially informed guessing on my part):
> If the survivor happens to be the acting primary and it were up-to-date at
> the time,
> it can in theory serve reads. (Only the primary serves reads).

It makes no sense that only the primary could serve reads. That would
mean that even if only a single OSD failed, all PGs for which that OSD
was primary would be unreadable.

Acting [1, 2, 3] - the primary is 1, and only 1 serves reads. If 1 fails, 2 becomes
the new primary. It'll probably check with 3 to determine whether there were
any writes it itself is unaware of - and peer if there were. Promotion should
be near instantaneous (well, you'd in all likelihood be able to measure it).
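
If you want to see who the primary is, "ceph osd map <pool> <objectname>" prints the up/acting set (and primary) for the PG an object maps to, and "ceph pg map <pgid>" does the same if you already know the PG id - the pool/object/PG names here are placeholders:

    ceph osd map rbd someobject
    ceph pg map 1.7f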
 
There must be an algorithm to appoint a new primary. So in a 2 OSD
failure scenario, a new primary should be appointed after the first
failure, no? Would the final remaining OSD not appoint itself as
primary after the 2nd failure?


Assuming min_size=2, size=3 - if 2 OSDs fail at the same instant,
you have no guarantee that the survivor has all writes.

Assuming min_size=3 and size=3 - then yes, you're good: the surviving
OSD can safely be promoted. You're severely degraded, but promotion is safe.

If you genuinely worry about concurrent failures of 2 machines, run with
min_size=3; the price you pay is slightly increased mean/median latency
for writes.
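
If you do go that route, it's a one-liner per pool (pool name is a placeholder):

    ceph osd pool set <pool> min_size 3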

This makes sense in the context of Ceph's synchronous writes too. A
write isn't complete until all 3 OSDs in the PG have the data,
correct? So shouldn't any one of them be able to act as primary at any
time?

See the distinction between size and min_size above.
 
I don't see how that would change even if 2 of 3 OSDs fail at exactly
the same time.



--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc



--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
