Hi,
We do run min_size=2, size=3, with OSDs spread across multiple racks, and we require the 3 replicas to be in 3 different racks. Our reasoning is that two or more machines failing at the same instant for reasons other than a shared switch/power failure is unlikely enough that we'll happily live with it, and it has so far served us well.
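To illustrate the constraint, here's a toy Python check with a made-up rack map - not our actual CRUSH rule, just the property we're after:

# Toy model of the placement constraint: every PG's three replicas
# must land in three different racks. The rack map below is invented.
osd_to_rack = {
    "osd.0": "rack-a", "osd.1": "rack-a",
    "osd.2": "rack-b", "osd.3": "rack-b",
    "osd.4": "rack-c", "osd.5": "rack-c",
}

def spans_three_racks(acting_set):
    """True if the replicas of a PG sit in 3 distinct racks."""
    racks = {osd_to_rack[osd] for osd in acting_set}
    return len(racks) == len(acting_set) == 3

print(spans_three_racks(["osd.0", "osd.2", "osd.4"]))  # True  - one OSD per rack
print(spans_three_racks(["osd.0", "osd.1", "osd.2"]))  # False - two replicas share rack-a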
-KJ
On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
For the most part - I'm assuming min_size=2, size=3. In the min_size=3 and size=3 case this changes. size is how many replicas of an object to maintain; min_size is how many writes need to succeed before the primary can ack the operation to the client. A larger min_size most likely means higher latency for writes.

On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carheden@xxxxxxxx> wrote:

On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>> c. Reads can continue from the single online OSD even in pgs that
>> happened to have two of 3 osds offline.
>>
>
> Hypothetically (This is partially informed guessing on my part):
> If the survivor happens to be the acting primary and it were up-to-date at
> the time,
> it can in theory serve reads. (Only the primary serves reads).
It makes no sense that only the primary could serve reads. That would
mean that even if only a single OSD failed, all PGs for which that OSD
was primary would be unreadable.

Acting [1, 2, 3] - primary is 1, only 1 serves reads. If 1 fails, 2 is now the new primary. It'll probably check with 3 to determine whether there were any writes it is itself unaware of - and peer if there were. Promotion should be near instantaneous (well, you'd in all likelihood be able to measure it).

There must be an algorithm to appoint a new primary. So in a 2 OSD
failure scenario, a new primary should be appointed after the first
failure, no? Would the final remaining OSD not appoint itself as
primary after the 2nd failure?
Assuming min_size=2, size=3 - if 2 OSDs fail at the same instant, you have no guarantee that the survivor has all writes. Assuming min_size=3 and size=3 - then yes, you're good: the surviving OSD can safely be promoted. You're severely degraded, but it can safely be promoted. If you genuinely worry about concurrent failures of 2 machines, run with min_size=3; the price you pay is slightly increased mean/median latency for writes.

This makes sense in the context of Ceph's synchronous writes too. A
write isn't complete until all 3 OSDs in the PG have the data,
correct? So shouldn't any one of them be able to act as primary at any
time?

See the distinction between size and min_size.

I don't see how that would change even if 2 of 3 OSDs fail at exactly
the same time.
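For what it's worth, here is a toy Python model of the size/min_size distinction discussed above. It is purely illustrative - it follows the description earlier in the thread (min_size writes must succeed before the primary acks the client) rather than Ceph's actual implementation, and the ToyPG class and OSD ids are made up:

# Toy model, not Ceph code: "size" replicas, "min_size" applied writes
# required before the client is acked, and primary promotion on failure.
class ToyPG:
    def __init__(self, acting, min_size):
        self.acting = list(acting)   # e.g. [1, 2, 3]; acting[0] is the primary
        self.min_size = min_size
        self.acked = {osd: set() for osd in acting}  # writes each OSD has applied

    def write(self, name):
        # Pretend only min_size replicas have applied the write at the
        # moment the client is acked; the rest may still be lagging.
        for osd in self.acting[:self.min_size]:
            self.acked[osd].add(name)

    def fail(self, osd):
        self.acting.remove(osd)  # survivors re-peer; acting[0] is the new primary

    def primary(self):
        return self.acting[0] if self.acting else None

# size=3, min_size=2: two simultaneous failures can take out every OSD
# that had actually applied an acked write.
pg = ToyPG(acting=[1, 2, 3], min_size=2)
pg.write("w1")          # acked once OSDs 1 and 2 have it; 3 may lag
pg.fail(1); pg.fail(2)  # both failures hit the OSDs holding "w1"
print(pg.primary(), pg.acked[3])   # 3 set()   -> survivor promoted, missing "w1"

# size=3, min_size=3: the ack implies all three replicas have the write,
# so any single survivor can safely serve it.
pg = ToyPG(acting=[1, 2, 3], min_size=3)
pg.write("w1")
pg.fail(1); pg.fail(2)
print(pg.primary(), pg.acked[3])   # 3 {'w1'}  -> survivor has every acked write

The point being: with min_size=2 a write can be acknowledged to the client even though the eventual sole survivor never saw it, while with min_size=3 the acknowledgement itself guarantees every replica has it.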
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com