Re: I/O hangs with 2 node failure even if one node isn't involved in I/O

Hi,

On Tue, Mar 21, 2017 at 11:59 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
fail. Are all of the following accurate?

a. An rbd is split into lots of objects, parts of which will probably
exist on all 4 hosts.

Correct.
 

b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.

Likely correct.
 
c. Reads can continue from the single online OSD even in pgs that
happen to have two of their 3 OSDs offline.


Hypothetically (this is partially informed guessing on my part):
If the survivor happens to be the acting primary and it was up to date at the time,
it can in theory serve reads. (Only the primary serves reads.)

If the survivor wasn't the acting primary, you don't have any guarantee as to
whether or not it has the most up-to-date version of any given object. I don't know
if enough state is tracked outside of the OSDs to make this determination, but
I doubt it (it feels costly to maintain).
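
If you want to poke at this, something along these lines should show which PG
an object maps to and which OSD is currently the acting primary (object name
taken from later in this thread, substitute your own):

    # print the PG for the object plus the up/acting OSD sets and their primary
    ceph osd map rbd rbd_id.vm-100-disk-1
    # or list PG -> up/acting/primary mappings cluster-wide
    ceph pg dump pgs_brief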

Regardless of scenario, I'd guess the PG gets marked as down, and it will stay
that way until you either revive one of the deceased OSDs or essentially tell Ceph
that they're a lost cause and accept the potential data loss that comes with that
(see: ceph osd lost).
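
That last bit would look roughly like the below - say for osd.12 and osd.13 -
and only once you've accepted that whatever data lived solely on them is gone:

    # declare a dead OSD permanently lost so peering can proceed without it
    ceph osd lost 12 --yes-i-really-mean-it
    ceph osd lost 13 --yes-i-really-mean-it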

d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
the min_size=2 constraint.

Correct. 
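
While writes are blocked you should be able to see the affected PGs with
something like:

    # PGs below min_size show up as undersized/inactive here
    ceph health detail
    ceph pg dump_stuck inactive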
 
e. Rebalancing does not occur because with only two hosts online there
is no way for CRUSH to meet the size=3 constraint even if it were to
rebalance.

Partially correct, see c)

f. I/O can be restored by setting min_size=1.

See c)
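
If you decide to go that route it's a one-liner, with the caveats Wes gives
further down about running on a single replica:

    # allow writes with only one replica online - risky, see below
    ceph osd pool set rbd min_size 1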
 
g. Alternatively, I/O can be restored by setting size=2, which would
kick off rebalancing and restore I/O as the pgs come into compliance
with the size=2 constraint.

See c)
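
Likewise a one-liner, should you really want to shrink the pool:

    # reduce the replica count to 2 so 2-copy PGs count as fully replicated
    ceph osd pool set rbd size 2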
 
h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
two hosts fail, some pgs would have only 1 OSD online, but rebalancing
would start immediately since CRUSH can honor the size=3 constraint by
rebalancing. This means more nodes make for a more reliable cluster.

See c)

Side-note: This is where you start using CRUSH to enumerate what you'd consider
the likely failure domains for concurrent failures. E.g. if you have racks with
distinct power circuits and TOR switches, your most likely large-scale failure is
losing a rack, so you tell CRUSH to maintain replicas in distinct racks.
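
As a sketch, in a decompiled crushmap that could look something like the rule
below (rule name and ruleset id are made up, and it assumes you've already
arranged your hosts under rack buckets):

    rule replicated_racks {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            # pick one leaf (osd) from each of 'size' distinct racks
            step chooseleaf firstn 0 type rack
            step emit
    }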

i. If I wanted to force CRUSH to bring I/O back online with size=3 and
min_size=2 but only 2 hosts online, I could remove the host bucket from
the crushmap. CRUSH would then rebalance, but some PGs would likely end
up with 3 OSDs all on the same host. (This is theory. I promise not to
do any such thing to a production system ;)
 
Partially correct, see c). 
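
If you ever did want to experiment with that on a throwaway cluster, the usual
round-trip is to decompile, edit and re-inject the crushmap; changing the
rule's chooseleaf step from "type host" to "type osd" gets you the same effect
without deleting buckets:

    # export and decompile the current crushmap
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt, e.g. change "step chooseleaf firstn 0 type host" to "type osd"
    crushtool -c crush.txt -o crush.new
    # inject the modified map (this will trigger data movement)
    ceph osd setcrushmap -i crush.new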

 
Thanks
--
Adam Carheden


On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> If you had set min_size to 1 you would not have seen the writes pause. A
> min_size of 1 is dangerous though because it means you are 1 hard disk
> failure away from losing the objects within that placement group
> entirely. A min_size of 2 is generally considered the minimum you want,
> but many people ignore that advice; some wish they hadn't.
>
> On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
>
>     Thanks everyone for the replies. Very informative. However, should I
>     have expected writes to pause if I'd had min_size set to 1 instead of 2?
>
>     And yes, I was under the false impression that my rbd device was a
>     single object. That explains what all those other things are on a test
>     cluster where I only created a single object!
>
>
>     --
>     Adam Carheden
>
>     On 03/20/2017 08:24 PM, Wes Dillingham wrote:
>     > This is because of the min_size specification. I would bet you have it
>     > set at 2 (which is good).
>     >
>     > ceph osd pool get rbd min_size
>     >
>     > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
>     > from each host) results in some of the objects only having 1 replica.
>     > min_size dictates that IO freezes for those objects until min_size is
>     > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
>     >
>     > I can't tell if you're under the impression that your RBD device is a
>     > single object. It is not. It is chunked up into many objects and spread
>     > throughout the cluster, as Kjetil mentioned earlier.
>     >
>     > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>     >
>     >     Hi,
>     >
>     >     rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
>     >     will get you a "prefix", which then gets you on to
>     >     rbd_header.<prefix>. rbd_header.<prefix> contains block size,
>     >     striping, etc. The actual data-bearing objects will be named
>     >     something like rbd_data.<prefix>.%016x.
>     >
>     >     Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
>     >     <block size> of that image will be named
>     >     rbd_data.86ce2ae8944a.0000000000000000, the second <block size>
>     >     will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances
>     >     are that one of these objects is mapped to a pg which has both
>     >     host3 and host4 among its replicas.
>     >
>     >     An rbd image will end up scattered across most/all osds of the pool
>     >     it's in.
>     >
>     >     Cheers,
>     >     -KJ
>     >
>     >     On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carheden@xxxxxxxx> wrote:
>     >
>     >         I have a 4 node cluster shown by `ceph osd tree` below. Monitors
>     >         are running on hosts 1, 2 and 3. It has a single replicated pool
>     >         of size 3. I have a VM with its hard drive replicated to OSDs
>     >         11(host3), 5(host1) and 3(host2).
>     >
>     >         I can 'fail' any one host by disabling the SAN network interface
>     >         and the VM keeps running with a simple slowdown in I/O
>     >         performance, just as expected. However, if I 'fail' both nodes 3
>     >         and 4, I/O hangs on the VM (i.e. `df` never completes, etc.).
>     >         The monitors on hosts 1 and 2 still have quorum, so that
>     >         shouldn't be an issue. The placement group still has 2 of its 3
>     >         replicas online.
>     >
>     >         Why does I/O hang even though host4 isn't running a monitor and
>     >         doesn't have anything to do with my VM's hard drive?
>     >
>     >
>     >         Size?
>     >         # ceph osd pool get rbd size
>     >         size: 3
>     >
>     >         Where's rbd_id.vm-100-disk-1?
>     >         # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
>     >         got osdmap epoch 1043
>     >         osdmaptool: osdmap file '/tmp/map'
>     >          object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>     >
>     >         # ceph osd tree
>     >         ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>     >         -1 8.06160 root default
>     >         -7 5.50308     room A
>     >         -3 1.88754         host host1
>     >          4 0.40369             osd.4       up  1.00000          1.00000
>     >          5 0.40369             osd.5       up  1.00000          1.00000
>     >          6 0.54008             osd.6       up  1.00000          1.00000
>     >          7 0.54008             osd.7       up  1.00000          1.00000
>     >         -2 3.61554         host host2
>     >          0 0.90388             osd.0       up  1.00000          1.00000
>     >          1 0.90388             osd.1       up  1.00000          1.00000
>     >          2 0.90388             osd.2       up  1.00000          1.00000
>     >          3 0.90388             osd.3       up  1.00000          1.00000
>     >         -6 2.55852     room B
>     >         -4 1.75114         host host3
>     >          8 0.40369             osd.8       up  1.00000          1.00000
>     >          9 0.40369             osd.9       up  1.00000          1.00000
>     >         10 0.40369             osd.10      up  1.00000          1.00000
>     >         11 0.54008             osd.11      up  1.00000          1.00000
>     >         -5 0.80737         host host4
>     >         12 0.40369             osd.12      up  1.00000          1.00000
>     >         13 0.40369             osd.13      up  1.00000          1.00000
>     >
>     >
>     >         --
>     >         Adam Carheden
>     >
>     >
>     >
>     >
>     >     --
>     >     Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
>     >     SRE, Medallia Inc
>     >     Phone: +1 (650) 739-6580
>     >
>     >
>     >
>     >
>     >
>     > --
>     > Respectfully,
>     >
>     > Wes Dillingham
>     > wes_dillingham@xxxxxxxxxxx
>     > Research Computing | Infrastructure Engineer
>     > Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>     >
>
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dillingham@xxxxxxxxxxx
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>



--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
