Re: I/O hangs with 2 node failure even if one node isn't involved in I/O

If you had set min_size to 1, you would not have seen the writes pause. A min_size of 1 is dangerous, though, because it means you are one hard disk failure away from losing the objects within that placement group entirely. A min_size of 2 is generally considered the minimum you want, but many people ignore that advice, and some wish they hadn't.
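
You can check and adjust this per pool. A quick sketch, assuming a
replicated pool named rbd as in this thread:

ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 2

Dropping min_size to 1 un-freezes I/O on undersized placement groups
immediately, at the cost of acknowledging writes that exist on only a
single OSD until recovery catches up.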

On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
Thanks everyone for the replies. Very informative. However, should I
have expected writes to pause if I'd had min_size set to 1 instead of 2?

And yes, I was under the false impression that my RBD device was a
single object. That explains what all those other objects are on a test
cluster where I only created a single device!
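
For anyone else poking at this, something like the following should make
it obvious (a sketch, using my image's name):

# rbd info vm-100-disk-1
# rados -p rbd ls | grep -c rbd_data

rbd info reports the image's size, object count, and block_name_prefix,
and the rados count shows how many data objects actually exist so far
(they're only allocated as data is written).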


--
Adam Carheden

On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> This is because of the min_size specification. I would bet you have it
> set at 2 (which is good).
>
> ceph osd pool get rbd min_size
>
> With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> from each host) leaves some of the objects with only 1 replica, and
> min_size dictates that I/O freezes for those objects until min_size is
> achieved again. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
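>
> If you want to see exactly which placement groups are frozen below
> min_size, something like the following should list them (a sketch;
> "undersized" is one of the stuck states the Jewel-era CLI accepts):
>
> ceph health detail
> ceph pg dump_stuck undersized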
>
> I can't tell if you're under the impression that your RBD device is a
> single object. It is not. It is chunked up into many objects and spread
> throughout the cluster, as Kjetil mentioned earlier.
>
> On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>
>     Hi,
>
>     rbd_id.vm-100-disk-1 is only a "meta object"; IIRC its contents
>     will get you a "prefix", which then gets you on to
>     rbd_header.<prefix>. rbd_header.<prefix> contains block size,
>     striping, etc. The actual data-bearing objects will be named
>     something like rbd_data.<prefix>.%016x.
>
>     Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
>     <block size> of that image will be named
>     rbd_data.86ce2ae8944a.0000000000000000, the second <block size>
>     will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances
>     are that one of these objects is mapped to a pg which has both
>     host3 and host4 among its replicas.
>
>     An RBD image will end up scattered across most/all OSDs of the
>     pool it's in.
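>
>     You can check where any one data object lands with something like
>     this (a sketch, using the example prefix above):
>
>     ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000
>
>     That prints the pg plus the up/acting OSD sets for the object; any
>     object whose set includes OSDs on both host3 and host4 will block
>     while both hosts are down.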
>
>     Cheers,
>     -KJ
>
>     On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carheden@xxxxxxxx> wrote:
>
>         I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>         running on hosts 1, 2 and 3. It has a single replicated pool of size
>         3. I have a VM with its hard drive replicated to OSDs 11(host3),
>         5(host1) and 3(host2).
>
>         I can 'fail' any one host by disabling the SAN network
>         interface, and the VM keeps running with a simple slowdown in
>         I/O performance, just as expected. However, if I 'fail' both
>         nodes 3 and 4, I/O hangs on the VM (i.e. `df` never completes,
>         etc.). The monitors on hosts 1 and 2 still have quorum, so that
>         shouldn't be an issue. The placement group still has 2 of its 3
>         replicas online.
>
>         Why does I/O hang even though host4 isn't running a monitor and
>         doesn't have anything to do with my VM's hard drive?
>
>
>         Size?
>         # ceph osd pool get rbd size
>         size: 3
>
>         Where's rbd_id.vm-100-disk-1?
>         # ceph osd getmap -o /tmp/map && \
>             osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
>         got osdmap epoch 1043
>         osdmaptool: osdmap file '/tmp/map'
>          object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>
>         # ceph osd tree
>         ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>         -1 8.06160 root default
>         -7 5.50308     room A
>         -3 1.88754         host host1
>          4 0.40369             osd.4       up  1.00000          1.00000
>          5 0.40369             osd.5       up  1.00000          1.00000
>          6 0.54008             osd.6       up  1.00000          1.00000
>          7 0.54008             osd.7       up  1.00000          1.00000
>         -2 3.61554         host host2
>          0 0.90388             osd.0       up  1.00000          1.00000
>          1 0.90388             osd.1       up  1.00000          1.00000
>          2 0.90388             osd.2       up  1.00000          1.00000
>          3 0.90388             osd.3       up  1.00000          1.00000
>         -6 2.55852     room B
>         -4 1.75114         host host3
>          8 0.40369             osd.8       up  1.00000          1.00000
>          9 0.40369             osd.9       up  1.00000          1.00000
>         10 0.40369             osd.10      up  1.00000          1.00000
>         11 0.54008             osd.11      up  1.00000          1.00000
>         -5 0.80737         host host4
>         12 0.40369             osd.12      up  1.00000          1.00000
>         13 0.40369             osd.13      up  1.00000          1.00000
>
>
>         --
>         Adam Carheden
>
>
>
>
>     --
>     Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
>     SRE, Medallia Inc
>     Phone: +1 (650) 739-6580
>
>
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dillingham@xxxxxxxxxxx
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>



--
Respectfully,

Wes Dillingham
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
