Re: I/O hangs with 2 node failure even if one node isn't involved in I/O

Wes Dillingham <wes_dillingham@xxxxxxxxxxx> · Mon, 20 Mar 2017 22:24:27 -0400

This is because of the min_size specification. I would bet you have it set at 2 (which is good). 

ceph osd pool get rbd min_size

With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1 from each of two hosts) leaves some of the objects with only 1 replica.
min_size dictates that I/O freezes for those objects until at least min_size replicas are available again. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
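
To make the min_size rule concrete, here is a minimal Python sketch (not anything Ceph itself runs). The OSD-to-host mapping and PG 0.1ea with OSDs [11,5,3] come from Adam's output quoted further down; the second PG and its OSD set are hypothetical, picked only to show a PG that has replicas on both host3 and host4. It assumes min_size is 2 and ignores any recovery or backfill that might later restore replicas.

```python
# OSD -> host mapping, copied from the `ceph osd tree` output in this thread.
OSD_HOST = {
    4: "host1", 5: "host1", 6: "host1", 7: "host1",
    0: "host2", 1: "host2", 2: "host2", 3: "host2",
    8: "host3", 9: "host3", 10: "host3", 11: "host3",
    12: "host4", 13: "host4",
}


def surviving_replicas(acting_set, failed_hosts):
    """Return the OSDs of a PG whose host has not failed."""
    return [osd for osd in acting_set if OSD_HOST[osd] not in failed_hosts]


def pg_serves_io(acting_set, failed_hosts, min_size=2):
    """A PG accepts I/O only while at least min_size replicas are available.

    min_size=2 is an assumption; check it with `ceph osd pool get rbd min_size`.
    """
    return len(surviving_replicas(acting_set, failed_hosts)) >= min_size


pgs = {
    "0.1ea": [11, 5, 3],            # from the thread (holds rbd_id.vm-100-disk-1)
    "hypothetical-pg": [12, 8, 4],  # made up: replicas on host4, host3, host1
}
failed = {"host3", "host4"}

for pgid, acting in pgs.items():
    state = "serves I/O" if pg_serves_io(acting, failed) else "blocked"
    print(pgid, surviving_replicas(acting, failed), state)
```

With host3 and host4 both down, PG 0.1ea keeps 2 replicas and stays writable, while the hypothetical PG drops to 1 replica and blocks. Any image object that lands in a PG like the second one is enough to hang the VM's I/O.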

I can't tell if you're under the impression that your RBD device is a single object. It is not. It is chunked up into many objects and spread throughout the cluster, as Kjetil mentioned earlier.

On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
Hi,
rbd_id.vm-100-disk-1 is only a "meta object". IIRC, its contents will get you a "prefix", which then gets you on to rbd_header.<prefix>; rbd_header.<prefix> contains block size, striping, etc. The actual data-bearing objects will be named something like rbd_data.<prefix>.%016x.

Example: vm-100-disk-1 has the prefix 86ce2ae8944a. The first <block size> of that image will be named rbd_data.86ce2ae8944a.0000000000000000, the second <block size> will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances are that one of these objects is mapped to a pg which has both host3 and host4 among its replicas.

An rbd image will end up scattered across most/all osds of the pool it's in.
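
As a rough illustration of that naming scheme, the Python sketch below maps a byte offset in the image to the name of the data object that would hold it. The prefix is the one from the example above. The 4 MiB object size, the 16-hex-digit zero-padded suffix, and the absence of custom striping are assumptions; the real values are stored in the image header. The function names are made up for this example.

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4 MiB objects; check `rbd info`


def rbd_data_object(prefix, offset, object_size=DEFAULT_OBJECT_SIZE):
    """Name of the data object holding the byte at `offset` (no custom striping)."""
    index = offset // object_size
    return f"rbd_data.{prefix}.{index:016x}"


def objects_for_range(prefix, offset, length, object_size=DEFAULT_OBJECT_SIZE):
    """Names of all data objects touched by `length` bytes starting at `offset`."""
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return [f"rbd_data.{prefix}.{i:016x}" for i in range(first, last + 1)]


prefix = "86ce2ae8944a"  # prefix from the example above
print(rbd_data_object(prefix, 0))
print(rbd_data_object(prefix, DEFAULT_OBJECT_SIZE))
# A 10 MiB write at offset 0 touches three separate objects, each of which
# may be placed in a different pg.
print(objects_for_range(prefix, 0, 10 * 1024 * 1024))
```

Each name printed this way can be fed to `ceph osd map rbd <object-name>` to see which pg and OSDs it lands on.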

Cheers,
-KJ

On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carheden@xxxxxxxx> wrote:
I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
running on hosts 1, 2 and 3. It has a single replicated pool of size
3. I have a VM with its hard drive replicated to OSDs 11(host3),
5(host1) and 3(host2).

I can 'fail' any one host by disabling the SAN network interface and
the VM keeps running with a simple slowdown in I/O performance just as
expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
(i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
have quorum, so that shouldn't be an issue. The placement group still
has 2 of its 3 replicas online.

Why does I/O hang even though host4 isn't running a monitor and
doesn't have anything to do with my VM's hard drive?

Size?
# ceph osd pool get rbd size
size: 3

Where's rbd_id.vm-100-disk-1?
# ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
got osdmap epoch 1043
osdmaptool: osdmap file '/tmp/map'
 object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]

# ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.06160 root default
-7 5.50308     room A
-3 1.88754         host host1
 4 0.40369             osd.4       up  1.00000          1.00000
 5 0.40369             osd.5       up  1.00000          1.00000
 6 0.54008             osd.6       up  1.00000          1.00000
 7 0.54008             osd.7       up  1.00000          1.00000
-2 3.61554         host host2
 0 0.90388             osd.0       up  1.00000          1.00000
 1 0.90388             osd.1       up  1.00000          1.00000
 2 0.90388             osd.2       up  1.00000          1.00000
 3 0.90388             osd.3       up  1.00000          1.00000
-6 2.55852     room B
-4 1.75114         host host3
 8 0.40369             osd.8       up  1.00000          1.00000
 9 0.40369             osd.9       up  1.00000          1.00000
10 0.40369             osd.10      up  1.00000          1.00000
11 0.54008             osd.11      up  1.00000          1.00000
-5 0.80737         host host4
12 0.40369             osd.12      up  1.00000          1.00000
13 0.40369             osd.13      up  1.00000          1.00000

--
Adam Carheden

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580


-- 
Respectfully,
Wes Dillingham
wes_dillingham@xxxxxxxxxxx
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
