Multiple kernel RBD client failures

I have 5 RBD kernel-based clients, all running kernel 3.11.1 on Ubuntu 13.04, that all failed with a write error at the same time, and I need help figuring out what caused the failure.

The 5 clients were all using the same pool, and each had its own image with an 18TB XFS file system on it.

The errors reported in syslog on all 5 clients, which all came at about the same time, were:

Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: write 8000 at 89591958000 (158000)
Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:52 tca14 kernel: [245058.782519] rbd: rbd1: write 8000 at 89593150000 (150000)
Sep 26 16:51:52 tca14 kernel: [245058.782524] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:33 tca15 kernel: [245043.427752] rbd: rbd1: write 8000 at 89593638000 (238000)
Sep 26 16:51:33 tca15 kernel: [245043.427758] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:40 tca16 kernel: [245054.429278] rbd: rbd1: write 8000 at 89593128000 (128000)
Sep 26 16:51:40 tca16 kernel: [245054.429284] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:23 k6 kernel: [90574.093432] rbd: rbd1: write 80000 at f3e93a80000 (280000)
Sep 26 16:51:23 k6 kernel: [90574.093441] rbd: rbd1: result -28 xferred 80000

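If I am reading the result code right, -28 should be -ENOSPC ("No space left on device"), since the kernel reports errors as negative errno values. As a sanity check of my own (not something from the rbd docs), the kernel headers on the clients agree with that number:

# grep -w ENOSPC /usr/include/asm-generic/errno-base.h
#define ENOSPC          28      /* No space left on device */
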
The clients had been running read/write tests, and on some of them the tests had been running for more than 2 days before they failed.

The ceph version on the cluster is 0.67.3, running on Ubuntu 13.04 with 3.11.1 kernels. The cluster config includes 3 monitors and 6 OSD nodes with 15 disk drives each, for a total of 90 OSDs. All monitors and OSDs are running:

# ceph -v
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)

---
An ls of the rbd-pool shows:
# rbd ls -l -p rbd-pool
NAME       SIZE    PARENT FMT PROT LOCK
k6_tst     17578G          1
tca10_tst  17578G          1
tca14_tst  17578G          1
tca15_tst  17578G          1
tca16_tst  17578G          1

---
There is still space in the pool:
# ceph df
GLOBAL:
   SIZE     AVAIL     RAW USED     %RAW USED
   249T     118T      131T         52.60

POOLS:
   NAME         ID     USED       %USED     OBJECTS
   data         0      0          0         0
   metadata     1      0          0         0
   rbd          2      8          0         1
   rbd-pool     3      67187G     26.30     17713336

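(As a sanity check on these numbers: 67187G in rbd-pool at what looks like 2x replication comes to roughly 131T, matching the 131T / 52.60% raw used above, so at the pool level the space accounting looks consistent to me.)
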
---
# ceph health detail
HEALTH_WARN 9 near full osd(s)
osd.9 is near full at 85%
osd.29 is near full at 85%
osd.43 is near full at 91%
osd.45 is near full at 88%
osd.47 is near full at 88%
osd.55 is near full at 94%
osd.59 is near full at 94%
osd.67 is near full at 94%
osd.83 is near full at 94%

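My understanding is that this warning is driven by the mon_osd_nearfull_ratio / mon_osd_full_ratio settings (0.85 and 0.95 by default, if I remember right), and that it is per-OSD utilization, not pool utilization, that matters. The ratios the cluster is actually using should show up in the pg dump header -- this is just how I have been checking it, so please correct me if there is a better way:

# ceph pg dump | grep -E '^full_ratio|^nearfull_ratio'
full_ratio 0.95
nearfull_ratio 0.85
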
---
I did find these messages on one of my monitors, which occurred around the same time as the write failures:

2013-09-26 16:50:43.567007 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2625788 avail 64083132
2013-09-26 16:51:23.519378 7fc2cc197700 1 mon.tca11@0(leader).osd e769 New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:51:23.520896 7fc2cb996700 1 mon.tca11@0(leader).osd e770 e770: 90 osds: 90 up, 90 in full
2013-09-26 16:51:23.521808 7fc2cb996700 0 log [INF] : osdmap e770: 90 osds: 90 up, 90 in full
2013-09-26 16:51:43.567118 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2631904 avail 64077016
2013-09-26 16:52:43.567227 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2632956 avail 64075964
2013-09-26 16:53:28.534868 7fc2cc197700 1 mon.tca11@0(leader).osd e770 New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:53:28.536477 7fc2cb996700 1 mon.tca11@0(leader).osd e771 e771: 90 osds: 90 up, 90 in
2013-09-26 16:53:28.538782 7fc2cb996700 0 log [INF] : osdmap e771: 90 osds: 90 up, 90 in
2013-09-26 16:53:43.567331 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2623788 avail 64085132

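If I am reading these correctly, the osdmap was flagged "full" in epoch 770 at 16:51:23 (note the trailing "full" on the "90 up, 90 in full" lines) and went back to normal in epoch 771 at 16:53:28, which brackets the write errors on the clients almost exactly. I assume that while the flag was set it would also have shown up in the osdmap flags, e.g.:

# ceph osd dump | grep ^flags
flags full
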
---
All my OSDs are reporting they are up:
# ceph osd tree
# id	weight	type name	up/down	reweight
-1	249.7	root default
-2	51.72		host tca22
0	3.63			osd.0	up	1	
6	3.63			osd.6	up	1	
12	3.63			osd.12	up	1	
18	3.63			osd.18	up	1	
24	3.63			osd.24	up	1	
30	3.63			osd.30	up	1	
36	2.72			osd.36	up	1	
42	3.63			osd.42	up	1	
48	3.63			osd.48	up	1	
54	2.72			osd.54	up	1	
60	3.63			osd.60	up	1	
66	3.63			osd.66	up	1	
72	2.72			osd.72	up	1	
78	3.63			osd.78	up	1	
84	3.63			osd.84	up	1	
-3	31.5		host tca23
1	3.63			osd.1	up	1	
7	0.26			osd.7	up	1	
13	2.72			osd.13	up	1	
19	2.72			osd.19	up	1	
25	0.26			osd.25	up	1	
31	3.63			osd.31	up	1	
37	2.72			osd.37	up	1	
43	0.26			osd.43	up	1	
49	3.63			osd.49	up	1	
55	0.26			osd.55	up	1	
61	3.63			osd.61	up	1	
67	0.26			osd.67	up	1	
73	3.63			osd.73	up	1	
79	0.26			osd.79	up	1	
85	3.63			osd.85	up	1	
-4	51.72		host tca24
2	3.63			osd.2	up	1	
8	3.63			osd.8	up	1	
14	3.63			osd.14	up	1	
20	3.63			osd.20	up	1	
26	3.63			osd.26	up	1	
32	3.63			osd.32	up	1	
38	2.72			osd.38	up	1	
44	3.63			osd.44	up	1	
50	3.63			osd.50	up	1	
56	2.72			osd.56	up	1	
62	3.63			osd.62	up	1	
68	3.63			osd.68	up	1	
74	2.72			osd.74	up	1	
80	3.63			osd.80	up	1	
86	3.63			osd.86	up	1	
-5	31.5		host tca25
3	3.63			osd.3	up	1	
9	0.26			osd.9	up	1	
15	2.72			osd.15	up	1	
21	2.72			osd.21	up	1	
27	0.26			osd.27	up	1	
33	3.63			osd.33	up	1	
39	2.72			osd.39	up	1	
45	0.26			osd.45	up	1	
51	3.63			osd.51	up	1	
57	0.26			osd.57	up	1	
63	3.63			osd.63	up	1	
69	0.26			osd.69	up	1	
75	3.63			osd.75	up	1	
81	0.26			osd.81	up	1	
87	3.63			osd.87	up	1	
-6	51.72		host tca26
4	3.63			osd.4	up	1	
10	3.63			osd.10	up	1	
16	3.63			osd.16	up	1	
22	3.63			osd.22	up	1	
28	3.63			osd.28	up	1	
34	3.63			osd.34	up	1	
40	2.72			osd.40	up	1	
46	3.63			osd.46	up	1	
52	3.63			osd.52	up	1	
58	2.72			osd.58	up	1	
64	3.63			osd.64	up	1	
70	3.63			osd.70	up	1	
76	2.72			osd.76	up	1	
82	3.63			osd.82	up	1	
88	3.63			osd.88	up	1	
-7	31.5		host tca27
5	3.63			osd.5	up	1	
11	0.26			osd.11	up	1	
17	2.72			osd.17	up	1	
23	2.72			osd.23	up	1	
29	0.26			osd.29	up	1	
35	3.63			osd.35	up	1	
41	2.72			osd.41	up	1	
47	0.26			osd.47	up	1	
53	3.63			osd.53	up	1	
59	0.26			osd.59	up	1	
65	3.63			osd.65	up	1	
71	0.26			osd.71	up	1	
77	3.63			osd.77	up	1	
83	0.26			osd.83	up	1	
89	3.63			osd.89	up	1	

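One thing I notice in the tree is that the nine OSDs reported as near full (osd.9, 29, 43, 45, 47, 55, 59, 67, 83) all have CRUSH weight 0.26, i.e. the smallest drives, so the small drives seem to be filling up well ahead of the large ones. Before I restart the testing I am tempted to rebalance them, perhaps with something along the lines of:

# ceph osd reweight-by-utilization
or, per OSD:
# ceph osd reweight 43 0.8

but suggestions are welcome if there is a better way to handle the uneven drive sizes.
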
Kernel version on all systems:
# cat /proc/version
Linux version 3.11.1-031101-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201309141102 SMP Sat Sep 14 15:02:49 UTC 2013

I would really like to know why it failed before I restart my testing.

Thanks in advance,

Eric



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



