I have 5 kernel-based RBD clients, all running kernel 3.11.1 on
Ubuntu 13.04, that all failed with a write error at the same time, and
I need help figuring out what caused the failure.
The 5 clients were all using the same pool, and each had its own image
with an 18 TB XFS file system on it.
The errors reported in syslog on all 5 clients, which all arrived at
about the same time, were:
Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: write 8000 at
89591958000 (158000)
Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: result -28
xferred 8000
Sep 26 16:51:52 tca14 kernel: [245058.782519] rbd: rbd1: write 8000 at
89593150000 (150000)
Sep 26 16:51:52 tca14 kernel: [245058.782524] rbd: rbd1: result -28
xferred 8000
Sep 26 16:51:33 tca15 kernel: [245043.427752] rbd: rbd1: write 8000 at
89593638000 (238000)
Sep 26 16:51:33 tca15 kernel: [245043.427758] rbd: rbd1: result -28
xferred 8000
Sep 26 16:51:40 tca16 kernel: [245054.429278] rbd: rbd1: write 8000 at
89593128000 (128000)
Sep 26 16:51:40 tca16 kernel: [245054.429284] rbd: rbd1: result -28
xferred 8000
Sep 26 16:51:23 k6 kernel: [90574.093432] rbd: rbd1: write 80000 at
f3e93a80000 (280000)
Sep 26 16:51:23 k6 kernel: [90574.093441] rbd: rbd1: result -28
xferred 80000
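If I am reading the error right, result -28 is -ENOSPC ("No space
left on device") coming back from the cluster, rather than a problem
with the local disks on the clients. Just as a sanity check on the
errno mapping (this is only a check against the kernel headers,
nothing Ceph-specific):
# grep -w 28 /usr/include/asm-generic/errno-base.h
#define ENOSPC          28      /* No space left on device */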
Each client had been running read/write tests, and on some of the
clients the tests had been running for more than 2 days before the
failure.
The ceph version on the cluster is 0.67.3, running on Ubuntu 13.04
with 3.11.1 kernels. The cluster consists of 3 monitors and 6 OSD
nodes with 15 disk drives each, for a total of 90 OSDs. All monitors
and OSDs are running:
# ceph -v
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
---
An ls of the rbd-pool shows
# rbd ls -l -p rbd-pool
NAME       SIZE    PARENT  FMT  PROT  LOCK
k6_tst     17578G          1
tca10_tst  17578G          1
tca14_tst  17578G          1
tca15_tst  17578G          1
tca16_tst  17578G          1
---
There is still space in the pool:
# ceph df
GLOBAL:
    SIZE    AVAIL   RAW USED    %RAW USED
    249T    118T    131T        52.60
POOLS:
    NAME        ID    USED     %USED    OBJECTS
    data        0     0        0        0
    metadata    1     0        0        0
    rbd         2     8        0        1
    rbd-pool    3     67187G   26.30    17713336
---
# ceph health detail
HEALTH_WARN 9 near full osd(s)
osd.9 is near full at 85%
osd.29 is near full at 85%
osd.43 is near full at 91%
osd.45 is near full at 88%
osd.47 is near full at 88%
osd.55 is near full at 94%
osd.59 is near full at 94%
osd.67 is near full at 94%
osd.83 is near full at 94%
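If the default ratios apply here (nearfull 0.85 and full 0.95, as far
as I know), then osd.55, osd.59, osd.67 and osd.83 at 94% were only a
percent or so away from tripping the cluster-wide full flag, and my
understanding is that once any single OSD crosses the full ratio the
whole cluster refuses writes, no matter how much free space the pool
shows overall. These are the commands I believe would show and, if
necessary, temporarily raise the ratios on Dumpling -- please correct
me if they are not right; I have not run the second one, and 0.97 is
only an example value:
# ceph pg dump | grep -E 'full_ratio|nearfull_ratio'
# ceph pg set_full_ratio 0.97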
---
I did find these messages on one of my monitors, which occurred
around the same time as the write failures:
2013-09-26 16:50:43.567007 7fc2cc197700 0
mon.tca11@0(leader).data_health(10) update_stats avail 91% total
70303160 used 2625788 avail 64083132
2013-09-26 16:51:23.519378 7fc2cc197700 1 mon.tca11@0(leader).osd e769
New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:51:23.520896 7fc2cb996700 1 mon.tca11@0(leader).osd e770
e770: 90 osds: 90 up, 90 in full
2013-09-26 16:51:23.521808 7fc2cb996700 0 log [INF] : osdmap e770: 90
osds: 90 up, 90 in full
2013-09-26 16:51:43.567118 7fc2cc197700 0
mon.tca11@0(leader).data_health(10) update_stats avail 91% total
70303160 used 2631904 avail 64077016
2013-09-26 16:52:43.567227 7fc2cc197700 0
mon.tca11@0(leader).data_health(10) update_stats avail 91% total
70303160 used 2632956 avail 64075964
2013-09-26 16:53:28.534868 7fc2cc197700 1 mon.tca11@0(leader).osd e770
New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:53:28.536477 7fc2cb996700 1 mon.tca11@0(leader).osd e771
e771: 90 osds: 90 up, 90 in
2013-09-26 16:53:28.538782 7fc2cb996700 0 log [INF] : osdmap e771: 90
osds: 90 up, 90 in
2013-09-26 16:53:43.567331 7fc2cc197700 0
mon.tca11@0(leader).data_health(10) update_stats avail 91% total
70303160 used 2623788 avail 64085132
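The "New setting for CEPH_OSDMAP_FULL -- doing propose" lines look to
me like the monitors marked the whole cluster full at 16:51:23 (osdmap
e770 reports "90 up, 90 in full") and cleared the flag again at
16:53:28 (e771 drops the "full"), which lines up with the -28 errors
on the clients. If it happens again, I assume I can confirm whether
the flag is set with something like this, where the flags line should
list "full" while writes are being refused:
# ceph osd dump | grep ^flags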
---
All my OSDs are reporting that they are up:
# ceph osd tree
# id weight type name up/down reweight
-1 249.7 root default
-2 51.72 host tca22
0 3.63 osd.0 up 1
6 3.63 osd.6 up 1
12 3.63 osd.12 up 1
18 3.63 osd.18 up 1
24 3.63 osd.24 up 1
30 3.63 osd.30 up 1
36 2.72 osd.36 up 1
42 3.63 osd.42 up 1
48 3.63 osd.48 up 1
54 2.72 osd.54 up 1
60 3.63 osd.60 up 1
66 3.63 osd.66 up 1
72 2.72 osd.72 up 1
78 3.63 osd.78 up 1
84 3.63 osd.84 up 1
-3 31.5 host tca23
1 3.63 osd.1 up 1
7 0.26 osd.7 up 1
13 2.72 osd.13 up 1
19 2.72 osd.19 up 1
25 0.26 osd.25 up 1
31 3.63 osd.31 up 1
37 2.72 osd.37 up 1
43 0.26 osd.43 up 1
49 3.63 osd.49 up 1
55 0.26 osd.55 up 1
61 3.63 osd.61 up 1
67 0.26 osd.67 up 1
73 3.63 osd.73 up 1
79 0.26 osd.79 up 1
85 3.63 osd.85 up 1
-4 51.72 host tca24
2 3.63 osd.2 up 1
8 3.63 osd.8 up 1
14 3.63 osd.14 up 1
20 3.63 osd.20 up 1
26 3.63 osd.26 up 1
32 3.63 osd.32 up 1
38 2.72 osd.38 up 1
44 3.63 osd.44 up 1
50 3.63 osd.50 up 1
56 2.72 osd.56 up 1
62 3.63 osd.62 up 1
68 3.63 osd.68 up 1
74 2.72 osd.74 up 1
80 3.63 osd.80 up 1
86 3.63 osd.86 up 1
-5 31.5 host tca25
3 3.63 osd.3 up 1
9 0.26 osd.9 up 1
15 2.72 osd.15 up 1
21 2.72 osd.21 up 1
27 0.26 osd.27 up 1
33 3.63 osd.33 up 1
39 2.72 osd.39 up 1
45 0.26 osd.45 up 1
51 3.63 osd.51 up 1
57 0.26 osd.57 up 1
63 3.63 osd.63 up 1
69 0.26 osd.69 up 1
75 3.63 osd.75 up 1
81 0.26 osd.81 up 1
87 3.63 osd.87 up 1
-6 51.72 host tca26
4 3.63 osd.4 up 1
10 3.63 osd.10 up 1
16 3.63 osd.16 up 1
22 3.63 osd.22 up 1
28 3.63 osd.28 up 1
34 3.63 osd.34 up 1
40 2.72 osd.40 up 1
46 3.63 osd.46 up 1
52 3.63 osd.52 up 1
58 2.72 osd.58 up 1
64 3.63 osd.64 up 1
70 3.63 osd.70 up 1
76 2.72 osd.76 up 1
82 3.63 osd.82 up 1
88 3.63 osd.88 up 1
-7 31.5 host tca27
5 3.63 osd.5 up 1
11 0.26 osd.11 up 1
17 2.72 osd.17 up 1
23 2.72 osd.23 up 1
29 0.26 osd.29 up 1
35 3.63 osd.35 up 1
41 2.72 osd.41 up 1
47 0.26 osd.47 up 1
53 3.63 osd.53 up 1
59 0.26 osd.59 up 1
65 3.63 osd.65 up 1
71 0.26 osd.71 up 1
77 3.63 osd.77 up 1
83 0.26 osd.83 up 1
89 3.63 osd.89 up 1
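One thing I notice in the tree, and maybe this is the answer to my
own question: unless I am misreading it, every one of the near-full
OSDs from the health detail (9, 29, 43, 45, 47, 55, 59, 67, 83) is one
of the small 0.26-weight OSDs, so perhaps the small drives end up
proportionally fuller than the 3.63-weight ones and one of them
briefly crossed the full ratio. If that is the diagnosis, is lowering
the reweight on the small OSDs the right way to push some data off
them? Something like the line below is what I have in mind, but I have
not run it and 0.8 is only a guess at a value:
# ceph osd reweight 55 0.8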
Kernel version on all systems:
# cat /proc/version
Linux version 3.11.1-031101-generic (apw@gomeisa) (gcc version 4.6.3
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201309141102 SMP Sat Sep 14 15:02:49
UTC 2013
I would really like to know why it failed before I restart my testing.
Thanks in advance,
Eric