Dear Ceph gurus...
Just wanted to report something that might be worth enhancing...
or maybe I am not following the right debugging procedure.
1. I am working with 0.92.2 and I am testing the cluster in several
disaster scenarios.
2. I have 32 OSDs distributed across 4 servers, i.e. 8 OSDs per
server.
3. I deliberately unmounted the filesystem of osd.4 but left the
daemon running, just to understand how the system would react (a
rough reproduction sketch follows after item d. below). This is
what happened:
a. While there was no I/O, the system did not realize that the
osd.4 filesystem was not mounted, and 'ceph -s' continued to
report HEALTH_OK for the cluster status.
b. When I started to impose some heavy I/O, the system began to
complain about slow requests. Curiously, osd.4 never appeared in
the logs.
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_WARN
170 requests are blocked > 32 sec
monmap e1: 3 mons at
{rccephmon1=192.231.127.8:6789/0,rccephmon2=192.231.127.34:6789/0,rccephmon3=192.231.127.26:6789/0}
election epoch 24, quorum 0,1,2
rccephmon1,rccephmon3,rccephmon2
mdsmap e162: 1/1/1 up {0=rccephmds=up:active}, 1
up:standby-replay
osdmap e1179: 32 osds: 32 up, 32 in
pgmap v907325: 2176 pgs, 2 pools, 4928 GB data, 1843
kobjects
14823 GB used, 74228 GB / 89051 GB avail
2174 active+clean
2 active+clean+replay
# ceph -w
(...)
2015-08-19 17:44:55.161731 osd.1 [WRN] 88 slow requests, 8
included below; oldest blocked for > 3156.325716 secs
2015-08-19 17:44:55.161940 osd.1 [WRN] slow request 1920.533342
seconds old, received at 2015-08-19 17:12:54.628258:
osd_op(client.44544.1:2266980 1000000022a.00006aec [write
524288~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161950 osd.1 [WRN] slow request 1920.511098
seconds old, received at 2015-08-19 17:12:54.650502:
osd_op(client.44544.1:2266988 1000000022a.00006aec [write
1048576~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161957 osd.1 [WRN] slow request 1920.510451
seconds old, received at 2015-08-19 17:12:54.651149:
osd_op(client.44544.1:2266996 1000000022a.00006aec [write
1572864~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161963 osd.1 [WRN] slow request 1920.488589
seconds old, received at 2015-08-19 17:12:54.673011:
osd_op(client.44544.1:2267004 1000000022a.00006aec [write
2097152~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161970 osd.1 [WRN] slow request 1920.482785
seconds old, received at 2015-08-19 17:12:54.678815:
osd_op(client.44544.1:2267012 1000000022a.00006aec [write
2621440~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
(...)
# grep "slow requests" /tmp/osd_failed.txt | awk '{print $3}' |
sort | uniq
osd.1
osd.11
osd.17
osd.23
osd.24
osd.26
osd.27
osd.31
osd.7
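(Thinking out loud: I guess one could take one of the blocked
objects from the 'ceph -w' output above and map it back to its PG
and acting set. The osd_op lines only show the pool id (5), so the
pool name has to be looked up first. I am not sure this would have
singled out osd.4, though.)
# ceph osd lspools
(to find the name of pool id 5)
# ceph osd map <name of pool 5> 1000000022a.00006aec
(prints the PG id plus the up and acting OSD sets for that object)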
c. None of the standard 'ceph osd' commands indicated that the
problematic OSD was osd.4. Only by looking at ceph-osd.4.log did we
find write error messages:
2015-08-19 16:52:17.552512 7f6f69973700 0 --
10.100.1.167:6809/23763 >> 10.100.1.169:6800/28352
pipe(0x175ca000 sd=169 :6809 s=0 pgs=0 cs=0 l=0
c=0x1f038000).accept connect_seq 180 vs existing 179 state
standby
2015-08-19 16:52:17.566701 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567230 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.567332 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567365 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570206 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570233 7f6f8a72b700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.570260 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570603 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572468 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572497 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
d. Restarting osd.4 remounted the osd.4 filesystem, but the slow
requests remained. I had to restart all the OSDs on the same
server as osd.4. Once that was done, Ceph realized there were
degraded PGs and started recovery I/O.
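For reference (point 3 above), a rough sketch of the unmount I am
talking about; the exact commands are illustrative, but I believe a
lazy unmount is needed since the running OSD keeps its data
directory busy:
# umount -l /var/lib/ceph/osd/ceph-4
# mount | grep ceph-4
(returns nothing, yet 'ceph osd tree' keeps showing osd.4 as up
and in)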
If I am doing my debugging correctly, I see some potential problems:
- Ceph was not capable of detecting the missing or problematic
filesystem until a client tried to access it for I/O.
- The source of the problem was not obvious from the warning
messages Ceph was throwing. In this particular situation, if I were
running hundreds of OSDs and did not know in advance which of them
was failing, it would be difficult to identify the problematic one.
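(The closest things I can think of are the per-OSD latency counters
and the detailed health output, but I am not sure they would have
pointed at osd.4 here, since the OSDs reporting the slow requests
were never osd.4 itself:)
# ceph osd perf
(fs_commit_latency / fs_apply_latency per OSD)
# ceph health detail
(which, I think, breaks the blocked requests down per reporting OSD)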
Maybe there is a more efficient way to perform this kind of
debugging? If not, couldn't an additional check show that the daemon
cannot properly access its filesystem? I can imagine situations
where, for example, the filesystem is mounted read-only.
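Something along the following lines, run on each OSD host, is the
kind of check I have in mind (the paths assume the default
/var/lib/ceph/osd/ceph-$id layout, and '.probe' is just an arbitrary
file name), although ideally the OSD daemon itself would do it:
# for d in /var/lib/ceph/osd/ceph-*; do
>     mountpoint -q "$d" || echo "$d: not a mountpoint"
>     touch "$d/.probe" && rm "$d/.probe" || echo "$d: not writable"
> done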
TIA
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937 |