Just to clarify - you unmounted the filesystem with "umount -l"? That is almost never a good idea, and it puts the OSD in a very unusual situation where I/O still works on the files it already has open, but it cannot open any new ones. I think that would be enough to confuse just about any piece of software. Was the journal on the filesystem or on a separate partition/device?
It's not the same as a read-only filesystem (I hit that once and no such havoc happened); in my experience the OSD traps and exits when something like that happens.
It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that would be more or less equivalent to umount -l :-)
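A rough illustration of that half-mounted state (the device, mountpoint and file below are made up, so treat this as a sketch rather than a recipe):

  mount /dev/sdb1 /mnt/test         # hypothetical test filesystem
  exec 3</mnt/test/somefile         # hold a descriptor open, like a running daemon would
  umount -l /mnt/test               # lazy unmount: detach the mount point immediately
  cat <&3                           # still works - the old descriptor keeps the fs reachable
  cat /mnt/test/somefile            # fails - the path now hits the empty underlying mountpoint

That last failure is an ENOENT, which matches the "(2) No such file or directory" errors in the OSD log further down.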
Jan
Dear Ceph gurus...
Just wanted to report something that may be interesting to
enhance... or maybe I am not doing the right debugging procedure.
1. I am working with 0.92.2 and I am testing the cluster in several
disaster scenarios.
2. I have 32 OSDs distributed across 4 servers, meaning that I have 8
OSDs per server.
3. I deliberately unmounted the filesystem of osd.4 but left the
daemon running. I just wanted to understand how the system would
react. This is what happened:
a. While there was no I/O, the system did not realize
that the osd.4 filesystem was not mounted, and 'ceph -s'
continued to report HEALTH_OK for the cluster status.
b. When I started to impose some heavy I/O, the system started
to complain about slow requests. Curiously, osd.4 never appears in
these warnings.
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_WARN
170 requests are blocked > 32 sec
monmap e1: 3 mons at
{rccephmon1=192.231.127.8:6789/0,rccephmon2=192.231.127.34:6789/0,rccephmon3=192.231.127.26:6789/0}
election epoch 24, quorum 0,1,2
rccephmon1,rccephmon3,rccephmon2
mdsmap e162: 1/1/1 up {0=rccephmds=up:active}, 1
up:standby-replay
osdmap e1179: 32 osds: 32 up, 32 in
pgmap v907325: 2176 pgs, 2 pools, 4928 GB data, 1843
kobjects
14823 GB used, 74228 GB / 89051 GB avail
2174 active+clean
2 active+clean+replay
# ceph -w
(...)
2015-08-19 17:44:55.161731 osd.1 [WRN] 88 slow requests, 8
included below; oldest blocked for > 3156.325716 secs
2015-08-19 17:44:55.161940 osd.1 [WRN] slow request 1920.533342
seconds old, received at 2015-08-19 17:12:54.628258:
osd_op(client.44544.1:2266980 1000000022a.00006aec [write
524288~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161950 osd.1 [WRN] slow request 1920.511098
seconds old, received at 2015-08-19 17:12:54.650502:
osd_op(client.44544.1:2266988 1000000022a.00006aec [write
1048576~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161957 osd.1 [WRN] slow request 1920.510451
seconds old, received at 2015-08-19 17:12:54.651149:
osd_op(client.44544.1:2266996 1000000022a.00006aec [write
1572864~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161963 osd.1 [WRN] slow request 1920.488589
seconds old, received at 2015-08-19 17:12:54.673011:
osd_op(client.44544.1:2267004 1000000022a.00006aec [write
2097152~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161970 osd.1 [WRN] slow request 1920.482785
seconds old, received at 2015-08-19 17:12:54.678815:
osd_op(client.44544.1:2267012 1000000022a.00006aec [write
2621440~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
(...)
# grep "slow requests" /tmp/osd_failed.txt | awk '{print $3}' |
sort | uniq
osd.1
osd.11
osd.17
osd.23
osd.24
osd.26
osd.27
osd.31
osd.7
c. None of the standard 'ceph osd' commands indicated that the
problematic OSD was osd.4. Only by looking at ceph-osd.4.log did we find
the write error messages (a sketch of how to grep for these across all
OSD logs follows after point d):
2015-08-19 16:52:17.552512 7f6f69973700 0 --
10.100.1.167:6809/23763 >> 10.100.1.169:6800/28352
pipe(0x175ca000 sd=169 :6809 s=0 pgs=0 cs=0 l=0
c=0x1f038000).accept connect_seq 180 vs existing 179 state
standby
2015-08-19 16:52:17.566701 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567230 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.567332 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567365 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570206 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570233 7f6f8a72b700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.570260 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570603 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572468 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572497 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
d. A restart of osd.4 remounted the osd.4 filesystem,
but the slow requests remained. I had to restart all the
OSDs in the same server as osd.4. Once that was done, Ceph
realized there were degraded PGs and started recovery I/O.
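In hindsight, one way to pinpoint the failing daemon would have been to
grep every OSD log for the filestore error shown above (this is only a
suggestion I did not run at the time, and it assumes the default log
directory /var/log/ceph):

  # grep -l "could not find .* in index" /var/log/ceph/ceph-osd.*.log

which in this case should list only ceph-osd.4.log - but it still relies
on already suspecting a filesystem problem on that host.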
If I am doing my debugging correctly, I see some potential problems:
- Ceph was not capable of detecting that there was a missing or
problematic filesystem until a client tried to access it for I/O.
- The source of the problem was not obvious from the warning
messages Ceph was throwing. In this particular situation, if I were
running hundreds of OSDs and did not know in advance which of them
was failing, it would be difficult to identify the problematic one.
Maybe there is a more efficient way to perform this kind of debugging?
If not, could an additional check show that the daemon cannot
properly access the filesystem? I can imagine, for example, a
situation where the filesystem is mounted read-only.
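As a minimal sketch of the kind of external check I have in mind
(assuming every OSD keeps its data directory on a dedicated filesystem
under the default /var/lib/ceph/osd path), even something this simple
would have caught my test case:

  for dir in /var/lib/ceph/osd/ceph-*; do
      # mountpoint(1) from util-linux; -q just sets the exit code
      mountpoint -q "$dir" || echo "WARNING: $dir is not a mounted filesystem"
  done

It would not catch the read-only case, which would need an additional
write test, but it shows the sort of sanity check I am asking about.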
TIA
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com