Re: ceph osd debug question / proposal

Just to clarify - you unmounted the filesystem with "umount -l"? That is almost never a good idea, and it puts the OSD in a very unusual situation where I/O still works on the files it already has open, but it cannot open any new ones. I think this would be enough to confuse just about any piece of software.
Was the journal on the filesystem or on a separate partition/device?
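
To illustrate, a rough sketch using the ceph-4 path from your setup (and assuming the daemon was started with the usual "-i 4"):

# lazy unmount detaches the mount point immediately, but descriptors the
# ceph-osd daemon already holds keep working against the detached filesystem
umount -l /var/lib/ceph/osd/ceph-4

# the mount is gone from the namespace...
grep ceph-4 /proc/mounts || echo "ceph-4 no longer mounted"

# ...yet the daemon is still running and writing through its old descriptors,
# while any attempt to open a *new* object file lands on the now-empty directory
ps -C ceph-osd -o pid,args | grep -- '-i 4'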

It's not the same as a read-only filesystem (I hit that once and no such havoc happened); in my experience the OSD traps and exits when something like that happens.

It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that would be more or less equivalent to umount -l :-)

Jan



On 20 Aug 2015, at 08:01, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:

Dear Ceph gurus...

Just wanted to report something that may be interesting to enhance... or maybe I am not doing the right debugging procedure.

1. I am working with 0.92.2 and testing the cluster in several disaster scenarios.

2. I have 32 OSDs distributed across 4 servers, i.e. 8 OSDs per server.

3. I deliberately unmounted the filesystem of osd.4 while leaving the daemon running. I just wanted to understand how the system would react. This is what happened:
a. While there was no I/O, the system did not realize that the osd.4 filesystem was not mounted, and 'ceph -s' continued to report HEALTH_OK for the cluster status.

b. When I started to impose some heavy I/O, the system began to complain about slow requests. Curiously, osd.4 never appears in those warnings.
# ceph -s
    cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
     health HEALTH_WARN
            170 requests are blocked > 32 sec
     monmap e1: 3 mons at {rccephmon1=192.231.127.8:6789/0,rccephmon2=192.231.127.34:6789/0,rccephmon3=192.231.127.26:6789/0}
            election epoch 24, quorum 0,1,2 rccephmon1,rccephmon3,rccephmon2
     mdsmap e162: 1/1/1 up {0=rccephmds=up:active}, 1 up:standby-replay
     osdmap e1179: 32 osds: 32 up, 32 in
      pgmap v907325: 2176 pgs, 2 pools, 4928 GB data, 1843 kobjects
            14823 GB used, 74228 GB / 89051 GB avail
                2174 active+clean
                   2 active+clean+replay

# ceph -w
(...)         
2015-08-19 17:44:55.161731 osd.1 [WRN] 88 slow requests, 8 included below; oldest blocked for > 3156.325716 secs
2015-08-19 17:44:55.161940 osd.1 [WRN] slow request 1920.533342 seconds old, received at 2015-08-19 17:12:54.628258: osd_op(client.44544.1:2266980 1000000022a.00006aec [write 524288~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171) currently waiting for replay end
2015-08-19 17:44:55.161950 osd.1 [WRN] slow request 1920.511098 seconds old, received at 2015-08-19 17:12:54.650502: osd_op(client.44544.1:2266988 1000000022a.00006aec [write 1048576~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171) currently waiting for replay end
2015-08-19 17:44:55.161957 osd.1 [WRN] slow request 1920.510451 seconds old, received at 2015-08-19 17:12:54.651149: osd_op(client.44544.1:2266996 1000000022a.00006aec [write 1572864~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171) currently waiting for replay end
2015-08-19 17:44:55.161963 osd.1 [WRN] slow request 1920.488589 seconds old, received at 2015-08-19 17:12:54.673011: osd_op(client.44544.1:2267004 1000000022a.00006aec [write 2097152~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171) currently waiting for replay end
2015-08-19 17:44:55.161970 osd.1 [WRN] slow request 1920.482785 seconds old, received at 2015-08-19 17:12:54.678815: osd_op(client.44544.1:2267012 1000000022a.00006aec [write 2621440~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171) currently waiting for replay end
(...)                
# grep "slow requests" /tmp/osd_failed.txt  | awk '{print $3}' | sort | uniq
osd.1
osd.11
osd.17
osd.23
osd.24
osd.26
osd.27
osd.31
osd.7
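
For reference, a sketch of how one could map one of the blocked objects from the 'ceph -w' lines above back to its PG and acting set; <poolname> is a placeholder for whatever pool the object lives in:

# the acting set this prints is where to look for the misbehaving OSD
ceph osd map <poolname> 1000000022a.00006aec

# the OSDs reporting slow requests can also be asked what they are stuck on
# (run on the host where osd.1 lives)
ceph daemon osd.1 dump_ops_in_flight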

c. None of the standard 'ceph osd' commands indicated that the problematic OSD was osd.4. Only by looking at ceph-osd.4.log did we find the write error messages:
2015-08-19 16:52:17.552512 7f6f69973700  0 -- 10.100.1.167:6809/23763 >> 10.100.1.169:6800/28352 pipe(0x175ca000 sd=169 :6809 s=0 pgs=0 cs=0 l=0 c=0x1f038000).accept connect_seq 180 vs existing 179 state standby
2015-08-19 16:52:17.566701 7f6f89d2a700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.567230 7f6f89d2a700  0 filestore(/var/lib/ceph/osd/ceph-4) write couldn't open 5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such file or directory
2015-08-19 16:52:17.567332 7f6f89d2a700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.567365 7f6f89d2a700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.570206 7f6f8a72b700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.570233 7f6f8a72b700  0 filestore(/var/lib/ceph/osd/ceph-4) write couldn't open 5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such file or directory
2015-08-19 16:52:17.570260 7f6f8a72b700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.570603 7f6f8a72b700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.572468 7f6f89d2a700 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find e6f81180/1000000022a.00000030/head//5 in index: (2) No such file or directory
2015-08-19 16:52:17.572497 7f6f89d2a700  0 filestore(/var/lib/ceph/osd/ceph-4) write couldn't open 5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such file or directory
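
For what it's worth, a sketch of how one could sweep a whole host for this failure mode, since the cluster-level output never names the OSD (paths assume the default log and data locations):

# scan every OSD log on the host for the filestore errors shown above
grep -l "could not find .* in index" /var/log/ceph/ceph-osd.*.log

# and verify that the suspect's data directory is still an actual mount point
mountpoint /var/lib/ceph/osd/ceph-4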

d. A restart of osd.4 remounted the osd.4 filesystem, but the slow requests remained. I had to restart all the OSDs on the same server as osd.4. Once that was done, Ceph realized there were degraded PGs and started recovery I/O.

If I am debugging this correctly, I see some potential problems:

- Ceph was not able to detect the missing or problematic filesystem until a client tried to access it for I/O.
- Even then, the source of the problem was not obvious from the warning messages Ceph was throwing. If I were running hundreds of OSDs and did not know in advance which of them was failing, it would be difficult to identify the problematic one.

Maybe there is a more efficient way to perform this kind of debugging? If not, couldn't an additional check show that the daemon cannot properly access its filesystem? I can imagine, for example, a situation where the filesystem becomes read-only.
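
As an illustration of the kind of check I mean, here is a rough sketch of a hypothetical external probe (not anything Ceph ships) that tries to create and remove a scratch file in each OSD data directory:

# warn about any OSD data directory that is unmounted or read-only
for d in /var/lib/ceph/osd/ceph-*; do
    probe="$d/.rw_probe.$$"
    if touch "$probe" 2>/dev/null && rm -f "$probe"; then
        : # directory is present and writable
    else
        echo "WARNING: $d is not writable (unmounted or read-only?)"
    fi
done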

TIA
Goncalo

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
