Just to clarify - you unmounted the filesystem with "umount -l"? That is almost never a good idea, and it puts the OSD in a very unusual situation where I/O still works on the files it already has open, but it cannot open any new ones. I think that would be enough to confuse just about any piece of software. Was the journal on the filesystem or on a separate partition/device?
It's not the same as a read-only filesystem (I hit that once and no such havoc happened); in my experience the OSD traps and exits when something like that happens.
It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that would be more or less equivalent to umount -l :-)
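A rough illustration of that half-mounted state (the device, mountpoint and file below are made up, so treat this as a sketch rather than a recipe):

  mount /dev/sdb1 /mnt/test         # hypothetical test filesystem
  exec 3</mnt/test/somefile         # hold a descriptor open, like a running daemon would
  umount -l /mnt/test               # lazy unmount: detach the mount point immediately
  cat <&3                           # still works - the old descriptor keeps the fs reachable
  cat /mnt/test/somefile            # fails - the path now hits the empty underlying mountpoint

That last failure is an ENOENT, which matches the "(2) No such file or directory" errors in the OSD log further down.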
Jan
Dear Ceph gurus...
Just wanted to report something that may be interesting to
enhance... or maybe I am not doing the right debugging procedure.
1. I am working with 0.92.2 and I am testing the cluster in several
disaster scenarios.
2. I have 32 OSDs distributed across 4 servers, meaning that I have 8
OSDs per server.
3. I deliberately unmounted the filesystem of osd.4 but left the
daemon running. I just wanted to understand how the system would
react. This is what happened:
a. While there was no I/O, the system did not realize
that the osd.4 filesystem was not mounted, and 'ceph -s'
continued to report HEALTH_OK for the cluster status.
b. When I started to impose some heavy I/O, the system started
to complain about slow requests. Curiously, osd.4 never appears in
these warnings.
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_WARN
170 requests are blocked > 32 sec
monmap e1: 3 mons at
{rccephmon1=192.231.127.8:6789/0,rccephmon2=192.231.127.34:6789/0,rccephmon3=192.231.127.26:6789/0}
election epoch 24, quorum 0,1,2
rccephmon1,rccephmon3,rccephmon2
mdsmap e162: 1/1/1 up {0=rccephmds=up:active}, 1
up:standby-replay
osdmap e1179: 32 osds: 32 up, 32 in
pgmap v907325: 2176 pgs, 2 pools, 4928 GB data, 1843
kobjects
14823 GB used, 74228 GB / 89051 GB avail
2174 active+clean
2 active+clean+replay
# ceph -w
(...)
2015-08-19 17:44:55.161731 osd.1 [WRN] 88 slow requests, 8
included below; oldest blocked for > 3156.325716 secs
2015-08-19 17:44:55.161940 osd.1 [WRN] slow request 1920.533342
seconds old, received at 2015-08-19 17:12:54.628258:
osd_op(client.44544.1:2266980 1000000022a.00006aec [write
524288~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161950 osd.1 [WRN] slow request 1920.511098
seconds old, received at 2015-08-19 17:12:54.650502:
osd_op(client.44544.1:2266988 1000000022a.00006aec [write
1048576~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161957 osd.1 [WRN] slow request 1920.510451
seconds old, received at 2015-08-19 17:12:54.651149:
osd_op(client.44544.1:2266996 1000000022a.00006aec [write
1572864~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161963 osd.1 [WRN] slow request 1920.488589
seconds old, received at 2015-08-19 17:12:54.673011:
osd_op(client.44544.1:2267004 1000000022a.00006aec [write
2097152~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
2015-08-19 17:44:55.161970 osd.1 [WRN] slow request 1920.482785
seconds old, received at 2015-08-19 17:12:54.678815:
osd_op(client.44544.1:2267012 1000000022a.00006aec [write
2621440~524288 [1@-1]] 5.e0cf740e snapc 1=[] ondisk+write e1171)
currently waiting for replay end
(...)
# grep "slow requests" /tmp/osd_failed.txt | awk '{print $3}' |
sort | uniq
osd.1
osd.11
osd.17
osd.23
osd.24
osd.26
osd.27
osd.31
osd.7
c. None of the standard 'ceph osd' commands indicated that the
problematic OSD was osd.4. Only by looking at ceph-osd.4.log did we find
the write error messages (a sketch of how to grep for these across all
OSD logs follows after point d):
2015-08-19 16:52:17.552512 7f6f69973700 0 --
10.100.1.167:6809/23763 >> 10.100.1.169:6800/28352
pipe(0x175ca000 sd=169 :6809 s=0 pgs=0 cs=0 l=0
c=0x1f038000).accept connect_seq 180 vs existing 179 state
standby
2015-08-19 16:52:17.566701 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567230 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.567332 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.567365 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570206 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570233 7f6f8a72b700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
2015-08-19 16:52:17.570260 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.570603 7f6f8a72b700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572468 7f6f89d2a700 -1
filestore(/var/lib/ceph/osd/ceph-4) could not find
e6f81180/1000000022a.00000030/head//5 in index: (2) No such file
or directory
2015-08-19 16:52:17.572497 7f6f89d2a700 0
filestore(/var/lib/ceph/osd/ceph-4) write couldn't open
5.180_head/e6f81180/1000000022a.00000030/head//5: (2) No such
file or directory
d. A restart of osd.4 remounted the osd.4 filesystem,
but the slow requests remained. I had to restart all the
OSDs in the same server as osd.4. Once that was done, Ceph
realized there were degraded PGs and started recovery I/O.
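In hindsight, one way to pinpoint the failing daemon would have been to
grep every OSD log for the filestore error shown above (this is only a
suggestion I did not run at the time, and it assumes the default log
directory /var/log/ceph):

  # grep -l "could not find .* in index" /var/log/ceph/ceph-osd.*.log

which in this case should list only ceph-osd.4.log - but it still relies
on already suspecting a filesystem problem on that host.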
If I am doing my debugging correctly, I see some potential problems:
- Ceph was not capable of detecting that there was a missing or
problematic filesystem until a client tried to access it for I/O.
- The source of the problem was not obvious from the warning
messages Ceph was throwing. In this particular situation, if I were
running hundreds of OSDs and did not know in advance which of them
was failing, it would be difficult to identify the problematic one.
Maybe there is a more efficient way to perform this kind of debugging?
If not, could an additional check show that the daemon cannot
properly access the filesystem? I can imagine, for example, a
situation where the filesystem is mounted read-only.
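As a minimal sketch of the kind of external check I have in mind
(assuming every OSD keeps its data directory on a dedicated filesystem
under the default /var/lib/ceph/osd path), even something this simple
would have caught my test case:

  for dir in /var/lib/ceph/osd/ceph-*; do
      # mountpoint(1) from util-linux; -q just sets the exit code
      mountpoint -q "$dir" || echo "WARNING: $dir is not a mounted filesystem"
  done

It would not catch the read-only case, which would need an additional
write test, but it shows the sort of sanity check I am asking about.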
TIA
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com