Hi Jan...
We were interested in the situation where an rm -Rf is done in the
current/ directory of the OSD. Here are my findings:
1. In this exercise, we simply deleted all the content
of /var/lib/ceph/osd/ceph-23/current.
# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)
/dev/sdj1 2918054776 434548 2917620228 1% /var/lib/ceph/osd/ceph-23
2. After some time, Ceph enters an error state because
it detects an inconsistent PG and several scrub errors:
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1
up:standby-replay
osdmap e1903: 32 osds: 32 up, 32 in
pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843
kobjects
14424 GB used, 74627 GB / 89051 GB avail
2175 active+clean
1 active+clean+inconsistent
client io 989 B/s rd, 1 op/s
3. Looking at ceph.log on the mon, it is possible to
identify the affected PG and the OSD responsible for
the error:
# tail -f /var/log/ceph/ceph.log
(...)
2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: 5.336 shard 23 missing e300336/100000001b0.00002825/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 32600336/10000000109.00000754/head//5
be_compare_scrubmaps: 5.336 shard 23 missing dd700336/100000001ab.00000b91/head//5
be_compare_scrubmaps: 5.336 shard 23 missing bc220336/100000001bd.0000387c/head//5
be_compare_scrubmaps: 5.336 shard 23 missing f9320336/10000000201.00002e96/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/10000000228.0000d501/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/100000001bc.00003e06/head//5
be_compare_scrubmaps: 5.336 shard 23 missing cd20336/10000000227.00004775/head//5
be_compare_scrubmaps: 5.336 shard 23 missing cef20336/100000001b9.00002260/head//5
be_compare_scrubmaps: 5.336 shard 23 missing ba240336/100000001d8.00000630/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/100000001b1.00002089/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e840336/100000001ba.00002618/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/100000000e9.00000287/head//5
be_compare_scrubmaps: 5.336 shard 23 missing b7950336/100000000e4.00000800/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 94560336/100000001b4.00002834/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 71370336/10000000051.00000179/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 62370336/100000001b5.00003b5b/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e9670336/10000000120.000003f8/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/1000000019a.00000d4b/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 11880336/100000001e8.000003e9/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/10000000083.00000255/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 97790336/100000001e7.00000668/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/100000001b6.0000278c/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 4eda0336/1000000019e.000036ad/head//5
(...)
2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 :
cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects
2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 :
cluster [ERR] 5.336 scrub 1850 errors
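As an aside, those [ERR] entries can be summarized with a quick grep to see which PGs and shards are reporting missing objects. This is just a sketch; the sample lines (taken from the output above) are embedded so the snippet is self-contained, but on a live system you would point it at /var/log/ceph/ceph.log on the mon instead.

```shell
# Sketch: count "missing object" reports per PG/shard in a ceph.log
# excerpt. Sample lines from the log above are embedded so the
# snippet runs as-is; replace with the real log path in practice.
log=$(mktemp)
cat > "$log" <<'EOF'
cluster [ERR] be_compare_scrubmaps: 5.336 shard 23 missing e300336/100000001b0.00002825/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 32600336/10000000109.00000754/head//5
be_compare_scrubmaps: 5.336 shard 23 missing dd700336/100000001ab.00000b91/head//5
EOF
# Extract the "PG shard N missing" prefix of each report and count them.
grep -o 'be_compare_scrubmaps: [^ ]* shard [0-9]* missing' "$log" \
  | sort | uniq -c
# → 3 be_compare_scrubmaps: 5.336 shard 23 missing
rm -f "$log"
```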
4. We then tried to restart the problematic OSD, but
that fails:
# /etc/init.d/ceph stop osd.23
=== osd.23 ===
Stopping Ceph osd.23 on osd3...done
[root@osd3 ~]# /etc/init.d/ceph start osd.23
=== osd.23 ===
create-or-move updated item name 'osd.23' weight 2.72 at
location {host=osd3,root=default} to crush map
Starting Ceph osd.23 on osd3...
starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23
/var/lib/ceph/osd/ceph-23/journal
# tail -f /var/log/ceph/ceph-osd.23.log
2015-08-24 11:48:12.189322 7fa24d85d800 0 ceph version 0.94.2
(5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd,
pid 7266
2015-08-24 11:48:12.389747 7fa24d85d800 0
filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic
0x58465342)
2015-08-24 11:48:12.391370 7fa24d85d800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-23)
detect_features: FIEMAP ioctl is supported and appears to work
2015-08-24 11:48:12.391381 7fa24d85d800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-23)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
config option
2015-08-24 11:48:12.404785 7fa24d85d800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-23)
detect_features: syscall(SYS_syncfs, fd) fully supported
2015-08-24 11:48:12.404874 7fa24d85d800 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features:
disabling extsize, kernel 2.6.32-504.16.2.el6.x86_64 is older
than 3.5 and has buggy extsize ioctl
2015-08-24 11:48:12.405226 7fa24d85d800 -1
filestore(/var/lib/ceph/osd/ceph-23) mount initial op seq is 0;
something is wrong
2015-08-24 11:48:12.405243 7fa24d85d800 -1 osd.23 0 OSD:init:
unable to mount object store
2015-08-24 11:48:12.405251 7fa24d85d800 -1 ERROR: osd init
failed: (22) Invalid argument
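The "initial op seq is 0" message makes sense here: FileStore records its last committed transaction sequence in current/commit_op_seq, which was deleted along with everything else under current/. A minimal check (the OSD path is the one from this example; adjust as needed):

```shell
# Hedged sketch: FileStore keeps the committed op sequence in
# current/commit_op_seq. If that file is gone, a restart fails with
# "mount initial op seq is 0; something is wrong".
osd_dir="${1:-/var/lib/ceph/osd/ceph-23}"   # example path, adjust
seq_file="$osd_dir/current/commit_op_seq"
if [ -f "$seq_file" ]; then
    echo "commit_op_seq: $(cat "$seq_file")"
else
    echo "commit_op_seq missing - current/ contents were lost"
fi
```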
5. At this point, osd.23 is reported as 'down' but 'in',
and Ceph finally realizes that there is an OSD down and that
there are degraded PGs:
# ceph osd dump
(...)
osd.23 down in weight 1 up_from 276 up_thru 1838 down_at 1904
last_clean_interval [112,208) X.X.X.X:6812/30826
10.100.1.169:6812/30826 10.100.1.169:6813/30826
X.X.X.X:6813/30826 exists 3010de97-3080-42e2-9a64-4bb1960d40b4
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
202 pgs degraded
1 pgs inconsistent
201 pgs stuck unclean
202 pgs undersized
recovery 155149/5664204 objects degraded (2.739%)
1850 scrub errors
1/32 in osds are down
monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1
up:standby-replay
osdmap e1905: 32 osds: 31 up, 32 in
pgmap v1041433: 2176 pgs, 2 pools, 4930 GB data, 1843
kobjects
14424 GB used, 74627 GB / 89051 GB avail
155149/5664204 objects degraded (2.739%)
1974 active+clean
201 active+undersized+degraded
1 active+undersized+degraded+inconsistent
client io 1023 B/s rd, 0 op/s
6. Recovery I/O starts and finishes, but the system
remains in an error state because of the inconsistent PG:
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1
up:standby-replay
osdmap e2097: 32 osds: 31 up, 31 in
pgmap v1043287: 2176 pgs, 2 pools, 4930 GB data, 1843
kobjects
14838 GB used, 71430 GB / 86269 GB avail
2172 active+clean
2 active+clean+replay
1 active+clean+scrubbing
1 active+clean+inconsistent
client io 1063 B/s rd, 2 op/s
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1850 scrub errors
pg 5.336 is active+clean+inconsistent, acting [13,2,22]
1850 scrub errors
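As a side note (a sketch, not something from this thread): the PG id can be pulled out of the 'ceph health detail' output and turned into the corresponding repair command automatically. The sample output below is embedded for illustration; on a real cluster you would pipe the live command in instead.

```shell
# Sketch: turn 'ceph health detail' output into 'ceph pg repair'
# commands. Sample output is embedded here; on a real cluster use
#   ceph health detail | awk '...'
cat <<'EOF' |
HEALTH_ERR 1 pgs inconsistent; 1850 scrub errors
pg 5.336 is active+clean+inconsistent, acting [13,2,22]
1850 scrub errors
EOF
awk '/^pg .*inconsistent/ { print "ceph pg repair " $2 }'
# → ceph pg repair 5.336
```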
7. The inconsistent PG is then repaired, and the system
returns to HEALTH_OK:
# ceph pg repair 5.336
instructing pg 5.336 on osd.13 to repair
$ tail -f /var/log/ceph/ceph.log
2015-08-24 12:30:49.583363 mon.0 X.X.X.X:6789/0 289961 : cluster
[INF] pgmap v1043322: 2176 pgs: 2173 active+clean, 1
active+clean+inconsistent, 2 active+clean+replay; 4930 GB data,
14833 GB used, 71435 GB / 86269 GB avail; 1023 B/s rd, 1 op/s
2015-08-24 12:36:20.894597 osd.13 192.231.127.168:6804/20104
2496 : cluster [INF] 5.336 repair starts
2015-08-24 12:39:27.105511 osd.13 192.231.127.168:6804/20104
2497 : cluster [INF] 5.336 repair ok, 0 fixed
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_OK
monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1
up:standby-replay
osdmap e2097: 32 osds: 31 up, 31 in
pgmap v1043543: 2176 pgs, 2 pools, 4930 GB data, 1843
kobjects
14828 GB used, 71440 GB / 86269 GB avail
2174 active+clean
2 active+clean+replay
client io 1023 B/s rd, 1 op/s
On 08/24/2015 07:49 PM, Jan Schermer wrote:
I'm not talking about IO happening, I'm talking about file descriptors staying open. If they weren't open you could umount it without the "-l".
Once you hit the OSD again all those open files will start working and if more need to be opened it will start looking for them...
Jan
On 24 Aug 2015, at 03:07, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
Hi Jan...
Thanks for the reply.
Yes, I did an 'umount -l', but I was sure that no I/O was happening at the time, so I was almost 100% sure there was no real inconsistency in terms of open files in the OS.
On 08/20/2015 07:31 PM, Jan Schermer wrote:
Just to clarify - you unmounted the filesystem with "umount -l"? That's almost never a good idea, and it puts the OSD in a very unusual situation where IO will actually work on the already-open files, but it can't open any new ones. I think this would be enough to confuse just about any piece of software.
Was journal on the filesystem or on a separate partition/device?
The journal is on the same disk, but in a different partition.
It's not the same as R/O filesystem (I hit that once and no such havoc happened), in my experience the OSD traps and exits when something like that happens.
It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that could be an equivalent to umount -l, more or less :-)
Will try that today and report back here.
Cheers
Goncalo
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937