Hope nobody ever does that.
Anyway, that's good to know in case of disaster recovery. Thank you!
On Tue, Aug 25, 2015 at 12:10 PM, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
Hi Shinobu
Human mistake, for example :-) Not very frequent, but it happens.
Nevertheless, the idea is to test Ceph against different DC scenarios, triggered by different problems.
In this particular situation, the cluster recovered OK once the problematic OSD daemon was tagged as 'down' and 'out'.
Cheers
Goncalo
On 08/25/2015 01:06 PM, Shinobu wrote:
So what is the situation where you need to do:
# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)
I'm quite sure that is not normal.
Shinobu
On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
Hi Jan...
We were interested in the situation where an rm -Rf is done in the current directory of the OSD. Here are my findings:
1. In this exercise, we simply deleted all the content of /var/lib/ceph/osd/ceph-23/current.
# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)
/dev/sdj1 2918054776 434548 2917620228 1% /var/lib/ceph/osd/ceph-23
2. After some time, Ceph enters an error state because it detects an inconsistent PG and several scrub errors:
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
osdmap e1903: 32 osds: 32 up, 32 in
pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
14424 GB used, 74627 GB / 89051 GB avail
2175 active+clean
1 active+clean+inconsistent
client io 989 B/s rd, 1 op/s
3. Looking at ceph.log on the mon, it is possible to check which PG is affected and which OSD is responsible for the error:
# tail -f /var/log/ceph/ceph.log
(...)
2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: 5.336 shard 23 missing e300336/100000001b0.00002825/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 32600336/10000000109.00000754/head//5
be_compare_scrubmaps: 5.336 shard 23 missing dd700336/100000001ab.00000b91/head//5
be_compare_scrubmaps: 5.336 shard 23 missing bc220336/100000001bd.0000387c/head//5
be_compare_scrubmaps: 5.336 shard 23 missing f9320336/10000000201.00002e96/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/10000000228.0000d501/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/100000001bc.00003e06/head//5
be_compare_scrubmaps: 5.336 shard 23 missing cd20336/10000000227.00004775/head//5
be_compare_scrubmaps: 5.336 shard 23 missing cef20336/100000001b9.00002260/head//5
be_compare_scrubmaps: 5.336 shard 23 missing ba240336/100000001d8.00000630/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/100000001b1.00002089/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e840336/100000001ba.00002618/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/100000000e9.00000287/head//5
be_compare_scrubmaps: 5.336 shard 23 missing b7950336/100000000e4.00000800/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 94560336/100000001b4.00002834/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 71370336/10000000051.00000179/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 62370336/100000001b5.00003b5b/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e9670336/10000000120.000003f8/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/1000000019a.00000d4b/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 11880336/100000001e8.000003e9/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/10000000083.00000255/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 97790336/100000001e7.00000668/head//5
be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/100000001b6.0000278c/head//5
be_compare_scrubmaps: 5.336 shard 23 missing 4eda0336/1000000019e.000036ad/head//5
(...)
2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects
2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR] 5.336 scrub 1850 errors
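If you want to pull the affected PG and error count out of ceph.log programmatically rather than by eye, a sed pipeline like this works against the summary line above. This is only a sketch: the field layout is an assumption from this particular log format, and a real script would read the live ceph.log instead of a shell variable.

```shell
# Sample summary line, abbreviated from the log excerpt above.
line='2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR] 5.336 scrub 1850 errors'

# Extract the PG id that follows the [ERR] tag, and the error count.
pg=$(echo "$line" | sed -n 's/.*\[ERR\] \([0-9a-f.]*\) scrub.*/\1/p')
errors=$(echo "$line" | sed -n 's/.*scrub \([0-9]*\) errors.*/\1/p')
echo "affected pg: $pg ($errors scrub errors)"
```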
4. We then tried to restart the problematic OSD, but that failed:
# /etc/init.d/ceph stop osd.23
=== osd.23 ===
Stopping Ceph osd.23 on osd3...done
[root@osd3 ~]# /etc/init.d/ceph start osd.23
=== osd.23 ===
create-or-move updated item name 'osd.23' weight 2.72 at location {host=osd3,root=default} to crush map
Starting Ceph osd.23 on osd3...
starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23 /var/lib/ceph/osd/ceph-23/journal
# tail -f /var/log/ceph/ceph-osd.23.log
2015-08-24 11:48:12.189322 7fa24d85d800 0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 7266
2015-08-24 11:48:12.389747 7fa24d85d800 0 filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic 0x58465342)
2015-08-24 11:48:12.391370 7fa24d85d800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-24 11:48:12.391381 7fa24d85d800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-24 11:48:12.404785 7fa24d85d800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-08-24 11:48:12.404874 7fa24d85d800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: disabling extsize, kernel 2.6.32-504.16.2.el6.x86_64 is older than 3.5 and has buggy extsize ioctl
2015-08-24 11:48:12.405226 7fa24d85d800 -1 filestore(/var/lib/ceph/osd/ceph-23) mount initial op seq is 0; something is wrong
2015-08-24 11:48:12.405243 7fa24d85d800 -1 osd.23 0 OSD:init: unable to mount object store
2015-08-24 11:48:12.405251 7fa24d85d800 -1 ERROR: osd init failed: (22) Invalid argument
5. At this point, osd.23 is reported as 'down' but 'in'; Ceph finally understands that an OSD is down and that there are degraded PGs:
# ceph osd dump
(...)
osd.23 down in weight 1 up_from 276 up_thru 1838 down_at 1904 last_clean_interval [112,208) X.X.X.X:6812/30826 10.100.1.169:6812/30826 10.100.1.169:6813/30826 X.X.X.X:6813/30826 exists 3010de97-3080-42e2-9a64-4bb1960d40b4
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
202 pgs degraded
1 pgs inconsistent
201 pgs stuck unclean
202 pgs undersized
recovery 155149/5664204 objects degraded (2.739%)
1850 scrub errors
1/32 in osds are down
monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
osdmap e1905: 32 osds: 31 up, 32 in
pgmap v1041433: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
14424 GB used, 74627 GB / 89051 GB avail
155149/5664204 objects degraded (2.739%)
1974 active+clean
201 active+undersized+degraded
1 active+undersized+degraded+inconsistent
client io 1023 B/s rd, 0 op/s
6. Recovery I/O starts and finishes, but the system remains in an error state because of the inconsistent PG:
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
osdmap e2097: 32 osds: 31 up, 31 in
pgmap v1043287: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
14838 GB used, 71430 GB / 86269 GB avail
2172 active+clean
2 active+clean+replay
1 active+clean+scrubbing
1 active+clean+inconsistent
client io 1063 B/s rd, 2 op/s
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1850 scrub errors
pg 5.336 is active+clean+inconsistent, acting [13,2,22]
1850 scrub errors
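For scripting the repair, the PG id and its primary OSD can be parsed out of the health detail line. A minimal sketch, assuming the exact output layout shown above ("acting [primary,...]"):

```shell
# Sample line from 'ceph health detail' as shown above.
detail='pg 5.336 is active+clean+inconsistent, acting [13,2,22]'

# Second field is the PG id; the first entry in the acting set is the primary.
pg=$(echo "$detail" | awk '{print $2}')
primary=$(echo "$detail" | sed -n 's/.*acting \[\([0-9]*\),.*/\1/p')
echo "would run: ceph pg repair $pg (primary is osd.$primary)"
```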
7. The inconsistent PG is then repaired, and the system returns to a 'HEALTH_OK' status:
# ceph pg repair 5.336
instructing pg 5.336 on osd.13 to repair
# tail -f /var/log/ceph/ceph.log
2015-08-24 12:30:49.583363 mon.0 X.X.X.X:6789/0 289961 : cluster [INF] pgmap v1043322: 2176 pgs: 2173 active+clean, 1 active+clean+inconsistent, 2 active+clean+replay; 4930 GB data, 14833 GB used, 71435 GB / 86269 GB avail; 1023 B/s rd, 1 op/s
2015-08-24 12:36:20.894597 osd.13 192.231.127.168:6804/20104 2496 : cluster [INF] 5.336 repair starts
2015-08-24 12:39:27.105511 osd.13 192.231.127.168:6804/20104 2497 : cluster [INF] 5.336 repair ok, 0 fixed
# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
health HEALTH_OK
monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
osdmap e2097: 32 osds: 31 up, 31 in
pgmap v1043543: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
14828 GB used, 71440 GB / 86269 GB avail
2174 active+clean
2 active+clean+replay
client io 1023 B/s rd, 1 op/s
On 08/24/2015 07:49 PM, Jan Schermer wrote:
I'm not talking about IO happening, I'm talking about file descriptors staying open. If they weren't open you could umount it without the "-l". Once you hit the OSD again all those open files will start working, and if more need to be opened it will start looking for them...

Jan

On 24 Aug 2015, at 03:07, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:

Hi Jan...

Thanks for the reply.

On 08/20/2015 07:31 PM, Jan Schermer wrote:

> Just to clarify - you unmounted the filesystem with "umount -l"? That is almost never a good idea, and it puts the OSD in a very unusual situation where IO will actually work on the open files, but it can't open any new ones. I think this would be enough to confuse just about any piece of software.

Yes, I did an 'umount -l', but I was sure that no I/O was happening at the time. So I was almost 100% sure that there was no real incoherence in terms of open files in the OS.

> Was the journal on the filesystem or on a separate partition/device?

The journal is on the same disk, but in a different partition.

> It's not the same as an R/O filesystem (I hit that once and no such havoc happened); in my experience the OSD traps and exits when something like that happens. It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that could be an equivalent to umount -l, more or less :-)

Will try that today and report back here.

Cheers
Goncalo
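Jan's point about descriptors is easy to see on any Linux box: while a process holds a file open, the descriptor stays visible under /proc and keeps the filesystem busy, which is why a plain umount refuses where 'umount -l' succeeds. A minimal sketch (Linux-specific; the file names are made up for illustration):

```shell
# Create a scratch directory and hold a descriptor open in it,
# the way a running OSD holds its object files open.
tmpdir=$(mktemp -d)
exec 3>"$tmpdir/held-open"

# The open descriptor is visible under /proc for this shell's pid;
# a filesystem containing it could not be unmounted without -l.
count=$(ls -l /proc/$$/fd | grep -c "held-open")
echo "open descriptors into the dir: $count"

# Close the descriptor and clean up; only now would a plain umount succeed.
exec 3>&-
rm -r "$tmpdir"
```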
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com