Re: Brand new cluster -- pg is stuck inactive

On 10/15/2017 03:13 AM, Denes Dolhay wrote:

Hello,

Could you include the monitors and the osds in your clock skew test as well?

How did you create the osds? ceph-deploy osd create osd1:/dev/sdX osd2:/dev/sdY osd3:/dev/sdZ ?

Some log from one of the osds would be great!
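
For example, something along these lines would bump the osd logging and grab a recent chunk of it (the log path is the default one -- your [osd] section doesn't set "log file", so adjust if you log elsewhere):

ceph tell osd.0 injectargs '--debug-osd 10 --debug-ms 1'
tail -n 200 /var/log/ceph/ceph-osd.0.log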


Kind regards,

Denes.


On 10/14/2017 07:39 PM, dE wrote:
On 10/14/2017 08:18 PM, David Turner wrote:

What are the ownership permissions on your osd folders? Clock skew cares about partial seconds.

It isn't a networking issue because your cluster isn't stuck peering. I'm not sure whether the creating state happens on disk or in the cluster.
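
For example, something like this would show the ownership and fix it if it's off (path taken from your posted ceph.conf / osd log; adjust to your layout):

ls -ldn /srv/ceph/osd /srv/ceph/osd/osd_journal
chown -R ceph:ceph /srv/ceph/osd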


On Sat, Oct 14, 2017, 10:01 AM dE . <de.techno@xxxxxxxxx> wrote:
I attached 1TB disks to each osd.

cluster 8161c90e-dbd2-4491-acf8-74449bef916a
     health HEALTH_ERR
            clock skew detected on mon.1, mon.2

            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            too few PGs per OSD (21 < min 30)
            Monitor clock skew detected
     monmap e1: 3 mons at {0=10.247.103.139:8567/0,1=10.247.103.140:8567/0,2=10.247.103.141:8567/0}
            election epoch 12, quorum 0,1,2 0,1,2
     osdmap e10: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v38: 64 pgs, 1 pools, 0 bytes data, 0 objects
            33963 MB used, 3037 GB / 3070 GB avail
                  64 creating
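
As an aside, the "too few PGs per OSD (21 < min 30)" warning is separate from the stuck-creating problem. Once the PGs do create, it could be cleared by raising the pool's PG count; for a single pool on 3 OSDs something like the following would do (the numbers are only a suggestion):

ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128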

I don't seem to have any clock skew --
for i in {139..141}; do ssh $i date +%s; done
1507989554
1507989554
1507989554
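
date +%s only resolves whole seconds, though; a sub-second check, or asking the time daemon directly, would be more convincing (this assumes ntpd on the nodes -- use chronyc sources or timedatectl instead if that's what you run):

for i in {139..141}; do ssh $i date +%s.%N; done
for i in {139..141}; do ssh $i ntpq -p; done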


On Sat, Oct 14, 2017 at 6:41 PM, David Turner <drakonstein@xxxxxxxxx> wrote:

What is the output of your `ceph status`?


On Fri, Oct 13, 2017, 10:09 PM dE <de.techno@xxxxxxxxx> wrote:
On 10/14/2017 12:53 AM, David Turner wrote:
What does your environment look like?  Someone recently on the mailing list had PGs stuck creating because of a networking issue.

On Fri, Oct 13, 2017 at 2:03 PM Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:
strange that no osd is acting for your pg's
can you show the output from
ceph osd tree
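
In particular whether the three osds have a non-zero CRUSH weight and sit under host buckets -- an osd that was never added to the crush map won't be picked for any acting set. Alongside the tree, this also shows weights and utilisation (it has been available since Hammer):

ceph osd df tree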


mvh
Ronny Aasen



On 13.10.2017 18:53, dE wrote:
> Hi,
>
>     I'm running ceph 10.2.5 on Debian (official package).
>
> It cant seem to create any functional pools --
>
> ceph health detail
> HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs
> stuck inactive; too few PGs per OSD (21 < min 30)
> pg 0.39 is stuck inactive for 652.741684, current state creating, last
> acting []
> pg 0.38 is stuck inactive for 652.741688, current state creating, last
> acting []
> pg 0.37 is stuck inactive for 652.741690, current state creating, last
> acting []
> pg 0.36 is stuck inactive for 652.741692, current state creating, last
> acting []
> pg 0.35 is stuck inactive for 652.741694, current state creating, last
> acting []
> pg 0.34 is stuck inactive for 652.741696, current state creating, last
> acting []
> pg 0.33 is stuck inactive for 652.741698, current state creating, last
> acting []
> pg 0.32 is stuck inactive for 652.741701, current state creating, last
> acting []
> pg 0.3 is stuck inactive for 652.741762, current state creating, last
> acting []
> pg 0.2e is stuck inactive for 652.741715, current state creating, last
> acting []
> pg 0.2d is stuck inactive for 652.741719, current state creating, last
> acting []
> pg 0.2c is stuck inactive for 652.741721, current state creating, last
> acting []
> pg 0.2b is stuck inactive for 652.741723, current state creating, last
> acting []
> pg 0.2a is stuck inactive for 652.741725, current state creating, last
> acting []
> pg 0.29 is stuck inactive for 652.741727, current state creating, last
> acting []
> pg 0.28 is stuck inactive for 652.741730, current state creating, last
> acting []
> pg 0.27 is stuck inactive for 652.741732, current state creating, last
> acting []
> pg 0.26 is stuck inactive for 652.741734, current state creating, last
> acting []
> pg 0.3e is stuck inactive for 652.741707, current state creating, last
> acting []
> pg 0.f is stuck inactive for 652.741761, current state creating, last
> acting []
> pg 0.3f is stuck inactive for 652.741708, current state creating, last
> acting []
> pg 0.10 is stuck inactive for 652.741763, current state creating, last
> acting []
> pg 0.4 is stuck inactive for 652.741773, current state creating, last
> acting []
> pg 0.5 is stuck inactive for 652.741774, current state creating, last
> acting []
> pg 0.3a is stuck inactive for 652.741717, current state creating, last
> acting []
> pg 0.b is stuck inactive for 652.741771, current state creating, last
> acting []
> pg 0.c is stuck inactive for 652.741772, current state creating, last
> acting []
> pg 0.3b is stuck inactive for 652.741721, current state creating, last
> acting []
> pg 0.d is stuck inactive for 652.741774, current state creating, last
> acting []
> pg 0.3c is stuck inactive for 652.741722, current state creating, last
> acting []
> pg 0.e is stuck inactive for 652.741776, current state creating, last
> acting []
> pg 0.3d is stuck inactive for 652.741724, current state creating, last
> acting []
> pg 0.22 is stuck inactive for 652.741756, current state creating, last
> acting []
> pg 0.21 is stuck inactive for 652.741758, current state creating, last
> acting []
> pg 0.a is stuck inactive for 652.741783, current state creating, last
> acting []
> pg 0.20 is stuck inactive for 652.741761, current state creating, last
> acting []
> pg 0.9 is stuck inactive for 652.741787, current state creating, last
> acting []
> pg 0.1f is stuck inactive for 652.741764, current state creating, last
> acting []
> pg 0.8 is stuck inactive for 652.741790, current state creating, last
> acting []
> pg 0.7 is stuck inactive for 652.741792, current state creating, last
> acting []
> pg 0.6 is stuck inactive for 652.741794, current state creating, last
> acting []
> pg 0.1e is stuck inactive for 652.741770, current state creating, last
> acting []
> pg 0.1d is stuck inactive for 652.741772, current state creating, last
> acting []
> pg 0.1c is stuck inactive for 652.741774, current state creating, last
> acting []
> pg 0.1b is stuck inactive for 652.741777, current state creating, last
> acting []
> pg 0.1a is stuck inactive for 652.741784, current state creating, last
> acting []
> pg 0.2 is stuck inactive for 652.741812, current state creating, last
> acting []
> pg 0.31 is stuck inactive for 652.741762, current state creating, last
> acting []
> pg 0.19 is stuck inactive for 652.741789, current state creating, last
> acting []
> pg 0.11 is stuck inactive for 652.741797, current state creating, last
> acting []
> pg 0.18 is stuck inactive for 652.741793, current state creating, last
> acting []
> pg 0.1 is stuck inactive for 652.741820, current state creating, last
> acting []
> pg 0.30 is stuck inactive for 652.741769, current state creating, last
> acting []
> pg 0.17 is stuck inactive for 652.741797, current state creating, last
> acting []
> pg 0.0 is stuck inactive for 652.741829, current state creating, last
> acting []
> pg 0.2f is stuck inactive for 652.741774, current state creating, last
> acting []
> pg 0.16 is stuck inactive for 652.741802, current state creating, last
> acting []
> pg 0.12 is stuck inactive for 652.741807, current state creating, last
> acting []
> pg 0.13 is stuck inactive for 652.741807, current state creating, last
> acting []
> pg 0.14 is stuck inactive for 652.741807, current state creating, last
> acting []
> pg 0.15 is stuck inactive for 652.741808, current state creating, last
> acting []
> pg 0.23 is stuck inactive for 652.741792, current state creating, last
> acting []
> pg 0.24 is stuck inactive for 652.741793, current state creating, last
> acting []
> pg 0.25 is stuck inactive for 652.741793, current state creating, last
> acting []
>
> I got 3 OSDs --
>
> ceph osd stat
>      osdmap e8: 3 osds: 3 up, 3 in
>             flags sortbitwise,require_jewel_osds
>
> ceph osd pool ls detail
> pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool
> stripe_width 0
>
> The state inactive seems to be odd for a brand new pool with no data.
>
> This's my ceph.conf --
>
> [global]
> fsid = 8161c91e-dbd2-4491-adf8-74446bef916a
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> debug = 10/10
> mon host = 10.242.103.139:8567,10.242.103.140:8567,10.242.103.141:8567
> [mon]
> ms bind ipv6 = false
> mon data = ""> > mon addr = 0.0.0.0:8567
> mon warn on legacy crush tunables = true
> mon crush min required version = jewel
> mon initial members = 0,1,2
> keyring = /etc/ceph/mon_keyring
> log file = /var/log/ceph/mon.log
> [osd]
> osd data = ""> > osd journal = /srv/ceph/osd/osd_journal
> osd journal size = 10240
> osd recovery delay start = 10
> osd recovery thread timeout = 60
> osd recovery max active = 1
> osd recovery max chunk = 10485760
> osd max backfills = 2
> osd backfill retry interval = 60
> osd backfill scan min = 100
> osd backfill scan max = 1000
> keyring = /etc/ceph/osd_keyring
>
> The monitors run on the same host as osds.
>
> Any help will be appreciated highly!
>

These are VMs with a Linux bridge for connectivity.

VLANs have been created over teamed interfaces for the primary interface.

The osds show as up and in and there's a quorum, so it's not a connectivity issue.
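
For completeness, a quick reachability check against the addresses shown in ceph -s wouldn't hurt (mon port 8567 from the conf; osds listen on 6800-7300 by default; this assumes a netcat that understands -z):

for i in 139 140 141; do nc -zv 10.247.103.$i 8567; done
nc -zv 10.247.103.140 6800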


The osd folders are owned ceph:root. I tried ceph:ceph, and also ran ceph-osd as root.




The monitors and OSDs run on the same hosts.

The output of one of the OSDs (run directly in a terminal) --

ceph-osd -i 0 -f -d --setuser ceph --setgroup ceph   
starting osd.0 at :/0 osd_data /srv/ceph/osd /srv/ceph/osd/osd_journal
2017-10-15 09:03:20.234260 7f49bdb00900  0 set uid:gid to 64045:64045 (ceph:ceph)
2017-10-15 09:03:20.234269 7f49bdb00900  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 1068
2017-10-15 09:03:20.234636 7f49bdb00900  0 pidfile_write: ignore empty --pid-file
2017-10-15 09:03:20.247340 7f49bdb00900  0 filestore(/srv/ceph/osd) backend xfs (magic 0x58465342)
2017-10-15 09:03:20.247940 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-10-15 09:03:20.247959 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-10-15 09:03:20.247982 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: splice is supported
2017-10-15 09:03:20.248777 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-10-15 09:03:20.248820 7f49bdb00900  0 xfsfilestorebackend(/srv/ceph/osd) detect_feature: extsize is disabled by conf
2017-10-15 09:03:20.249386 7f49bdb00900  1 leveldb: Recovering log #5
2017-10-15 09:03:20.249420 7f49bdb00900  1 leveldb: Level-0 table #7: started
2017-10-15 09:03:20.250334 7f49bdb00900  1 leveldb: Level-0 table #7: 146 bytes OK
2017-10-15 09:03:20.252409 7f49bdb00900  1 leveldb: Delete type=0 #5

2017-10-15 09:03:20.252449 7f49bdb00900  1 leveldb: Delete type=3 #4

2017-10-15 09:03:20.252552 7f49bdb00900  0 filestore(/srv/ceph/osd) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-10-15 09:03:20.252708 7f49bdb00900 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-10-15 09:03:20.252714 7f49bdb00900  1 journal _open /srv/ceph/osd/osd_journal fd 17: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-10-15 09:03:20.253053 7f49bdb00900  1 journal _open /srv/ceph/osd/osd_journal fd 17: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-10-15 09:03:20.255212 7f49bdb00900  1 filestore(/srv/ceph/osd) upgrade
2017-10-15 09:03:20.258680 7f49bdb00900  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-10-15 09:03:20.259598 7f49bdb00900  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-10-15 09:03:20.327155 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128, adjusting msgr requires for clients
2017-10-15 09:03:20.327167 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128 was 8705, adjusting msgr requires for mons
2017-10-15 09:03:20.327171 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128, adjusting msgr requires for osds
2017-10-15 09:03:20.327199 7f49bdb00900  0 osd.0 0 load_pgs
2017-10-15 09:03:20.327210 7f49bdb00900  0 osd.0 0 load_pgs opened 0 pgs
2017-10-15 09:03:20.327216 7f49bdb00900  0 osd.0 0 using 0 op queue with priority op cut off at 64.
2017-10-15 09:03:20.331681 7f49bdb00900 -1 osd.0 0 log_to_monitors {default=true}
2017-10-15 09:03:20.339963 7f49bdb00900  0 osd.0 0 done with init, starting boot process
sh: 1: lsb_release: not found
2017-10-15 09:03:20.344114 7f49a25d3700 -1 lsb_release_parse - pclose failed: (13) Permission denied
2017-10-15 09:03:20.420408 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696, adjusting msgr requires for clients
2017-10-15 09:03:20.420587 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696 was 2199057080833, adjusting msgr requires for mons
2017-10-15 09:03:20.420596 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696, adjusting msgr requires for osds
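
Nothing in that log looks broken by itself -- the filestore mounts, the daemon finishes init, and "load_pgs opened 0 pgs" just means no PGs are mapped to this osd yet, which again points at placement rather than the osd. The "lsb_release: not found" line is harmless version-metadata collection; on Debian it goes away with:

apt-get install lsb-release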

The cluster was created from scratch. Steps for creating OSDs --

ceph osd crush tunables jewel

ceph osd create f0960666-ad75-11e7-abc4-cec278b6b50a 0
ceph osd create 0e6295bc-adab-11e7-abc4-cec278b6b50a 1
ceph osd create 0e629828-adab-11e7-abc4-cec278b6b50a 2

ceph-osd -i 0/1/2 --mkfs --osd-uuid f0960666-ad75-11e7-abc4-cec278b6b50a/0e6295bc-adab-11e7-abc4-cec278b6b50a/0e629828-adab-11e7-abc4-cec278b6b50a -f -d (for each OSD)

chown -R ceph /srv/ceph/osd/

Ceph was started with --

ceph-osd -i 0/1/2 -f -d --setuser ceph --setgroup ceph
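
(Looking back at these steps, nothing ever gives the osds a CRUSH location -- and starting ceph-osd by hand skips the packaged prestart script that would normally run a crush create-or-move via "osd crush update on start". If ceph osd tree shows them with weight 0 or outside any host bucket, adding them manually should let the PGs pick an acting set; a rough sketch, with placeholder host names and weights:)

ceph osd crush add-bucket node139 host
ceph osd crush move node139 root=default
ceph osd crush add osd.0 1.0 host=node139
# repeat for osd.1 and osd.2 on their hosts, then re-check with ceph osd tree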

I've skipped the authentication part since the same problem occurs without cephx (set to none).

In the meantime, Luminous works great with the same setup.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
