It appears that if the client or the OpenStack Cinder service sits in the same network as Ceph, it works; from the OpenStack network it fails, but only on this particular pool!
It was working well before the upgrade and no changes have been made on the network side.
Very strange issue. I checked the Ceph release notes for network-related changes but found nothing relevant.
Only the biggest pool is affected: same pool config, same hosts, ACLs all open, no iptables, ...
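(For reference, reachability from the hypervisor to the OSDs serving this pool can be checked roughly like this; it is only a sketch, where the pool name, object name and OSD address/port are placeholders. The ceph commands run on a monitor, nc on the hypervisor:)

ceph osd map <rd-pool> some-object     # shows which PG and acting OSDs an object of that pool maps to
ceph osd find 29                       # prints the network address of a given OSD, e.g. osd.29 from the log below
nc -zv 134.158.208.37 6884             # from the hypervisor: test TCP reachability to that OSD address/port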
Anything else to check?
We are thinking about adding a VNIC to all Ceph and OpenStack hosts so that they sit in the same subnet.
Adrien
On 03/07/2019 at 13:46, Adrien Georget wrote:
Hi,
With --debug-objecter=20, I found that the rados ls command hangs, looping on laggy messages:
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit op 0x7efc3800dc10
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target epoch 13146 base @3 precalc_pgid 1 pgid 3.100 is_read
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target target @3 -> pgid 3.100
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _calc_target raw pgid 3.100 -> actual 3.100 acting [29,12,55] primary 29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit oid '@3' '@3' [pgnls start_epoch 13146] tid 11 osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _session_op_assign 29 11
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _send_op 11 to 3.100 on osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter put_session s=0x7efc380024c0 osd=29 4
2019-07-03 13:33:24.913 7efc402f5700 5 client.21363886.objecter 1 in flight
2019-07-03 13:33:29.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:34.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:44.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:49.679 7efc3e2f1700 10 client.21363886.objecter tick
...
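(The command was invoked roughly as follows; the pool name and keyring path are placeholders, and --debug-ms=1 is only an optional extra to also see the messenger layer:)

rados -p <rd-pool> ls \
      --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring \
      --debug-objecter=20 --debug-ms=1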
I tried disabling this OSD, but the problem just moves to another OSD, and so on.
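(A sketch of what taking a single OSD out looks like, using osd.29 from the log above as an example; the exact commands may differ from what was done here:)

ceph osd out 29     # take osd.29 out of data placement so its PGs remap to other OSDs
                    # ... retest rados ls from the hypervisor ...
ceph osd in 29      # put it back in afterwards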
The Ceph client packages are up to date; all rbd commands still work from a monitor, but not from the OpenStack controllers.
And the other Ceph pool, on the same OSD hosts but on different disks, works perfectly with OpenStack...
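(For comparison, the settings of the two pools can be dumped side by side; the pool names here are placeholders:)

ceph osd pool ls detail                  # overview of all pools: size, pg_num, flags, tiering
ceph osd pool get <service-pool> all     # every setting of the working pool
ceph osd pool get <rd-pool> all          # every setting of the affected pool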
The issue looks like these old ones, but they seem to have been fixed years ago: https://tracker.ceph.com/issues/2454 and https://tracker.ceph.com/issues/8515
Is there anything more I can check?
Adrien
On 02/07/2019 at 14:10, Adrien Georget wrote:
Hi Eugen,
The cinder keyring used by the two pools is the same; the rbd command works with this keyring and the ceph.conf used by OpenStack, while the rados ls command stays stuck.
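(Concretely, something along these lines works for rbd but hangs for rados; the pool name and keyring path are placeholders:)

rbd ls -p <rd-pool> --id cinder -c /etc/ceph/ceph.conf \
    --keyring /etc/ceph/ceph.client.cinder.keyring      # returns the volume list
rados -p <rd-pool> ls --id cinder -c /etc/ceph/ceph.conf \
    --keyring /etc/ceph/ceph.client.cinder.keyring      # hangs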
I tried with the previously used ceph-common version (10.2.5) and with the latest Ceph version (14.2.1).
With the Nautilus ceph-common version, both cinder-volume services crashed...
Adrien
On 02/07/2019 at 13:50, Eugen Block wrote:
Hi,
Did you try to use the rbd and rados commands with the cinder keyring rather than the admin keyring? Did you check whether the caps for that client are still valid (do the caps differ between the two cinder pools)?
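(For example, assuming the client is named client.cinder; adjust the name to your setup:)

ceph auth get client.cinder    # prints the key plus the mon/osd caps granted to that client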
Are the Ceph versions on your hypervisors also Nautilus?
Regards,
Eugen
Quoting Adrien Georget <adrien.georget@xxxxxxxxxxx>:
Hi all,
I'm facing a very strange issue after migrating my Luminous
cluster to Nautilus.
I have 2 pools configured for OpenStack Cinder volumes in a multi-backend setup: one "service" Ceph pool with cache tiering and one "R&D" Ceph pool.
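(For reference, the multi-backend part of cinder.conf looks roughly like the sketch below; the backend names, pool names and secret UUID are placeholders rather than the real values:)

[DEFAULT]
enabled_backends = ceph-service,ceph-rd

[ceph-service]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph-service
rbd_pool = <service-pool>
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_secret_uuid = <libvirt-secret-uuid>

[ceph-rd]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph-rd
rbd_pool = <rd-pool>
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_secret_uuid = <libvirt-secret-uuid>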
After the upgrade, the R&D pool became inaccessible to Cinder, and the cinder-volume service using this pool can no longer start.
What is strange is that OpenStack and Ceph report no errors: the Ceph cluster is healthy, all OSDs are up and running, and the "service" pool still works fine with the other cinder-volume service on the same OpenStack host.
I followed the upgrade procedure exactly (https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); there was no problem during the upgrade, but I can't understand why Cinder still fails with this pool.
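(To rule out a half-finished upgrade, the cluster state can be checked from a monitor along these lines; a sketch:)

ceph versions                              # every daemon should report 14.2.1
ceph osd dump | grep require_osd_release   # should show nautilus after the final upgrade step
ceph features                              # shows which feature releases the connected clients expose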
I can access, list and create volumes on this pool with the rbd and rados commands from the monitors, but on the OpenStack hypervisor the rbd and rados ls commands stay stuck, and rados ls gives this message (134.158.208.37 is an OSD node, 10.158.246.214 an OpenStack hypervisor):
2019-07-02 11:26:15.999869 7f63484b4700 0 -- 10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault
Ceph version 14.2.1
OpenStack Newton
I spent 2 days checking everything on the Ceph side but couldn't find anything problematic...
If you have any hints that could help, I would appreciate it :)
Adrien
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com