Re: EC pools grinding to a screeching halt on Luminous

On 12/31/18 4:51 AM, Marcus Murwall wrote:
What you say does make sense, though, as I also get the feeling that the OSDs are just waiting for something. Something that never happens, until the requests finally time out...

So the OSDs are just completely idle? If not, try using strace and/or perf to get some insights into what they're doing.
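
For example, against one of the OSDs that has blocked requests (the PID below is a placeholder):

perf top -p <ceph-osd pid>
strace -c -f -p <ceph-osd pid>    # interrupt with Ctrl-C after ~30s for a syscall summary

If the OSD really is idle, you should mostly see its threads parked in futex/epoll waits.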

Maybe someone with better knowledge of EC internals will suggest something. In the meantime, you might want to look at the client side. Could the client be somehow saturated or blocked on something? (If the clients aren't blocked, you can use 'perf' or Mark's profiler [1] to profile them.)

Try benchmarking with an iodepth of 1 and slowly increase it until you run into the issue, all while monitoring your resources. You might find what causes the tipping point. Are you able to reproduce this using fio? Maybe this is just a client issue...
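
With rados bench you can approximate that ramp by stepping up the number of concurrent ops; the pool name and runtime below are just examples:

rados bench -p bench-ec-hdd 60 write -b 4M -t 1
rados bench -p bench-ec-hdd 60 write -b 4M -t 2
rados bench -p bench-ec-hdd 60 write -b 4M -t 4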

Sorry for suggesting a bunch of things that are all over the place; I'm just trying to understand the state of the cluster (and clients). Are both the OSDs and the clients completely blocked, making no progress?

Let us know what you find.

Mohamad

[1] https://github.com/markhpc/gdbpmp/
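
(If you try gdbpmp, the idea is to attach it to a stuck OSD, collect some samples and then print the call graph. Roughly like the following, though the flag names here are from memory, so double-check with gdbpmp.py --help; debug symbols for ceph-osd need to be installed.)

./gdbpmp.py -p <ceph-osd pid> -n 100 -o osd.gdbpmp
./gdbpmp.py -i osd.gdbpmp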


I will have one of our network guys take a look and get a second pair of eyes on it as well, just to make sure I'm not missing anything.

Thanks for your help so far, Mohamad; I really appreciate it. If you have some more ideas/suggestions on where to look, please let us know.

I wish you all a happy new year.

Regards
Marcus

28 December 2018 at 16:10
Hi Marcus,

On 12/27/18 4:21 PM, Marcus Murwall wrote:
Hey Mohamad

I work with Florian on this issue.
Just reinstalled the ceph cluster and triggered the error again.
Looking at iostat -x 1 there is basically no activity at all against any of the OSDs.
We get blocked ops all over the place, but here is some output from one of the OSDs that had blocked requests: http://paste.openstack.org/show/738721/

Looking at the historic_slow_ops, the step in the pipeline that takes the most time is sub_op_applied -> commit_sent. I couldn't say exactly what these steps are from a high level view, but looking at the code, commit_sent indicates that a message has been sent to the OSD's client over the network. Can you look for network congestion (the fact that there's nothing happening on the disks points in that direction too)? Something like iftop might help. Is there anything suspicious in the logs?
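
A few quick checks for that (interface name is a placeholder):

iftop -i <cluster-facing interface> -P    # per-connection bandwidth between the OSD hosts
sar -n DEV 1                              # per-NIC throughput over time
netstat -s | grep -i retrans              # growing retransmit counters point at the network
ip -s link show <interface>               # drops/errors on the NIC itself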

Also, do you get the same throughput when benchmarking the replicated pool as when benchmarking the EC pool?

Mohamad



Regards
Marcus

26 December 2018 at 18:27
What is happening on the individual nodes when you reach that point
(iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
benchmarking the replicated pool?

I guess one way to start would be by looking at ongoing operations at
the OSD level:

ceph daemon osd.X dump_blocked_ops
ceph daemon osd.X dump_ops_in_flight
ceph daemon osd.X dump_historic_slow_ops

(See "ceph daemon osd.X help" for more commands.)

The first command shows currently blocked operations, the second shows
the operations currently in flight, and the last shows recent slow
operations. You can follow the flow of
individual operations, and you might find that the slow operations are
all associated with the same few PGs, or that they're spending too much
time waiting on something.
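
One way to check whether the slow ops cluster around a few PGs is to
pull the PG id out of each op description and count them. A rough
sketch, assuming jq is available and the usual JSON layout with an
"ops" array whose descriptions start with the client followed by the PG:

ceph daemon osd.X dump_historic_slow_ops | \
  jq -r '.ops[].description' | awk '{print $2}' | sort | uniq -c | sort -rn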

Hope that helps.

Mohamad


26 December 2018 at 11:20
Hi everyone,

We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
distro build).

The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
per node, of which 20 are SAS spinners and 2 are NVMe devices. Cluster
has been deployed with ceph-ansible stable-3.1; we're using
"objectstore: bluestore" and "osd_scenario: collocated".

We're using a "class hdd" replicated CRUSH ruleset for all our pools,
except:

- the bucket index pool, which uses a replicated "class nvme" rule, and
- the bucket data pool, which is erasure coded (crush-device-class=hdd,
crush-failure-domain=host, k=3, m=2).

We have also created 3 pools so that we can run benchmarks while
leaving the other pools untouched:

- bench-repl-hdd, replicated, size 3, using a CRUSH rule with "step take
default class hdd"
- bench-repl-nvme, replicated, size 3, using a CRUSH rule with "step
take default class nvme"
- bench-ec-hdd, EC, crush-device-class=hdd, crush-failure-domain=host,
k=3, m=2.
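
For completeness, that kind of profile and pool can be created roughly
like so (the profile name and PG counts here are placeholders, not
necessarily what we used):

ceph osd erasure-code-profile set ec-3-2-hdd k=3 m=2 \
  crush-failure-domain=host crush-device-class=hdd
ceph osd pool create bench-ec-hdd 256 256 erasure ec-3-2-hdd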

Baseline benchmarks with "ceph tell osd.* bench" at the default block
size of 4M yield pretty much exactly the throughput you'd expect from the
devices: approx. 185 MB/s from the SAS drives; the NVMe devices
currently pull only 650 MB/s on writes but that may well be due to
pending conditioning — this is new hardware.
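
(For the record, that baseline is simply:

ceph tell osd.* bench                       # default 4M block size
ceph tell osd.12 bench 1073741824 4194304   # or explicitly: total bytes, block size

with osd.12 and the byte counts above being arbitrary examples.)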

Now when we run "rados bench" against the replicated pools, we again get
exactly what we expect for a nominally performing but largely untuned
system.
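
The invocations are plain rados bench writes, along the lines of
(60-second runtime shown just as an example):

rados bench -p bench-repl-hdd 60 write -b 4M --no-cleanup
rados bench -p bench-ec-hdd 60 write -b 4M --no-cleanup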

It's when we try running benchmarks against the EC pool that everything
appears to grind to a halt:

http://paste.openstack.org/show/738187/

After 19 seconds, that pool does not accept a single further object. We
simultaneously see slow request warnings creep up in the cluster, and
the only thing we can then do is kill the benchmark and wait for the
slow requests to clear out.

We've also seen the log messages discussed in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
and they seem to correlate with the slow requests popping up, but from
Greg's reply in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028974.html
I'm assuming that that's benign and doesn't warrant further investigation.

Here are a few things we've tried, to no avail:

- Make sure we use the latest Luminous release (we started out on Bionic
and 12.2.7, then reinstalled systems with Xenial so we could use 12.2.10).
- Enable Bluestore buffered writes (bluestore_default_buffered_write =
true); buffered reads are on by default.
- Extend the BlueStore cache from 1G to 4G (bluestore_cache_size_hdd =
4294967296; each OSD box has 128G of RAM, so we should not run into
memory starvation issues with that).
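
Concretely, that was something along these lines in ceph.conf on the
OSD nodes:

[osd]
bluestore_default_buffered_write = true
bluestore_cache_size_hdd = 4294967296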

But those were basically "let's give this a shot and see if it makes a
difference" attempts (it didn't).

I'm basically looking for ideas on where to even start looking, so if
anyone can point us in the right direction, that would be excellent.
Thanks in advance for any help you can offer; it is much appreciated!

Cheers,
Florian



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
