Re: Global, Synchronous Blocked Requests

Daniel Maraio <dmaraio@xxxxxxxxxx> · Fri, 27 Nov 2015 22:52:00 -0500



    Hello,

    
      Can you provide some further details. What are the size of your
    objects, how many objects do you have in your buckets. Are you using
    bucket index sharding, are you sharding your objects over multiple
    buckets? Is the cluster doing any scrubbing during these periods? It
    sounds like you may be having trouble with your rgw bucket index. In
    our cluster, much smaller than yours mind you, it was necessary to
    put the rgw bucket index onto it's own set of osds to isolate it
    from the rest of the cluster IO. We are still using single object
    bucket indexes but have a plan to move to sharded bucket index
    eventually. 

    
      You should determine what OSDs your bucket indexes are located on
    and see if a pattern emerges with the OSDs have have slow requests
    during this periods. You can use the command ' ceph pg ls-by-pool
    .rgw.buckets.index ' to show what pgs/osds the bucket index resides
    on.

    
    - Daniel

    
    On 11/27/2015 10:24 PM, Brian Felton
      wrote:

    
        Greetings Ceph Community,

          
          We are running a Hammer cluster (0.94.3-1) in production that
          recently experienced asymptotic performance degradation. 
          We've been migrating data from an older non-Ceph cluster at a
          fairly steady pace for the past eight weeks (about 5TB a
          week).  Overnight, the ingress rate dropped by 95%.  Upon
          investigation, we found we were receiving hundreds of
          thousands of 'slow request' warnings.  

          
          The cluster is being used as an S3-compliant object storage
          solution.  What has been extremely problematic is that all
          cluster writes are being blocked simultaneously.  When
          something goes wrong, we've observed our transfer jobs (6-8
          jobs, running across 4 servers) all simultaneously block on
          writes for 10-60 seconds, then release and continue
          simultaneously.  The blocks occur very frequently (at least
          once a minute after the previous block has cleared).  

          
          Our setup is as follows:

          
           - 5 monitor nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3,
          kernel 3.13.0-48)

           - 2 RGW nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3, kernel
          3.13.0-48)

           - 9 Storage nodes (Supermicro server: 32 CPU, 256GB RAM,
          Ubuntu 14.04.3, kernel 3.13.0-46)

          
          Each storage server contains 72 6TB SATA drives for Ceph (648
          OSDs, ~3.5PB in total).  Each disk is set up as its own ZFS
          zpool.  Each OSD has a 10GB journal, located within the disk's
          zpool.

          
          Other information that might be pertinent:

           - All servers (and VMs) use NTP to sync clocks.

           - The cluster uses k=7, m=2 erasure coding.

           - Each storage server has 6 10Gbps ports, with 2 bonded for
          front-end traffic and 4 bonded for back-end traffic.

           - Ingress and egress traffic is typically a few MB/sec tops,
          and we've stress tested it at levels at least 100x what we
          normally see

           - We have pushed a few hundred TB into the cluster during
          burn-in without issue  

          
          Given the global nature of the failure, we initially suspected
          networking issues.  After a solid day of investigation, we
          were unable to find any reason to suspect the network (no
          dropped packets on FE or BE networks, no ping loss, no switch
          issues, reasonable iperf tests, etc.).  We next examined the
          storage nodes, but we found no failures of any kind (nothing
          in system/kernel logs, no ZFS errors, iostat/atop/etc. all
          normal, etc.).  

          
          We've also attempted the following, with no success:

           - Rolling restart of the storage nodes

           - Rolling restart of the mon nodes

           - Complete shutdown/restart of all mon nodes

           - Expansion of RGW capacity from 2 servers to 5

        
         - Uncontrollable sobbing

        
          Nothing about the cluster has changed recently -- no OS
          patches, no Ceph patches, no software updates of any kind. 
          For the months we've had the cluster operational, we've had no
          performance-related issues.  In the days leading up to the
          major performance issue we're now experiencing, the logs did
          record 100 or so 'slow request' events of >30 seconds on
          subsequent days.  After that, the slow requests became a
          constant, and now our logs are spammed with entries like the
          following:

          
          2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576
          1115 : cluster [WRN] 2 slow requests, 1 included below; oldest
          blocked for > 60.024165 secs

          2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576
          1116 : cluster [WRN] slow request 60.024165 seconds old,
          received at 2015-11-28 02:29:07.304113:
          osd_op(client.214858.0:6990585
          default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create
          0~0 [excl],setxattr user.rgw.idtag (22),writefull
          0~523488,setxattr user.rgw.manifest (444),setxattr
          user.rgw.acl (371),setxattr user.rgw.content_type (1),setxattr
          user.rgw.etag (33)] 48.158d9795
          ondisk+write+known_if_redirected e15933) currently commit_sent

          
          We've analyzed the logs on the monitor nodes (ceph.log and
          ceph-mon.<id>.log), and there doesn't appear to be a
          smoking gun.  The 'slow request' events are spread fairly
          evenly across all 648 OSDs.  

          
          A 'ceph health detail' typically shows something like the
          following:

          
          HEALTH_WARN 41 requests are blocked > 32 sec; 14 osds have
          slow requests

          3 ops are blocked > 65.536 sec

          38 ops are blocked > 32.768 sec

          1 ops are blocked > 65.536 sec on osd.83

          1 ops are blocked > 65.536 sec on osd.92

          4 ops are blocked > 32.768 sec on osd.117

          1 ops are blocked > 32.768 sec on osd.159

          2 ops are blocked > 32.768 sec on osd.186

          1 ops are blocked > 32.768 sec on osd.205

          10 ops are blocked > 32.768 sec on osd.245

          1 ops are blocked > 65.536 sec on osd.265

          1 ops are blocked > 32.768 sec on osd.393

          2 ops are blocked > 32.768 sec on osd.415

          10 ops are blocked > 32.768 sec on osd.436

          1 ops are blocked > 32.768 sec on osd.467

          5 ops are blocked > 32.768 sec on osd.505

          1 ops are blocked > 32.768 sec on osd.619

          14 osds have slow requests

          
          We have rarely seen requests eclipse the 120s warning
          threshold.  The vast majority show > 30 seconds, with a few
          running longer than 60 seconds.  The cluster will return to a
          HEALTH_OK status periodically, especially when under light/no
          load.  

          
          At this point, we've pushed about 42TB into the cluster, so
          we're still under 1.5% utilization.  The performance
          degradation we are experiencing was immediate, severe, and has
          been ongoing for several days now.  I am looking for any
          guidance on how to further diagnose or resolve the issue.  I
          have reviewed several similar threads on this list, but the
          proposed solutions were either not applicable to our situation
          or did not work.  

          
          Please let me know what other information I can provide or
          what I can do to gather additional information.

          
          Many thanks,

          
          Brian

          
      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

    
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com