Re: Global, Synchronous Blocked Requests

On Fri, Nov 27, 2015 at 10:24 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Greetings Ceph Community,
>
> We are running a Hammer cluster (0.94.3-1) in production that recently
> experienced a sudden, severe performance degradation.  We've been migrating data
> from an older non-Ceph cluster at a fairly steady pace for the past eight
> weeks (about 5TB a week).  Overnight, the ingress rate dropped by 95%.  Upon
> investigation, we found we were receiving hundreds of thousands of 'slow
> request' warnings.
>
> The cluster is being used as an S3-compliant object storage solution.  What
> has been extremely problematic is that all cluster writes are being blocked
> simultaneously.  When something goes wrong, we've observed our transfer jobs
> (6-8 jobs, running across 4 servers) all simultaneously block on writes for
> 10-60 seconds, then release and continue simultaneously.  The blocks occur
> very frequently (at least once a minute after the previous block has
> cleared).
>
> Our setup is as follows:
>
>  - 5 monitor nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3, kernel 3.13.0-48)
>  - 2 RGW nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3, kernel 3.13.0-48)
>  - 9 Storage nodes (Supermicro server: 32 CPU, 256GB RAM, Ubuntu 14.04.3,
> kernel 3.13.0-46)
>
> Each storage server contains 72 6TB SATA drives for Ceph (648 OSDs, ~3.5PB
> in total).  Each disk is set up as its own ZFS zpool.  Each OSD has a 10GB
> journal, located within the disk's zpool.
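
With the journal living on the same zpool as the data, every client write
hits each disk twice (once for the journal write, once for the filestore
apply), so the per-OSD journal and apply latencies are worth watching
directly while a stall is in progress. A rough starting point, assuming
the default admin socket path on the storage nodes (osd.116 is just the
one from the slow-request log below; counter names can vary a bit between
releases):

  ceph osd perf
  ceph --admin-daemon /var/run/ceph/ceph-osd.116.asok perf dump | grep -A 3 -E 'journal_latency|apply_latency|commitcycle'

If apply latency spikes while commit latency stays flat, that points at
the filesystem sync path rather than the journal device.
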
>
> Other information that might be pertinent:
>  - All servers (and VMs) use NTP to sync clocks.
>  - The cluster uses k=7, m=2 erasure coding.
>  - Each storage server has 6 10Gbps ports, with 2 bonded for front-end
> traffic and 4 bonded for back-end traffic.
>  - Ingress and egress traffic is typically a few MB/sec tops, and we've
> stress tested it at levels at least 100x what we normally see
>  - We have pushed a few hundred TB into the cluster during burn-in without
> issue
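
One thing worth keeping in mind with k=7, m=2: every object write fans
out to 9 shards (7 data + 2 coding), and the primary can't acknowledge
the write until all 9 are durable, so a single slow disk in an acting
set stalls that entire write. As a rough worked number, using the
523488-byte writefull from the log below:

  523488 bytes / 7 data chunks = 74784 bytes (~75 KB) per chunk
  9 shards x ~75 KB = ~673 KB written, roughly doubled again by journaling

The amplification itself is modest, but it means one laggy OSD out of
648 can surface as slow requests on writes that have nothing else in
common.
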
>
> Given the global nature of the failure, we initially suspected networking
> issues.  After a solid day of investigation, we were unable to find any
> reason to suspect the network (no dropped packets on FE or BE networks, no
> ping loss, no switch issues, reasonable iperf tests, etc.).  We next
> examined the storage nodes, but we found no failures of any kind (nothing in
> system/kernel logs, no ZFS errors, iostat/atop/etc. all normal, etc.).
>
> We've also attempted the following, with no success:
>  - Rolling restart of the storage nodes
>  - Rolling restart of the mon nodes
>  - Complete shutdown/restart of all mon nodes
>  - Expansion of RGW capacity from 2 servers to 5
>  - Uncontrollable sobbing
>
> Nothing about the cluster has changed recently -- no OS patches, no Ceph
> patches, no software updates of any kind.  For the months we've had the
> cluster operational, we've had no performance-related issues.  In the days
> leading up to the major performance issue we're now experiencing, the logs
> did record 100 or so 'slow request' events of >30 seconds on subsequent
> days.  After that, the slow requests became a constant, and now our logs are
> spammed with entries like the following:
>
> 2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 : cluster
> [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.024165 secs
> 2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 : cluster
> [WRN] slow request 60.024165 seconds old, received at 2015-11-28
> 02:29:07.304113: osd_op(client.214858.0:6990585
> default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create 0~0
> [excl],setxattr user.rgw.idtag (22),writefull 0~523488,setxattr
> user.rgw.manifest (444),setxattr user.rgw.acl (371),setxattr
> user.rgw.content_type (1),setxattr user.rgw.etag (33)] 48.158d9795
> ondisk+write+known_if_redirected e15933) currently commit_sent

In this op the OSD has already sent a commit back to the client, so the
only remaining step is applying the update to the local filesystem. The
only thing that can block that is a local fs sync in progress, or too
much dirty data waiting to be flushed.
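
If you can catch one of these on a live OSD, the admin socket will show
which stage the op is actually stuck in (osd.116 here is just the one
from your log; this assumes the default socket path):

  ceph --admin-daemon /var/run/ceph/ceph-osd.116.asok dump_ops_in_flight
  ceph --admin-daemon /var/run/ceph/ceph-osd.116.asok dump_historic_ops

dump_historic_ops keeps per-event timestamps for recently completed slow
ops (reached_pg, commit_queued_for_journal_write, op_commit, commit_sent,
op_applied, done, and so on), so a consistently large gap between
commit_sent and op_applied would point at the apply/sync side rather
than the journal or the network.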

It's conceivable, though unlikely, that this warning just happened to
catch an op in the short window between being committed to the journal
and being applied, and that you happened to pick that one out... but I
see you're using ZFS, which is not a common filesystem under Ceph. I
suspect you've run into something there, but I can't help much since
I've never run ZFS in any capacity and am not sure how the OSD makes
use of it. (Is it using the snapshot-based journal mode? If so, I'd
start there. Otherwise, the usual ZFS performance troubleshooting is
the place to look.)
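
For the ZFS side, a few generic ZFS-on-Linux starting points (these are
suggestions rather than a recipe, and nothing here is Ceph-specific):

  zpool iostat -v 1
  zfs get sync,logbias <pool>
  cat /sys/module/zfs/parameters/zfs_txg_timeout

The first shows per-vdev behaviour while a stall is actually happening;
sync=always would force every journal write through the ZIL; and a long
txg commit can stall writers for its duration. It would also be worth
comparing those against the OSD's filestore sync settings (for example
filestore_max_sync_interval, visible via 'config show' on the admin
socket) to see whether OSD syncs and ZFS txg commits are piling up on
top of each other.
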
-Greg

>
> We've analyzed the logs on the monitor nodes (ceph.log and
> ceph-mon.<id>.log), and there doesn't appear to be a smoking gun.  The 'slow
> request' events are spread fairly evenly across all 648 OSDs.
>
> A 'ceph health detail' typically shows something like the following:
>
> HEALTH_WARN 41 requests are blocked > 32 sec; 14 osds have slow requests
> 3 ops are blocked > 65.536 sec
> 38 ops are blocked > 32.768 sec
> 1 ops are blocked > 65.536 sec on osd.83
> 1 ops are blocked > 65.536 sec on osd.92
> 4 ops are blocked > 32.768 sec on osd.117
> 1 ops are blocked > 32.768 sec on osd.159
> 2 ops are blocked > 32.768 sec on osd.186
> 1 ops are blocked > 32.768 sec on osd.205
> 10 ops are blocked > 32.768 sec on osd.245
> 1 ops are blocked > 65.536 sec on osd.265
> 1 ops are blocked > 32.768 sec on osd.393
> 2 ops are blocked > 32.768 sec on osd.415
> 10 ops are blocked > 32.768 sec on osd.436
> 1 ops are blocked > 32.768 sec on osd.467
> 5 ops are blocked > 32.768 sec on osd.505
> 1 ops are blocked > 32.768 sec on osd.619
> 14 osds have slow requests
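
Since the slow requests move around rather than sticking to a few OSDs,
it may be quicker to map the currently implicated ones back to hosts and
see whether they cluster on one box or one backend path. Something along
these lines, with the OSD ids taken from the health detail above:

  for o in 83 92 117 159 186 205 245 265 393 415 436 467 505 619; do
    ceph osd find $o | grep -E '"host"|"ip"'
  done

If the same one or two hosts keep showing up across repeated samples,
that narrows things considerably; if they really are spread uniformly,
that points back at something shared (mon/RGW side, or common sync
behaviour on the OSDs).
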
>
> We have rarely seen requests eclipse the 120s warning threshold.  The vast
> majority show > 30 seconds, with a few running longer than 60 seconds.  The
> cluster will return to a HEALTH_OK status periodically, especially when
> under light/no load.
>
> At this point, we've pushed about 42TB into the cluster, so we're still
> under 1.5% utilization.  The performance degradation we are experiencing was
> immediate, severe, and has been ongoing for several days now.  I am looking
> for any guidance on how to further diagnose or resolve the issue.  I have
> reviewed several similar threads on this list, but the proposed solutions
> were either not applicable to our situation or did not work.
>
> Please let me know what other information I can provide or what I can do to
> gather additional information.
>
> Many thanks,
>
> Brian
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



