Nothing about the cluster has changed recently -- no OS
patches, no Ceph patches, no software updates of any kind.
In the months the cluster has been operational, we have had no
performance-related issues. In the days leading up to the major
performance issue we're now experiencing, the logs recorded 100 or
so 'slow request' events of >30 seconds on successive days. After
that, the slow requests became constant, and our logs are now
spammed with entries like the following:
2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 :
    cluster [WRN] 2 slow requests, 1 included below; oldest blocked
    for > 60.024165 secs
2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 :
    cluster [WRN] slow request 60.024165 seconds old, received at
    2015-11-28 02:29:07.304113: osd_op(client.214858.0:6990585
    default.184914.126_2d29cad4962d3ac08bb7c3153188d23f
    [create 0~0 [excl],setxattr user.rgw.idtag (22),writefull
    0~523488,setxattr user.rgw.manifest (444),setxattr user.rgw.acl
    (371),setxattr user.rgw.content_type (1),setxattr
    user.rgw.etag (33)] 48.158d9795 ondisk+write+known_if_redirected
    e15933) currently commit_sent
We've analyzed the logs on the monitor nodes (ceph.log and
ceph-mon.<id>.log), and there doesn't appear to be a
smoking gun. The 'slow request' events are spread fairly
evenly across all 648 OSDs.
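(For what it's worth, the "spread fairly evenly" observation comes
from a rough tally of warnings per reporting OSD in the cluster log
on one of the mons, something along these lines -- the log path and
exact line format may differ on other setups, so adjust as needed:

  grep 'slow request' /var/log/ceph/ceph.log \
    | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn | head -20

The counts come out roughly uniform across the 648 OSDs rather than
concentrated on a handful of them.)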
A 'ceph health detail' typically shows something like the
following:
HEALTH_WARN 41 requests are blocked > 32 sec; 14 osds have
slow requests
3 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
1 ops are blocked > 65.536 sec on osd.83
1 ops are blocked > 65.536 sec on osd.92
4 ops are blocked > 32.768 sec on osd.117
1 ops are blocked > 32.768 sec on osd.159
2 ops are blocked > 32.768 sec on osd.186
1 ops are blocked > 32.768 sec on osd.205
10 ops are blocked > 32.768 sec on osd.245
1 ops are blocked > 65.536 sec on osd.265
1 ops are blocked > 32.768 sec on osd.393
2 ops are blocked > 32.768 sec on osd.415
10 ops are blocked > 32.768 sec on osd.436
1 ops are blocked > 32.768 sec on osd.467
5 ops are blocked > 32.768 sec on osd.505
1 ops are blocked > 32.768 sec on osd.619
14 osds have slow requests
We have rarely seen requests eclipse the 120s warning
threshold. The vast majority show > 30 seconds, with a few
running longer than 60 seconds. The cluster periodically returns
to HEALTH_OK, especially under light or no load.
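If per-op detail would help, I can also dump in-flight and recent
slow ops from the admin socket on any of the OSDs flagged above,
e.g. (osd.245 chosen arbitrarily from the current list, run on the
host that carries it):

  ceph daemon osd.245 dump_ops_in_flight
  ceph daemon osd.245 dump_historic_ops

Happy to post output from those if it would be useful.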
At this point, we've pushed about 42TB into the cluster, so
we're still under 1.5% utilization. The performance degradation
came on suddenly, is severe, and has persisted for several days
now. I am looking for any
guidance on how to further diagnose or resolve the issue. I
have reviewed several similar threads on this list, but the
proposed solutions were either not applicable to our situation
or did not work.
Please let me know what other information I can provide or
what I can do to gather additional information.
Many thanks,
Brian