In the end this came down to one slow OSD. There were no hardware
issues, so I have to assume something got gummed up during rebalancing
and peering.
I restarted the OSD process after setting the noout flag on the cluster.
Once the OSD came back up, the rebalance completed and the cluster
returned to HEALTH_OK. As soon as the OSD restarted, all previously
hanging operations returned to normal.
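For the record, the restart amounted to roughly the following (osd.53
and the upstart-style service command are illustrative; adjust for your
init system):

  sudo ceph --id nova osd set noout      # keep CRUSH from marking the osd out during the restart
  sudo restart ceph-osd id=53            # upstart syntax; e.g. "service ceph restart osd.53" under sysvinit
  sudo ceph --id nova -w                 # watch recovery until the cluster returns to HEALTH_OK
  sudo ceph --id nova osd unset noout    # re-enable normal out-marking afterwards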
I'm surprised that a single slow OSD could impact access to the entire
cluster. I understand now that only the primary OSD serves reads and
that writes must go to the primary and then the secondaries, but I would
have expected the impact to be more contained.
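(As an aside, one way to confirm which OSD is primary for a given RBD
object is something like the command below, using an object name taken
from the slow request log further down; the primary is the first entry
in the acting set:)

  sudo ceph --id nova osd map rbd rb.0.d204.238e1f29.0000000844d2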
We currently build XFS file systems directly on RBD images. I'm
wondering whether there would be any value in using an LVM abstraction
on top to spread access across other OSDs for read and failure scenarios.
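Roughly what I have in mind is striping a logical volume across several
mapped RBD images so that I/O fans out over more primary OSDs; a purely
hypothetical sketch (device names, image names and stripe parameters
are made up, and the /dev/rbdN numbering depends on map order):

  sudo rbd map rbd/lvchunk0; sudo rbd map rbd/lvchunk1; sudo rbd map rbd/lvchunk2; sudo rbd map rbd/lvchunk3
  sudo pvcreate /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
  sudo vgcreate vg_rstore /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
  sudo lvcreate -i 4 -I 4096 -l 100%FREE -n lv_export vg_rstore   # 4 stripes, 4MB stripe size
  sudo mkfs.xfs /dev/vg_rstore/lv_export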
Any thoughts on the above appreciated.
~jpr
On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
To follow up on the original post,
Further digging indicates this is a problem with RBD image access
and is not related to NFS-RBD interaction as initially suspected.
The nfsd is simply hanging as a result of a hung request to the
XFS file system mounted on our RBD-NFS gateway. This hung XFS
call is caused by a problem with the RBD module interacting with
our Ceph pool.
I've found a reliable way to trigger a hang directly on an rbd
image mapped into our RBD-NFS gateway box. The image contains an
XFS file system. When I try to list the contents of a particular
directory, the request hangs indefinitely.
Two weeks ago our ceph status was:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
   pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up
The near full osd was number 53, and we updated our crush map to
reweight the osd. All of the OSDs had a weight of 1, based on the
assumption that all OSDs were 2.0TB. Apparently one of our servers
had its OSDs sized at 2.8TB, and this caused the OSD imbalance
even though we are only at 50% utilization. We reweighted the near
full osd to 0.8, and that initiated a rebalance that has since
relieved the 95% full condition on that OSD.
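(The reweight itself boils down to a one-liner; we did it via a crush
map update, but the equivalent command would be something like:)

  sudo ceph --id nova osd crush reweight osd.53 0.8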
However, since that time the re-peering has not completed, and we
suspect this is causing problems with our access to RBD images.
Our current ceph status is:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
   pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up
Here are further details on our stuck pgs:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering 2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686
5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:22.425105 0'0 2015-05-08 10:19:54.938934
4.105 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.801028 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:04.434826 0'0 2015-05-14 18:43:04.434826
3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering 2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563
The servers in the pool are not overloaded. On the ceph server
that originally had the nearly full osd (osd 53), I'm seeing
entries like this in the osd log:
2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request 1096430.805069 seconds old, received at 2015-05-15 13:51:12.094956: osd_op(client.153458.0:274955541 rb.0.d204.238e1f29.0000000844d2 [sparse-read 3706880~4096] 3.773a03af) v4 currently reached pg
2015-05-28 06:25:02.900156 7f2ea8a4f700 0 log [WRN] : slow request 1095097.235862 seconds old, received at 2015-05-15 14:13:25.664163: osd_op(client.154124.1:68230513 rb.0.176f3.238e1f29.000000002cd8 [write 0~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900162 7f2ea8a4f700 0 log [WRN] : slow request 1095097.214366 seconds old, received at 2015-05-15 14:13:25.685659: osd_op(client.154124.1:68230514 rb.0.176f3.238e1f29.000000002cd8 [write 524288~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900169 7f2ea8a4f700 0 log [WRN] : slow request 1095097.191725 seconds old, received at 2015-05-15 14:13:25.708300: osd_op(client.154124.1:68230515 rb.0.176f3.238e1f29.000000002cd8 [write 1048576~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900175 7f2ea8a4f700 0 log [WRN] : slow request 1095097.169399 seconds old, received at 2015-05-15 14:13:25.730626: osd_op(client.154124.1:68230516
...
These happen once per second, and the only thing that changes across
the whole batch of messages is that the "seconds old" value increases.
According to dmesg, syslog, and the BMC, there are no current hardware
malfunctions on this machine.
I'm assuming that whatever is causing the slow requests in the osd
log might be gumming up the works.
What I don't get is how to get my pg out of the stuck inactive
state. The troubleshooting hints in the docs don't seem to apply in this case.
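Things I can still try in order to gather more detail (illustrative;
the admin socket path is the default one, and the daemon command has to
run on the host holding osd.53):

  sudo ceph --id nova pg 3.3af query                                            # shows what the pg is waiting on while peering
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.53.asok dump_ops_in_flight    # lists the requests currently blocked on that osd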
Finally, here is the health detail; it mostly repeats the above:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova osd lspools
3 rbd,4 data,5 metadata,
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova health detail
HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
pg 3.3af is stuck inactive for 1132739.583540, current state remapped+peering, last acting [53,48,62]
pg 3.106 is stuck unclean for 1131876.351202, current state active, last acting [62,53]
pg 5.104 is stuck unclean for 1131876.350931, current state active, last acting [62,53]
pg 4.105 is stuck unclean for 1131876.351729, current state active, last acting [62,53]
pg 3.3af is stuck unclean for 1132739.583697, current state remapped+peering, last acting [53,48,62]
pg 3.3af is remapped+peering, acting [53,48,62]
recovery 9/23842120 degraded (0.000%)
I'd like to get the cluster back to HEALTH_OK to see if that's
contributing to the problem.
Thanks for any pointers.
~jpr
On 05/24/2015 11:02 PM, Christian Balzer wrote:
Hello,
Let's compare your case with John-Paul's.
Different OS and Ceph versions (thus we can assume different NFS versions
as well).
The only common thing is that both of you added OSDs and are likely
suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
Ceph logs will only pipe up when things have been blocked for more than 30
seconds; NFS might take offense at lower values (or at the accumulation of
several distributed delays).
You added 23 OSDs; tell us more about your cluster, HW, and network.
Were these added to the existing 16 nodes, or are they on new storage nodes
(so could there be something different about those nodes?)? How busy are your
network and CPUs?
Running something like collectd to gather all ceph perf data and other
data from the storage nodes and then feeding it to graphite (or similar)
can be VERY helpful to identify if something is going wrong and what it is
in particular.
Otherwise run atop on your storage nodes to identify whether CPU, network,
or specific HDDs/OSDs are bottlenecks.
Deep scrubbing can be _very_ taxing. Do your problems persist if you inject
an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that
until it hurts again), or if you turn off deep scrubs altogether for the
moment?
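For example, something along these lines (firefly-era syntax; the
nodeep-scrub flag is the blunter alternative):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # pause between scrub work; lower/raise to taste
  ceph osd set nodeep-scrub                            # stop new deep scrubs; "ceph osd unset nodeep-scrub" re-enables them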
Christian
On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
We see something very similar on our Ceph cluster, starting as of today.
We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
OpenStack cluster (we applied the RBD patches for live migration etc).
On this cluster we have a big ownCloud installation (Sync & Share) that
stores its files on three NFS servers, each mounting six 2TB RBD volumes
and exposing them to around 10 web server VMs (we originally started
with one NFS server with a 100TB volume, but that had become unwieldy).
All of the servers (hypervisors, ceph storage nodes and VMs) are using
Ubuntu 14.04
Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
rebalancing process ended this morning (after around 12 hours). The
cluster has been clean since then:
cluster b1f3f4c8-xxxxx
 health HEALTH_OK
 monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
 osdmap e43476: 125 osds: 125 up, 125 in
 pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
       266 TB used, 187 TB / 454 TB avail
       3319 active+clean
       17 active+clean+scrubbing+deep
 client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
At midnight, we run a script that creates an RBD snapshot of all RBD
volumes that are attached to the NFS servers (for backup purposes).
Looking at our monitoring, around that time one of the NFS servers
became unresponsive and took down the complete ownCloud installation
(load on the web servers was > 200 and they had lost some of the NFS
mounts).
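(For what it's worth, the snapshot job is nothing fancy; a minimal
sketch of what it does, with the pool name and snapshot naming made up,
and the real script limiting itself to the images behind the NFS servers:)

  #!/bin/bash
  # nightly RBD snapshots of the volumes mounted on the NFS servers
  POOL=volumes                               # assumed pool name
  for IMG in $(rbd -p "$POOL" ls); do
      rbd snap create "$POOL/$IMG@nightly-$(date +%Y%m%d)"
  done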
Rebooting the NFS server solved that problem, but the NFS kernel server
kept crashing all day long after having run for between 10 and 90 minutes.
We initially suspected a corrupt RBD volume (as it seemed that we could
trigger the kernel crash by just running "ls -l" on one of the volumes),
but subsequent "xfs_repair -n" checks on those RBD volumes showed no
problems.
We migrated the NFS server off its hypervisor, suspecting a problem
with the RBD kernel modules, and rebooted the hypervisor, but the problem
persisted (both on the new hypervisor, and on the old one when we
migrated it back).
We changed /etc/default/nfs-kernel-server to start 256 server threads
(even though the defaults had been working fine for over a year).
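(Concretely, something like this in /etc/default/nfs-kernel-server on
Ubuntu 14.04:)

  RPCNFSDCOUNT=256   # number of nfsd threads started at boot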
Only one of our 3 NFS servers crashes (see below for syslog information);
the other 2 have been fine.
May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [<ffffffff8108b6b2>] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [<ffffffff81733868>] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
Before each crash, we see the disk utilization of one or two random
mounted RBD volumes go to 100%; there is no pattern as to which of the
RBD disks starts to act up.
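(We see this in our monitoring graphs; watching it live is just
something like the following, with the %util column pegging at 100 on
one or two rbd devices shortly before the hang:)

  iostat -x 5 | grep rbd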
We have scoured the log files of the Ceph cluster for any signs of
problems but came up empty.
The NFS server has almost no load (compared to regular usage) as most
sync clients are either turned off (weekend) or have given up connecting
to the server.
There hadn't been any configuration changes on the NFS servers prior to
the problems. The only change was the addition of the 23 OSDs.
We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
Our team is completely out of ideas. We have removed the 100TB volume
from the nfs server (we used the downtime to migrate the last data off
of it to one of the smaller volumes). The NFS server has been running
for 30 minutes now (with close to no load) but we don’t really expect it
to make it until tomorrow.
send help
Jens-Christian