In the end this came down to one slow OSD. There were no hardware
issues, so I have to assume something got gummed up during rebalancing
and peering.
I restarted the OSD process after setting the noout flag on the cluster.
Once the OSD came back up, the rebalance completed and the cluster
returned to HEALTH_OK. As soon as the OSD restarted, all previously
hanging operations returned to normal.
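For the record, the restart amounted to roughly the following (osd.53
and the upstart-style service command are illustrative; adjust for your
init system):

  sudo ceph --id nova osd set noout      # keep CRUSH from marking the osd out during the restart
  sudo restart ceph-osd id=53            # upstart syntax; e.g. "service ceph restart osd.53" under sysvinit
  sudo ceph --id nova -w                 # watch recovery until the cluster returns to HEALTH_OK
  sudo ceph --id nova osd unset noout    # re-enable normal out-marking afterwards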
I'm surprised that a single slow OSD could impact access to the entire
cluster. I understand now that only the primary OSD serves reads and
that writes must go to the primary and then the secondaries, but I would
have expected the impact to be more contained.
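(As an aside, one way to confirm which OSD is primary for a given RBD
object is something like the command below, using an object name taken
from the slow request log further down; the primary is the first entry
in the acting set:)

  sudo ceph --id nova osd map rbd rb.0.d204.238e1f29.0000000844d2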
We currently build XFS file systems directly on RBD images. I'm
wondering whether there would be any value in using an LVM abstraction
on top to spread access across other OSDs for read and failure scenarios.
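Roughly what I have in mind is striping a logical volume across several
mapped RBD images so that I/O fans out over more primary OSDs; a purely
hypothetical sketch (device names, image names and stripe parameters
are made up, and the /dev/rbdN numbering depends on map order):

  sudo rbd map rbd/lvchunk0; sudo rbd map rbd/lvchunk1; sudo rbd map rbd/lvchunk2; sudo rbd map rbd/lvchunk3
  sudo pvcreate /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
  sudo vgcreate vg_rstore /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
  sudo lvcreate -i 4 -I 4096 -l 100%FREE -n lv_export vg_rstore   # 4 stripes, 4MB stripe size
  sudo mkfs.xfs /dev/vg_rstore/lv_export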
Any thoughts on the above appreciated.
~jpr
On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
To follow up on the original post,
Further digging indicates this is a problem with RBD image access
and is not related to NFS-RBD interaction as initially suspected.
The nfsd is simply hanging as a result of a hung request to the
XFS file system mounted on our RBD-NFS gateway. This hung XFS
call is caused by a problem with the RBD module interacting with
our Ceph pool.
I've found a reliable way to trigger a hang directly on an rbd
image mapped into our RBD-NFS gateway box. The image contains an
XFS file system. When I try to list the contents of a particular
directory, the request hangs indefinitely.
Two weeks ago our ceph status was:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
   pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up
The near full osd was number 53, and we updated our crush map to
reweight the osd. All of the OSDs had a weight of 1, based on the
assumption that all OSDs were 2.0TB. Apparently one of our servers
had its OSDs sized at 2.8TB, and this caused the OSD imbalance
even though we are only at 50% utilization. We reweighted the near
full osd to 0.8, and that initiated a rebalance that has since
relieved the 95% full condition on that OSD.
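(The reweight itself boils down to a one-liner; we did it via a crush
map update, but the equivalent command would be something like:)

  sudo ceph --id nova osd crush reweight osd.53 0.8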
However, since that time the re-peering has not completed, and we
suspect this is causing problems with our access to RBD images.
Our current ceph status is:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
   pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up
Here are further details on our stuck pgs:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering 2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686
5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:22.425105 0'0 2015-05-08 10:19:54.938934
4.105 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.801028 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:04.434826 0'0 2015-05-14 18:43:04.434826
3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering 2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563
The servers in the pool are not overloaded. On the ceph server
that originally had the nearly full osd (osd 53), I'm seeing
entries like this in the osd log:
2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request 1096430.805069 seconds old, received at 2015-05-15 13:51:12.094956: osd_op(client.153458.0:274955541 rb.0.d204.238e1f29.0000000844d2 [sparse-read 3706880~4096] 3.773a03af) v4 currently reached pg
2015-05-28 06:25:02.900156 7f2ea8a4f700 0 log [WRN] : slow request 1095097.235862 seconds old, received at 2015-05-15 14:13:25.664163: osd_op(client.154124.1:68230513 rb.0.176f3.238e1f29.000000002cd8 [write 0~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900162 7f2ea8a4f700 0 log [WRN] : slow request 1095097.214366 seconds old, received at 2015-05-15 14:13:25.685659: osd_op(client.154124.1:68230514 rb.0.176f3.238e1f29.000000002cd8 [write 524288~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900169 7f2ea8a4f700 0 log [WRN] : slow request 1095097.191725 seconds old, received at 2015-05-15 14:13:25.708300: osd_op(client.154124.1:68230515 rb.0.176f3.238e1f29.000000002cd8 [write 1048576~524288] 3.4329faf) v4 currently reached pg
2015-05-28 06:25:02.900175 7f2ea8a4f700 0 log [WRN] : slow request 1095097.169399 seconds old, received at 2015-05-15 14:13:25.730626: osd_op(client.154124.1:68230516
...
These happen once per second, and the only thing that changes across
the whole batch of messages is that the "seconds old" value increases.
According to dmesg, syslog, and the BMC, there are no current hardware
malfunctions on this machine.
I'm assuming that whatever is causing the slow requests in the osd
log might be gumming up the works.
What I don't get is how to get my pg out of the stuck inactive
state. The troubleshooting hints in the docs don't seem to apply in this case.
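Things I can still try in order to gather more detail (illustrative;
the admin socket path is the default one, and the daemon command has to
run on the host holding osd.53):

  sudo ceph --id nova pg 3.3af query                                            # shows what the pg is waiting on while peering
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.53.asok dump_ops_in_flight    # lists the requests currently blocked on that osd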
Finally, here is the health detail; it mostly repeats the above:
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova osd lspools
3 rbd,4 data,5 metadata,
jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova health detail
HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
pg 3.3af is stuck inactive for 1132739.583540, current state remapped+peering, last acting [53,48,62]
pg 3.106 is stuck unclean for 1131876.351202, current state active, last acting [62,53]
pg 5.104 is stuck unclean for 1131876.350931, current state active, last acting [62,53]
pg 4.105 is stuck unclean for 1131876.351729, current state active, last acting [62,53]
pg 3.3af is stuck unclean for 1132739.583697, current state remapped+peering, last acting [53,48,62]
pg 3.3af is remapped+peering, acting [53,48,62]
recovery 9/23842120 degraded (0.000%)
I'd like to get the cluster back to HEALTH_OK to see if that's
contributing to the problem.
Thanks for any pointers.
~jpr
On 05/24/2015 11:02 PM, Christian Balzer wrote:
Hello,
Let's compare your case with John-Paul's.
Different OS and Ceph versions (thus we can assume different NFS versions
as well).
The only common thing is that both of you added OSDs and are likely
suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
Ceph logs will only pipe up when things have been blocked for more than 30
seconds; NFS might take offense at lower values (or at the accumulation of
several distributed delays).
You added 23 OSDs; tell us more about your cluster, HW, and network.
Were these added to the existing 16 nodes, or are they on new storage nodes
(so could there be something different about those nodes?)? How busy are your
network and CPUs?
Running something like collectd to gather all ceph perf data and other
data from the storage nodes and then feeding it to graphite (or similar)
can be VERY helpful to identify if something is going wrong and what it is
in particular.
Otherwise run atop on your storage nodes to identify whether CPU, network,
or specific HDDs/OSDs are bottlenecks.
Deep scrubbing can be _very_ taxing. Do your problems persist if you inject
an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that
until it hurts again), or if you turn off deep scrubs altogether for the
moment?
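For example, something along these lines (firefly-era syntax; the
nodeep-scrub flag is the blunter alternative):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # pause between scrub work; lower/raise to taste
  ceph osd set nodeep-scrub                            # stop new deep scrubs; "ceph osd unset nodeep-scrub" re-enables them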
Christian
On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
We see something very similar on our Ceph cluster, starting as of today.
We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
OpenStack cluster (we applied the RBD patches for live migration etc).
On this cluster we have a big ownCloud installation (Sync & Share) that
stores its files on three NFS servers, each mounting six 2TB RBD volumes
and exposing them to around 10 web server VMs (we originally started
with one NFS server with a 100TB volume, but that had become unwieldy).
All of the servers (hypervisors, ceph storage nodes and VMs) are using
Ubuntu 14.04
Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
rebalancing process ended this morning (after around 12 hours). The
cluster has been clean since then:
cluster b1f3f4c8-xxxxx
 health HEALTH_OK
 monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
 osdmap e43476: 125 osds: 125 up, 125 in
 pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
       266 TB used, 187 TB / 454 TB avail
       3319 active+clean
       17 active+clean+scrubbing+deep
 client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
At midnight, we run a script that creates an RBD snapshot of all RBD
volumes that are attached to the NFS servers (for backup purposes).
Looking at our monitoring, around that time one of the NFS servers
became unresponsive and took down the complete ownCloud installation
(load on the web servers was > 200 and they had lost some of the NFS
mounts).
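(For what it's worth, the snapshot job is nothing fancy; a minimal
sketch of what it does, with the pool name and snapshot naming made up,
and the real script limiting itself to the images behind the NFS servers:)

  #!/bin/bash
  # nightly RBD snapshots of the volumes mounted on the NFS servers
  POOL=volumes                               # assumed pool name
  for IMG in $(rbd -p "$POOL" ls); do
      rbd snap create "$POOL/$IMG@nightly-$(date +%Y%m%d)"
  done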
Rebooting the NFS server solved that problem, but the NFS kernel server
kept crashing all day long after having run for between 10 and 90 minutes.
We initially suspected a corrupt RBD volume (as it seemed that we could
trigger the kernel crash by just running "ls -l" on one of the volumes),
but subsequent "xfs_repair -n" checks on those RBD volumes showed no
problems.
We migrated the NFS server off its hypervisor, suspecting a problem
with the RBD kernel modules, and rebooted the hypervisor, but the problem
persisted (both on the new hypervisor, and on the old one when we
migrated it back).
We changed /etc/default/nfs-kernel-server to start 256 server threads
(even though the defaults had been working fine for over a year).
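(Concretely, something like this in /etc/default/nfs-kernel-server on
Ubuntu 14.04:)

  RPCNFSDCOUNT=256   # number of nfsd threads started at boot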
Only one of our 3 NFS servers crashes (see below for syslog information);
the other 2 have been fine.
May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [<ffffffff8108b6b2>] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [<ffffffff81733868>] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
Before each crash, we see the disk utilization of one or two random
mounted RBD volumes go to 100%; there is no pattern as to which of the
RBD disks starts to act up.
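(We see this in our monitoring graphs; watching it live is just
something like the following, with the %util column pegging at 100 on
one or two rbd devices shortly before the hang:)

  iostat -x 5 | grep rbd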
We have scoured the log files of the Ceph cluster for any signs of
problems but came up empty.
The NFS server has almost no load (compared to regular usage) as most
sync clients are either turned off (weekend) or have given up connecting
to the server.
There hadn't been any configuration changes on the NFS servers prior to
the problems. The only change was the addition of the 23 OSDs.
We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
Our team is completely out of ideas. We have removed the 100TB volume
from the nfs server (we used the downtime to migrate the last data off
of it to one of the smaller volumes). The NFS server has been running
for 30 minutes now (with close to no load) but we don’t really expect it
to make it until tomorrow.
send help
Jens-Christian