Re: Cephfs: large files hang

Gregory Farnum <gfarnum@...> writes:
> 
> What's the full output of "ceph -s"?
> 
> The only time the MDS issues these "stat" ops on objects is during MDS
> replay, but the bit where it's blocked on "reached_pg" in the OSD
> makes it look like your OSD is just very slow. (Which could
> potentially make the MDS back up far enough to get zapped by the
> monitors, but in that case it's probably some kind of misconfiguration
> issue if they're all hitting it.)
> -Greg
> 

Thanks for the suggestions.  Here's the current messy output of "ceph -s":

    cluster ab8969a6-8b3e-497a-97da-ff06a5476e12
     health HEALTH_WARN
            8 pgs down
            15 pgs incomplete
            15 pgs stuck inactive
            15 pgs stuck unclean
            238 requests are blocked > 32 sec
     monmap e1: 3 mons at
{0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}
            election epoch 42334, quorum 0,1,2 0,1,2
     mdsmap e78771: 1/1/1 up {0=1=up:active}, 2 up:standby, 1
up:oneshot-replay(laggy or crashed)
     osdmap e194472: 58 osds: 58 up, 58 in
      pgmap v12811210: 1464 pgs, 3 pools, 25856 GB data, 8873 kobjects
            52265 GB used, 55591 GB / 105 TB avail
                1447 active+clean
                   8 down+incomplete
                   7 incomplete
                   2 active+clean+scrubbing
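
In case it's useful, here's roughly what I plan to run next to map those
stuck pgs to specific osds (just the standard queries; I haven't captured
the output yet):

    # list the down/incomplete pgs and the osds in their acting sets
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean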


The spurious "oneshot-replay" mds entry was caused by a typo in the mds name
when I tried earlier to do a "ceph-mds --journal-check".

I'm currently trying to copy a large file off the ceph filesystem, and it's
hung after 12582912 kB (12 GiB).  The osd log is telling me things like:

2015-12-18 09:25:22.698124 7f5c0540a700  0 log_channel(cluster) log [WRN] :
slow request 3840.705492 seconds old, received at 2015-12-18
08:21:21.992542: osd_op(mds.0.14959:1257 100010a7ba7.00000000 [create
0~0,setxattr parent (293)] 0.beb25de8 ondisk+write+known_if_redirected
e194470) currently reached_pg
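
If it would help, I can also ask the osd itself what those blocked ops are
doing via its admin socket (I'm using osd.406 below since that's the osd on
this server; the id would be whichever osd logged the slow request):

    # on the osd host: ops currently queued/blocked in the osd
    ceph daemon osd.406 dump_ops_in_flight
    # recently completed slow ops, with per-stage event timestamps
    ceph daemon osd.406 dump_historic_ops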

dmesg, etc., show no errors for the osd disk or anything else, and the load
on the osd server is negligible:

   09:53:01 up 17:54,  1 user,  load average: 0.05, 0.43, 0.42
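
To rule out the disk underneath that osd, I can also watch the raw device
latency alongside the latencies the osds themselves report, along these
lines:

    # per-device utilization/latency on the osd host
    iostat -x 5
    # commit/apply latency as reported by each osd
    ceph osd perf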

When logged into the osd server, I can browse around on the osd's filesystem
with no sluggishness:

ls /var/lib/ceph/osd/ceph-406/current
0.10c_head  0.4d_head   1.164_head  1.a0_head   2.190_head  commit_op_seq
0.10_head   0.57_head   1.18a_head  1.a3_head   2.46_head   meta
0.151_head  0.9a_head   1.18c_head  1.e7_head   2.4b_head   nosnap
0.165_head  0.9f_head   1.191_head  1.f_head    2.55_head   omap
0.18b_head  0.a1_head   1.47_head   2.10a_head  2.9d_head
0.18d_head  0.a4_head   1.4c_head   2.14f_head  2.9f_head
0.192_head  0.e8_head   1.56_head   2.163_head  2.a2_head
0.1b2_head  1.10b_head  1.99_head   2.189_head  2.e6_head
0.48_head   1.150_head  1.9e_head   2.18b_head  2.e_head
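
To push the disk a bit harder than an ls, I can time a raw read of one of
the on-disk object files (the pg directory below is just the first one that
happens to live on this osd):

    # read one object file end-to-end to check raw read speed
    obj=$(find /var/lib/ceph/osd/ceph-406/current/0.10c_head -type f | head -1)
    dd if="$obj" of=/dev/null bs=4M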

ifconfig shows only negligible error and drop counts on the osd server
(public and cluster networks):

eth0      Link encap:Ethernet  HWaddr 00:25:90:67:2A:2C  
          inet addr:192.168.1.23  Bcast:192.168.3.255  Mask:255.255.252.0
          inet6 addr: fe80::225:90ff:fe67:2a2c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13016012 errors:1 dropped:6 overruns:0 frame:1
          TX packets:12839326 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1515148248 (1.4 GiB)  TX bytes:1533480424 (1.4 GiB)
          Interrupt:16 Memory:fa9e0000-faa00000 

eth1      Link encap:Ethernet  HWaddr 00:25:90:67:2A:2D  
          inet addr:192.168.12.23  Bcast:192.168.15.255  Mask:255.255.252.0
          inet6 addr: fe80::225:90ff:fe67:2a2d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:59263760 errors:0 dropped:18476 overruns:0 frame:0
          TX packets:129010105 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:60511361818 (56.3 GiB)  TX bytes:173505625103 (161.5 GiB)
          Interrupt:17 Memory:faae0000-fab00000 

Snooping with wireshark, I see traffic between osds on the cluster network,
and traffic between clients and osds on the public network.
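
Given the dropped counter on eth1, I'll also watch the NIC-level statistics
to see whether the drops are still climbing, and sanity-check the path MTU
on the cluster network (the peer address below is just another osd host in
my 192.168.12.0/22 range):

    # detailed NIC counters for the cluster-network interface
    ethtool -S eth1 | grep -iE 'drop|err'
    # non-fragmenting ping at the largest payload a 1500-byte MTU allows
    ping -c 5 -M do -s 1472 192.168.12.24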

The "incomplete" pgs are associated with a dead osd that's been removed from
the cluster for a long time (since before the current problem).
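
To confirm that's what's keeping them incomplete, I can query one of those
pgs and look at its recovery state, which should show whether it's still
waiting to probe the removed osd (the pg id below is a placeholder; I'd
substitute one actually reported as incomplete):

    # the recovery_state section shows what the pg is blocked on
    ceph pg 0.10c query
    # the peering info lists the osds it still wants to probe
    ceph pg 0.10c query | grep -A 10 down_osds_we_would_probe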

I thought this problem might be due to something wrong in the 4.* kernel, but
I've reverted the ceph cluster to the kernel it was running the last time I
know things were working (3.19.3-1.el6.elrepo.x86_64), and the behavior is
the same.

I'm still looking for something that might tell me what's causing the osd
requests to hang.

Bryan





