Re: ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2


 



On Sat, Aug 27, 2016 at 3:01 AM, Francois Lafont
<francois.lafont.1978@xxxxxxxxx> wrote:
> Hi,
>
> I had exactly the same error on my production Ceph client node, with
> Jewel 10.2.1 in my case.
>
> On the client node:
> - Ubuntu 14.04
> - kernel 3.13.0-92-generic
> - ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> - cephfs via _ceph-fuse_
>
> On the cluster node:
> - Ubuntu 14.04
> - kernel 3.13.0-92-generic
> - ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>
> The error occurred during the execution of a very basic Python (2.7.6)
> script that makes some os.makedirs(...) and os.chown(...) calls.
>
> Just in case, the logs are below. I'm sorry they are not verbose at all
> and so probably useless for you...
>
> Which settings should I put in my client and cluster configuration to
> get relevant logs if the same error happens again?
>
> Regards.
> François Lafont
>
> Here are the logs:
>
> 1. In the client node: http://francois-lafont.ac-versailles.fr/misc/ceph-client.cephfs.log.1.gz
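
The operations described above boil down to roughly the following sequence
against the ceph-fuse mount. This is only an illustrative sketch, not the
original script; the mount point /mnt/cephfs, the paths, and the uid/gid
are hypothetical:

 import os

 # Hypothetical ceph-fuse mount point; the real path is not given above.
 MOUNT = '/mnt/cephfs'

 path = os.path.join(MOUNT, 'projects', 'demo', 'data')
 if not os.path.isdir(path):
     os.makedirs(path, 0o755)   # create the directory tree on CephFS
 os.chown(path, 1000, 1000)     # then change ownership (uid/gid illustrative)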

Ha, yep, that's one of the bugs Goncalo found:

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (()+0x299152) [0x7f91398dc152]
 2: (()+0x10330) [0x7f9138bbb330]
 3: (Client::get_root_ino()+0x10) [0x7f91397df6c0]
 5: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175) [0x7f91397dd3d5]
 5: (()+0x19ac09) [0x7f91397ddc09]
 6: (()+0x14b45) [0x7f91391f7b45]
 7: (()+0x1522b) [0x7f91391f822b]
 8: (()+0x11e49) [0x7f91391f4e49]
 9: (()+0x8184) [0x7f9138bb3184]
 10: (clone()+0x6d) [0x7f913752237d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


So that fix will be in the next Jewel release, if it's not already fixed in 10.2.2.
-Greg
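
As for the question above about which settings produce useful logs: the
usual approach is to raise the client-side debug levels while trying to
reproduce the problem. A minimal sketch of what that might look like in
the client's ceph.conf (the values shown are the commonly suggested ones,
adjust as needed):

 [client]
     log file = /var/log/ceph/ceph-client.$name.log
     debug client = 20
     debug ms = 1

Level 20 for the client is very verbose, so it's usually only left on
while reproducing; on the MDS side, debug mds can be raised the same way.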

>
> 2. On the (active) MDS node:
>
> ----%<----%<----%<----%<----%<----%<----%<----%<----
> ~$ sudo zcat /var/log/ceph/ceph-mds.ceph02.log.1.gz
> 2016-08-22 15:02:03.799037 7f3f9adc1700  0 -- 10.0.2.102:6800/2186 >> 192.168.23.11:0/431481110 pipe(0x7f3fb3a87400 sd=22 :6800 s=2 pgs=64 cs=1 l=0 c=0x7f3fb5f10900).fault with nothing to send, going to standby
> 2016-08-22 15:02:40.236001 7f3f9f7d3700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.503993 secs
> 2016-08-22 15:02:40.236026 7f3f9f7d3700  0 log_channel(cluster) log [WRN] : slow request 34.503993 seconds old, received at 2016-08-22 15:02:05.731897: client_request(client.1442720:650326 getattr pAsLsXsFs #1000001b6d0 2016-08-22 15:02:05.731515) currently failed to rdlock, waiting
> 2016-08-22 15:07:00.245269 7f3f9f7d3700  0 log_channel(cluster) log [INF] : closing stale session client.1433176 192.168.23.11:0/431481110 after 304.132797
> 2016-08-22 15:23:07.970215 7f3f9adc1700  0 -- 10.0.2.102:6800/2186 >> 192.168.23.11:0/2607326748 pipe(0x7f3fff365400 sd=22 :6800 s=2 pgs=8 cs=1 l=0 c=0x7f3fb5f10a80).fault, server, going to standby
> 2016-08-22 15:28:05.281489 7f3f9f7d3700  0 log_channel(cluster) log [INF] : closing stale session client.1537178 192.168.23.11:0/2607326748 after 300.588323
> ----%<----%<----%<----%<----%<----%<----%<----%<----
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



