Re: Random CephFS freeze, osd bad authorize reply

On Mon, Jul 24, 2017 at 6:35 PM,  <topro@xxxxxx> wrote:
> Hi,
>
> I'm running a Ceph cluster which I started back in bobtail age and kept it
> running/upgrading over the years. It has three nodes, each running one MON,
> 10 OSDs and one MDS. The cluster has one MDS active and two standby.
> Machines are 8-core Opterons with 32GB of ECC RAM each. I'm using it to host
> our clients (about 25) /home using CephFS and as a RBD Backend for a couple
> of libvirt VMs (about 5).
>
> Currently I'm running 11.2.0 (kraken) and a couple of months ago I started
> experiencing some strange behaviour. Exactly 2 of my ~25 CephFS Clients
> (always the same two) keep freezing their /home about 1 or two hours after
> first boot in the morning. At the moment of freeze, syslog starts reporting
> loads of:
>
> _hostname_ kernel: libceph: osdXX 172.16.0.XXX:68XX bad authorize reply
>
> On one of the clients I replaced every single piece of hardware with new
> hardware, so that machine is completely replaced now including NIC, Switch,
> Network-Cabling and did a complete OS reinstall. But the user is still
> getting that behaviour. As far as I could get, it seems that key
> renegotiation is failing and client tries to keep connecting with old cephx
> key. But I cannot find a reason for why this is happening and how to fix it.
>
> Biggest problem, the second affected machine is the one of our CEO and if we
> won't fix it I will have a hard time explaining that Ceph is the way to go.
>
> The two affected machines do not share any common piece of network segment
> other than TOR-Switch in Ceph Rack, while there are other clients that do
> share network segment with affected machines but aren't affected at all.
>
> Google won't help me either on this one, seems no one else is experiencing
> something similar.
>
> Client setup on all clients is Debian Jessie with 4.9 Backports kernel,
> using kernel client for mounting CephFS. I think the whole thing started
> with a kernel upgrade from one 4.X series to another, but I cannot
> reconstruct it.

The check behind this error was merged into 4.10 and backported to
various stable series, including 4.9 (4.9.2, I think).  That explains
why you started seeing it.

The ceph messenger equivalent for this error is "failed verifying
authorize reply".  If you search for that, most of the reports are
indeed clock skews.
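
To see whether the failures are spread across all OSDs or limited to a few, it may help to tally the kernel messages per OSD. The following is only a rough sketch: it assumes the syslog lines look like the one quoted above, with real OSD ids and addresses in place of the XX placeholders, and the function name and sample lines are made up for illustration. If every OSD on every node shows up, something on the client side (such as its clock) seems a more likely suspect than a single misbehaving OSD.

```python
import re
from collections import Counter

# Assumed line shape, based on the message quoted in this thread:
#   <host> kernel: libceph: osd12 172.16.0.101:6804 bad authorize reply
PATTERN = re.compile(r"libceph: (osd\d+) (\S+) bad authorize reply")


def count_bad_authorize(lines):
    """Return a Counter mapping (osd, address) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Hypothetical sample lines; real input would come from the client's syslog.
sample = [
    "client1 kernel: libceph: osd12 172.16.0.101:6804 bad authorize reply",
    "client1 kernel: libceph: osd12 172.16.0.101:6804 bad authorize reply",
    "client1 kernel: libceph: osd3 172.16.0.102:6800 bad authorize reply",
    "client1 kernel: some unrelated message",
]

for (osd, addr), n in count_bad_authorize(sample).most_common():
    print(n, osd, addr)
```

The same function accepts an open file object, so it can be pointed at the client's syslog instead of the sample list.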

Thanks,

                Ilya