Re: Random CephFS freeze, osd bad authorize reply

Hi Ilya, hi Gregory,
 
All hosts and clients run NTP properly. Still, if the hardware clock (hwclock) of those machines drifts significantly, the system time after boot-up in the morning could be quite far off until NTP has resynced the clock. Maybe that offset, or the correction when NTP resyncs, is causing the issue. I'll look through those machines' log files to see whether they show a clock skew after boot-up.
 
How large an offset would be enough to trigger such issues, so that CephFS freezes indefinitely? (To be clear, it doesn't freeze for a couple of seconds; it freezes indefinitely, or for hours at least.)
 
>The ceph messenger equivalent for this error is "failed verifying
>authorize reply". If you search for that, most of the reports are
>indeed clock skews.
 
Ilya, where am I supposed to find the "ceph messenger equivalent" that shows me what kind of error causes my auth issues, i.e. proves that it's clock-skew related? I couldn't find anything useful in the OSD logs.
 
Is there anything else I could do to find the root cause of this?
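
One thing I could do on the client side is tally the "bad authorize reply" messages per OSD and per OSD address, to see whether every OSD rejects the client or only those on one node. A minimal sketch, assuming the syslog lines have the shape of the one quoted further down in this mail; the function name and the sample lines (hostname, OSD ids, addresses, ports) are made up for illustration:

```python
import re
from collections import Counter

# Pattern modelled on the syslog line quoted in this thread:
#   <hostname> kernel: libceph: osdNN <ip>:<port> bad authorize reply
PATTERN = re.compile(r"libceph: (osd\d+) (\S+):(\d+) bad authorize reply")


def tally_bad_authorize(lines):
    """Count 'bad authorize reply' messages per OSD and per OSD address."""
    per_osd = Counter()
    per_addr = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            osd, addr, _port = match.groups()
            per_osd[osd] += 1
            per_addr[addr] += 1
    return per_osd, per_addr


# Hypothetical sample lines, not taken from a real log.
sample = [
    "client1 kernel: libceph: osd12 172.16.0.11:6804 bad authorize reply",
    "client1 kernel: libceph: osd12 172.16.0.11:6804 bad authorize reply",
    "client1 kernel: libceph: osd3 172.16.0.12:6801 bad authorize reply",
    "client1 kernel: some unrelated message",
]

per_osd, per_addr = tally_bad_authorize(sample)
for osd, count in per_osd.most_common():
    print(osd, count)
for addr, count in per_addr.most_common():
    print(addr, count)
```

On a real client the lines would come from the syslog file instead of the sample list. If the counts are spread evenly over all OSDs, that would point at the client (for example its clock) rather than at a single OSD host.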
 
Thanks,
Tobi
 
 
Sent: Monday, 24 July 2017, 19:31
From: "Ilya Dryomov" <idryomov@xxxxxxxxx>
To: topro@xxxxxx
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Random CephFS freeze, osd bad authorize reply
On Mon, Jul 24, 2017 at 6:35 PM, <topro@xxxxxx> wrote:
> Hi,
>
> I'm running a Ceph cluster which I started back in the bobtail days and have
> kept running and upgrading over the years. It has three nodes, each running
> one MON, 10 OSDs and one MDS. The cluster has one MDS active and two standby.
> The machines are 8-core Opterons with 32GB of ECC RAM each. I'm using it to
> host /home for our clients (about 25) using CephFS, and as an RBD backend for
> a couple of libvirt VMs (about 5).
>
> Currently I'm running 11.2.0 (kraken), and a couple of months ago I started
> experiencing some strange behaviour. Exactly 2 of my ~25 CephFS clients
> (always the same two) keep freezing their /home about one or two hours after
> the first boot in the morning. At the moment of the freeze, syslog starts
> reporting loads of:
>
> _hostname_ kernel: libceph: osdXX 172.16.0.XXX:68XX bad authorize reply
>
> On one of the clients I replaced every single piece of hardware with new
> hardware, so that machine is now completely replaced, including NIC, switch
> and network cabling, and I did a complete OS reinstall. But the user is still
> getting that behaviour. As far as I can tell, key renegotiation is failing
> and the client keeps trying to connect with the old cephx key. But I cannot
> find a reason why this is happening or how to fix it.
>
> The biggest problem: the second affected machine is our CEO's, and if we
> don't fix it I will have a hard time explaining that Ceph is the way to go.
>
> The two affected machines do not share any common network segment other
> than the TOR switch in the Ceph rack, while there are other clients that do
> share a network segment with the affected machines but aren't affected at all.
>
> Google doesn't help me on this one either; it seems no one else is
> experiencing something similar.
>
> The client setup on all clients is Debian Jessie with the 4.9 backports
> kernel, using the kernel client for mounting CephFS. I think the whole thing
> started with a kernel upgrade from one 4.X series to another, but I cannot
> reconstruct that.

This check was merged into 4.10 and backported to various stable
series, including 4.9 (4.9.2, I think). That explains why you started
seeing it.

The ceph messenger equivalent for this error is "failed verifying
authorize reply". If you search for that, most of the reports are
indeed clock skews.

Thanks,

Ilya
 
 
