On Thu, Feb 15, 2018 at 10:28 AM Cary <dynamic.cary@xxxxxxxxx> wrote:
Hello,
I have enabled debugging on my MONs and OSDs to help troubleshoot
these signature check failures. I was watching osd.4's log and saw
these errors when the signature check failure happened.
2018-02-15 18:06:29.235791 7f8bca7de700 1 --
192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).read_bulk peer
close file descriptor 81
2018-02-15 18:06:29.235832 7f8bca7de700 1 --
192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).read_until read
failed
2018-02-15 18:06:29.235841 7f8bca7de700 1 --
192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).process read
tag failed
2018-02-15 18:06:29.235848 7f8bca7de700 1 --
192.168.173.44:6806/72264 >> 192.168.173.42:0/4264467021
conn(0x55f802746000 :6806 s=STATE_OPEN pgs=7 cs=1 l=1).fault on lossy
channel, failing
2018-02-15 18:06:29.235966 7f8bc0853700 2 osd.8 27498 ms_handle_reset
con 0x55f802746000 session 0x55f8063b3180
Could someone please look at this? We have 3 different Ceph clusters
set up and they all have this issue. This cluster is running Gentoo and
Ceph version 12.2.2-r1. The other two clusters are 12.2.2. Exporting
images causes signature check failures, and with larger files it
segfaults as well.
When exporting the image from osd.4, this message shows up as well:
Exporting image: 1% complete...2018-02-15 18:14:05.283708 7f6834277700
0 -- 192.168.173.44:0/122241099 >> 192.168.173.44:6801/72152
conn(0x7f681400ff10 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH
pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
The errors below show up on all OSD/MGR/MON nodes when exporting an image.
Exporting image: 8% complete...2018-02-15 18:15:51.419437 7f2b64ac0700
0 SIGN: MSG 28 Message signature does not match contents.
2018-02-15 18:15:51.419459 7f2b64ac0700 0 SIGN: MSG 28Signature on message:
2018-02-15 18:15:51.419460 7f2b64ac0700 0 SIGN: MSG 28 sig:
8338581684421737157
2018-02-15 18:15:51.419469 7f2b64ac0700 0 SIGN: MSG 28Locally
calculated signature:
2018-02-15 18:15:51.419470 7f2b64ac0700 0 SIGN: MSG 28
sig_check:5913182128308244
2018-02-15 18:15:51.419471 7f2b64ac0700 0 Signature failed.
2018-02-15 18:15:51.419472 7f2b64ac0700 0 --
192.168.173.44:0/3919097436 >> 192.168.173.44:6801/72152
conn(0x7f2b4800ff10 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH
pgs=39 cs=1 l=1).process Signature check failed
Our VMs crash when writing to disk. Libvirt's logs just say the VM
crashed. This is a blocker. Has anyone else seen this? This seems to
be an issue with Ceph Luminous, as we were not having these problems
with Jewel.
When I search through my email, the only two reports of failed signatures are from people who in fact had misconfiguration issues that left one end using signatures and the other end not.
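To rule out that kind of mismatch, one could collect the cephx signature options from each end and compare them. The following is only a minimal sketch: the option names are the standard cephx settings, but the helper `signature_mismatches` and the sample values are hypothetical and not taken from this cluster. The values would have to be gathered separately, for example through the admin socket with something like `ceph daemon osd.4 config get cephx_sign_messages`.

```python
# Hypothetical helper: compare cephx signature settings gathered from two daemons.
SIGNATURE_OPTIONS = (
    "cephx_require_signatures",
    "cephx_cluster_require_signatures",
    "cephx_service_require_signatures",
    "cephx_sign_messages",
)


def signature_mismatches(node_a, node_b):
    """Return {option: (value_a, value_b)} for every option that differs."""
    mismatches = {}
    for option in SIGNATURE_OPTIONS:
        value_a = node_a.get(option)
        value_b = node_b.get(option)
        if value_a != value_b:
            mismatches[option] = (value_a, value_b)
    return mismatches


# Illustrative values only, not read from any real node.
client = {"cephx_sign_messages": "true", "cephx_require_signatures": "false"}
osd = {"cephx_sign_messages": "false", "cephx_require_signatures": "false"}
print(signature_mismatches(client, osd))
```

An empty result means the two ends agree on all four options, which would point away from a configuration mismatch.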
Given that, and since you're on Gentoo and presumably compiled the packages yourself, the most likely explanation I can think of is something that went wrong between your packages and the compilation. :/
I guess you could try switching from libnss to libcryptopp (or vice versa) by recompiling with the relevant makeflags if you want to do something that only involves the Ceph code. Otherwise, do a rebuild?
Sadly I don't think there's much else we can suggest given that nobody has seen this with binary packages blessed by the upstream or a distribution.
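If you do rebuild, it may help to count the failures per peer before and after, so you can tell whether anything changed. Here is a rough sketch, assuming each log entry sits on one line (the excerpts in this thread are wrapped by the mail client) and that the messenger line keeps the format shown above. The function name `count_failures_by_peer` is made up for illustration.

```python
import re
from collections import Counter

# Matches the messenger line that reports a failed check, for example:
# "... >> 192.168.173.44:6801/72152 conn(...).process Signature check failed"
FAILED_CHECK = re.compile(r">> (?P<peer>\S+) conn\(.*Signature check failed")


def count_failures_by_peer(lines):
    """Count 'Signature check failed' messenger lines per peer address."""
    counts = Counter()
    for line in lines:
        match = FAILED_CHECK.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


# Sample lines rebuilt from the excerpt in this thread, unwrapped.
sample = [
    "2018-02-15 18:15:51.419471 7f2b64ac0700 0 Signature failed.",
    "2018-02-15 18:15:51.419472 7f2b64ac0700 0 -- 192.168.173.44:0/3919097436 >> "
    "192.168.173.44:6801/72152 conn(0x7f2b4800ff10 :-1 "
    "s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=39 cs=1 l=1).process "
    "Signature check failed",
]
print(count_failures_by_peer(sample))
```

Feeding it the lines of a real log file (for instance `open(path)`) instead of `sample` would give a per-peer tally to compare across builds.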
-Greg
Cary
-Dynamic
On Thu, Feb 1, 2018 at 7:04 PM, Cary <dynamic.cary@xxxxxxxxx> wrote:
> Hello,
>
> I did not do anything special that I know of. I was just exporting an
> image from Openstack. We have recently upgraded from Jewel 10.2.3 to
> Luminous 12.2.2.
>
> Caps for admin:
> client.admin
> key: CENSORED
> auid: 0
> caps: [mgr] allow *
> caps: [mon] allow *
> caps: [osd] allow *
>
> Caps for Cinder:
> client.cinder
> key: CENSORED
> caps: [mgr] allow r
> caps: [mon] profile rbd, allow command "osd blacklist"
> caps: [osd] profile rbd pool=vms, profile rbd pool=volumes,
> profile rbd pool=images
>
> Caps for MGR:
> mgr.0
> key: CENSORED
> caps: [mon] allow *
>
> I believe this is causing the virtual machines we have running to
> crash. Any advice would be appreciated. Please let me know if I need
> to provide any other details. Thank you,
>
> Cary
> -Dynamic
>
> On Mon, Jan 29, 2018 at 7:53 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Fri, Jan 26, 2018 at 12:14 PM Cary <dynamic.cary@xxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> We are running Luminous 12.2.2. 6 OSD hosts with 12 1TB OSDs, and 64GB
>>> RAM. Each host has a SSD for Bluestore's block.wal and block.db.
>>> There are 5 monitor nodes as well with 32GB RAM. All servers have
>>> Gentoo with kernel, 4.12.12-gentoo.
>>>
>>> When I export an image using:
>>> rbd export pool-name/volume-name /location/image-name.raw
>>>
>>> Messages similar to those below are displayed. The signature check fails
>>> randomly, and sometimes there is a message about a bad authorizer, but
>>> not every time.
>>> The image is still exported successfully.
>>>
>>> 2018-01-24 17:35:15.616080 7fc8d4024700 0 cephx:
>>> verify_authorizer_reply bad nonce got 4552544084014661633 expected
>>> 4552499520046621785 sent 4552499520046621784
>>> 2018-01-24 17:35:15.616098 7fc8d4024700 0 --
>>> 172.21.32.16:0/1412094654 >> 172.21.32.6:6802/6219 conn(0x7fc8b0078a50
>>> :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
>>> l=1)._process_connection failed verifying authorize reply
>>> 2018-01-24 17:35:15.699004 7fc8d4024700 0 SIGN: MSG 2 Message
>>> signature does not match contents.
>>> 2018-01-24 17:35:15.699020 7fc8d4024700 0 SIGN: MSG 2Signature on
>>> message:
>>> 2018-01-24 17:35:15.699021 7fc8d4024700 0 SIGN: MSG 2 sig:
>>> 8189090775647585001
>>> 2018-01-24 17:35:15.699047 7fc8d4024700 0 SIGN: MSG 2Locally
>>> calculated signature:
>>> 2018-01-24 17:35:15.699048 7fc8d4024700 0 SIGN: MSG 2
>>> sig_check:140500325643792
>>> 2018-01-24 17:35:15.699049 7fc8d4024700 0 Signature failed.
>>> 2018-01-24 17:35:15.699050 7fc8d4024700 0 --
>>> 172.21.32.16:0/1412094654 >> 172.21.32.2:6807/153106
>>> conn(0x7fc8bc020870 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH
>>> pgs=26018 cs=1 l=1).process Signature check failed
>>>
>>> Does anyone know what could cause this, and what I can do to fix it?
>>
>>
>> That's in the cephx authentication code and it's indicating that the secure
>> signature sent with the message isn't what the local node thinks it should
>> be. That's pretty odd (a bit flip or something that could actually change it
>> ought to trigger the messaging checksums directly) and I'm not quite sure
>> how it could happen.
>>
>> But, as you've noticed, it retries and apparently succeeds. How did you
>> notice this?
>> -Greg
>>
>>>
>>>
>>> Thank you,
>>>
>>> Cary
>>> -Dynamic
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com