Fwd: [ceph-users] Ceph cluster and rbd (communication trouble?)

Hello all,

I'm forwarding this message to ceph-devel@ to see if someone here has an
answer to my problem.

Is there a way to debug this issue? If you need more information,
please don't hesitate to ask.

One more note:
The messages below in the OSD logs are coming in at a rate of a few
messages per minute.

Please keep me in CC, as I'm not subscribed to this list.

Regards,

Matthijs Möhlmann
Cacholong

-------- Original Message --------
Subject: [ceph-users] Ceph cluster and rbd (communication trouble?)
Date: Fri, 14 Jun 2013 13:41:39 +0200
From: Matthijs Möhlmann <matthijs@xxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx

Hello all,

First of all:
Thanks for this wonderful piece of software. It allows us to have
truly redundant storage.

First the situation:
We currently have one storage server running 8 OSDs; it is the only
storage server we have, and it will be joined by a second one in the
near future. This server also runs one MON.
In addition we have two VM servers (running Xen), each of which runs
one MON.

We use RBD for the block devices, which Xen attaches to the VPSes.
After that each VPS can start up and do its thing.
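
For context, this is roughly how each block device is set up (the
pool, image and client names below are placeholders, not our real ones):

rbd create --size 20480 rbd/vps01-disk0
rbd map rbd/vps01-disk0 --id admin --keyring /etc/ceph/ceph.client.admin.keyring
# the mapped device is then handed to Xen as a phy: backend, e.g.
# disk = [ 'phy:/dev/rbd/rbd/vps01-disk0,xvda,w' ]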

The logs on the VM server show the following:
[188542.746229] libceph: osd1 10.150.150.10:6804 socket closed (con
state OPEN)
[188542.747963] rbd: obj_request ffff88010ab8c2c0 was already done
[188542.747963]
[188547.758064] libceph: osd1 10.150.150.10:6804 socket closed (con
state OPEN)
[188547.758940] libceph: osd1 10.150.150.10:6804 socket error on read
[188548.671066] rbd: obj_request ffff88010ab8c2c0 was already done
[188548.671066]
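
If more detail from the kernel client would help, I can enable dynamic
debug for the libceph/rbd modules on the VM server; a rough sketch,
assuming debugfs is mounted and the kernel was built with dynamic debug:

echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
# the extra messages then show up in dmesg / the kernel log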

Looking at the osd log (in this case osd1):
2013-06-14 13:31:24.038029 7f7ba8d89700  0 bad crc in data 1404955113 !=
exp 3510295870
2013-06-14 13:31:24.038599 7f7ba8d89700  0 -- 10.150.150.10:6804/5205 >>
10.150.150.101:0/1542084666 pipe(0x6021400 sd=32 :6804 s=0 pgs=0 cs=0
l=0).accept peer addr is really 10.150.150.101:0/1542084666 (socket is
10.150.150.101:41501/0)
2013-06-14 13:31:24.038714 7f7ba8d89700  0 auth: could not find secret_id=0
2013-06-14 13:31:24.038725 7f7ba8d89700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=0
2013-06-14 13:31:24.038731 7f7ba8d89700  0 -- 10.150.150.10:6804/5205 >>
10.150.150.101:0/1542084666 pipe(0x6021400 sd=32 :6804 s=0 pgs=0 cs=0
l=1).accept: got bad authorizer
2013-06-14 13:31:29.049213 7f7ba8d89700  0 bad crc in data 1760589740 !=
exp 3270696062
2013-06-14 13:31:29.049976 7f7ba8d89700  0 -- 10.150.150.10:6804/5205 >>
10.150.150.101:0/1542084666 pipe(0x799b180 sd=32 :6804 s=0 pgs=0 cs=0
l=0).accept peer addr is really 10.150.150.101:0/1542084666 (socket is
10.150.150.101:41502/0)
2013-06-14 13:31:29.050113 7f7ba8d89700  0 auth: could not find secret_id=0
2013-06-14 13:31:29.050124 7f7ba8d89700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=0
2013-06-14 13:31:29.050129 7f7ba8d89700  0 -- 10.150.150.10:6804/5205 >>
10.150.150.101:0/1542084666 pipe(0x799b180 sd=32 :6804 s=0 pgs=0 cs=0
l=1).accept: got bad authorizer
2013-06-14 13:31:29.961212 7f7ba8d89700  0 -- 10.150.150.10:6804/5205 >>
10.150.150.101:0/1542084666 pipe(0x799af00 sd=32 :6804 s=0 pgs=0 cs=0
l=0).accept peer addr is really 10.150.150.101:0/1542084666 (socket is
10.150.150.101:41503/0)

This happens with all of the OSDs running on the storage machine, so
it is not just this one causing trouble.
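
I can also raise the debug levels on the OSDs at runtime to capture
more context around these messages; roughly something like this
(reverting the values afterwards), with osd.1 just as an example:

ceph tell osd.1 injectargs '--debug-ms 1 --debug-auth 20'
# ... reproduce the problem, then revert:
ceph tell osd.1 injectargs '--debug-ms 0 --debug-auth 1'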

Checking the health of ceph:
root@vms02:~# ceph status
   health HEALTH_OK
   monmap e1: 3 mons at
{a=10.150.150.10:6789/0,b=10.150.150.102:6789/0,c=10.150.150.101:6789/0}, election
epoch 24, quorum 0,1,2 a,b,c
   osdmap e95: 8 osds: 8 up, 8 in
    pgmap v39247: 592 pgs: 592 active+clean; 19870 MB data, 119 GB used,
14773 GB / 14892 GB avail; 11448B/s wr, 0op/s
   mdsmap e1: 0/0/1 up

Obviously something is causing these warnings/errors, but I don't know
what. Can someone help me understand what's going on here?

All our servers run Debian Wheezy with all updates applied.
We install Ceph from the ceph.com repositories.

All servers have the same kernel and software versions installed:
Linux kernel version:
Linux vms02 3.9-1-amd64 #1 SMP Debian 3.9.4-1 x86_64 GNU/Linux

Version of ceph:
0.61.3-1~bpo70+1

Cephx is enabled.
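
Since the OSD log mentions "got bad authorizer", I have also compared
the key the cluster knows with the keyring stored on the VM servers,
roughly like this (client.admin here is just an example, not
necessarily the client we map with):

ceph auth get client.admin
cat /etc/ceph/ceph.client.admin.keyring

As far as I can tell they match.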

Let me know if you need more information.

Regards,

Matthijs Möhlmann
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



