FYI, I get the same error with an OSD too; log excerpt and backtrace below:
-11> 2013-06-25 16:00:37.604042 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5300 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:37.588367) v2 ==== 47+0+0 (3462129666 0 0) 0x4a0ce00 con 0x4a094a0
-10> 2013-06-25 16:00:37.604075 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:37.588367) v2 -- ?+0 0x47196c0 con 0x4a094a0
-9> 2013-06-25 16:00:37.970605 7f0750e18700 10 monclient: tick
-8> 2013-06-25 16:00:37.970615 7f0750e18700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-06-25 16:00:07.970614)
-7> 2013-06-25 16:00:37.970630 7f0750e18700 10 monclient: renew subs? (now: 2013-06-25 16:00:37.970630; renew after: 2013-06-25 16:02:47.970419) -- no
-6> 2013-06-25 16:00:38.626079 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.9 172.18.11.34:0/1788 4862 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:38.613584) v2 ==== 47+0+0 (4007998759 0 0) 0x4efa540 con 0x4f0c580
-5> 2013-06-25 16:00:38.626117 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.34:0/1788 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.613584) v2 -- ?+0 0x4a0ce00 con 0x4f0c580
-4> 2013-06-25 16:00:38.640572 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.0 172.18.11.30:0/10931 5280 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:38.624922) v2 ==== 47+0+0 (350205583 0 0) 0x4acfdc0 con 0x4a09340
-3> 2013-06-25 16:00:38.640606 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10931 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.624922) v2 -- ?+0 0x4efa540 con 0x4a09340
-2> 2013-06-25 16:00:39.304307 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5301 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:39.288581) v2 ==== 47+0+0 (4084422642 0 0) 0x93b8c40 con 0x4a094a0
-1> 2013-06-25 16:00:39.304354 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:39.288581) v2 -- ?+0 0x4acfdc0 con 0x4a094a0
0> 2013-06-25 16:00:39.829601 7f074e512700 -1 os/FileStore.cc: In function 'int FileStore::lfn_find(coll_t, const hobject_t&, IndexedPath*)' thread 7f074e512700 time 2013-06-25 16:00:39.792543
os/FileStore.cc: 166: FAILED assert(!m_filestore_fail_eio || r != -5)
ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
1: (FileStore::lfn_find(coll_t, hobject_t const&, std::tr1::shared_ptr<CollectionIndex::Path>*)+0x109) [0x7df319]
2: (FileStore::lfn_stat(coll_t, hobject_t const&, stat*)+0x55) [0x7e1005]
3: (FileStore::stat(coll_t, hobject_t const&, stat*, bool)+0x51) [0x7ef001]
4: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&, bool, ThreadPool::TPHandle&)+0x3d1) [0x76e391]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x174) [0x771344]
6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x8a6) [0x772076]
7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbd) [0x70f00d]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x8e384c]
9: (ThreadPool::WorkThread::entry()+0x10) [0x8e4af0]
10: (()+0x7f8e) [0x7f0761dc5f8e]
11: (clone()+0x6d) [0x7f0760077e1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
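For what it's worth, resolving those frame addresses needs the matching ceph-osd binary (and its debug symbols, if installed). Assuming the stock package path, something along these lines should produce a readable dump; 0x7df319 is just the first frame from the trace above:

    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm
    # or resolve a single frame address from the backtrace:
    addr2line -Cfe /usr/bin/ceph-osd 0x7df319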
------------------ Original ------------------
From: "Mike Dawson"<mike.dawson@xxxxxxxxxxxx>;
Date: Wed, Jun 26, 2013 10:50 AM
To: "Darryl Bond"<dbond@xxxxxxxxxxxxx>;
Cc: "ceph-users@xxxxxxxxxxxxxx"<ceph-users@xxxxxxxxxxxxxx>;
Subject: Re: [ceph-users] One monitor won't start after upgrade from 0.61.3 to 0.61.4
I've seen this issue a few times recently. I believe Joao was looking into it
at one point, but I don't know whether it has been resolved (any news, Joao?).
Others have run into it too. Look closely at:
http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15
I'd recommend you submit this as a bug on the tracker.
It sounds like you have a reliable quorum between mon.a and mon.b; that's good.
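You can confirm that from either surviving monitor; roughly (from memory, cuttlefish syntax):

    ceph quorum_status
    # or just check the quorum shown on the monmap line of:
    ceph -s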
The workaround that has worked for me is to remove mon.c, then re-add it.
Assuming your monitor leveldb stores aren't too large, the process is rather
quick (a rough command sketch follows the links). Follow the instructions at:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors
then
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors
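From memory, and assuming the default mon data directory (/var/lib/ceph/mon/ceph-c) plus the address from your monmap, the sequence is roughly the following; double-check it against the docs above before running anything:

    # on ceph3
    service ceph stop mon.c            # in case it is still trying to start
    ceph mon remove c                  # run from a node that has quorum
    mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c.old   # keep the old store around

    # rebuild and re-add mon.c
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    ceph-mon -i c --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add c 192.168.6.103:6789
    service ceph start mon.c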
- Mike
On 6/25/2013 10:34 PM, Darryl Bond wrote:
> Upgrading a cluster from 0.61.3 to 0.61.4 with 3 monitors. The cluster had
> been successfully upgraded from bobtail to cuttlefish and then from
> 0.61.2 to 0.61.3. There have been no changes to ceph.conf.
>
> Node mon.a: upgraded; monitors a, b and c OK after the upgrade
> Node mon.b: upgraded; monitors a and b OK after the upgrade (note that c
> was not available, even though I hadn't touched it)
> Node mon.c: very slow to install the upgrade; RAM was tight for some
> reason and the mon process was using half the RAM
> Node mon.c: shut down mon.c
> Node mon.c: performed the upgrade
> Node mon.c: restarted ceph - mon.c will not start
>
>
> service ceph start mon.c
>
> === mon.c ===
> Starting Ceph mon.c on ceph3...
> [23992]: (33) Numerical argument out of domain
> failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file
> /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
> Starting ceph-create-keys on ceph3...
>
> health HEALTH_WARN 1 mons down, quorum 0,1 a,b
> monmap e1: 3 mons at
> {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
> election epoch 14224, quorum 0,1 a,b
> osdmap e1342: 18 osds: 18 up, 18 in
> pgmap v4058788: 5448 pgs: 5447 active+clean, 1
> active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
> 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
> mdsmap e1: 0/0/1 up
>
> Set debug mon = 20
> Nothing going into the logs other than the assertion:
>
> --- begin dump of recent events ---
> 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
> (Aborted) **
> in thread 7fd5e81b57c0
>
> ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
> 1: /usr/bin/ceph-mon() [0x596fe2]
> 2: (()+0xf000) [0x7fd5e7820000]
> 3: (gsignal()+0x35) [0x7fd5e619fba5]
> 4: (abort()+0x148) [0x7fd5e61a1358]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
> 6: (()+0x5eeb6) [0x7fd5e6a97eb6]
> 7: (()+0x5eee3) [0x7fd5e6a97ee3]
> 8: (()+0x5f10e) [0x7fd5e6a9810e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x64a6aa]
> 10: /usr/bin/ceph-mon() [0x65f916]
> 11: /usr/bin/ceph-mon() [0x6960e9]
> 12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
> 13: (main()+0x1a8a) [0x49786a]
> 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
> 15: /usr/bin/ceph-mon() [0x499a69]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 20/20 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-mon.c.log
> --- end dump of recent events ---
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com