FYI, I get the same error with an OSD too; log excerpt and backtrace below:
-11> 2013-06-25 16:00:37.604042 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5300 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:37.588367) v2 ==== 47+0+0 (3462129666 0 0) 0x4a0ce00 con 0x4a094a0
-10> 2013-06-25 16:00:37.604075 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:37.588367) v2 -- ?+0 0x47196c0 con 0x4a094a0
-9> 2013-06-25 16:00:37.970605 7f0750e18700 10 monclient: tick
-8> 2013-06-25 16:00:37.970615 7f0750e18700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-06-25 16:00:07.970614)
-7> 2013-06-25 16:00:37.970630 7f0750e18700 10 monclient: renew subs? (now: 2013-06-25 16:00:37.970630; renew after: 2013-06-25 16:02:47.970419) -- no
-6> 2013-06-25 16:00:38.626079 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.9 172.18.11.34:0/1788 4862 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:38.613584) v2 ==== 47+0+0 (4007998759 0 0) 0x4efa540 con 0x4f0c580
-5> 2013-06-25 16:00:38.626117 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.34:0/1788 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.613584) v2 -- ?+0 0x4a0ce00 con 0x4f0c580
-4> 2013-06-25 16:00:38.640572 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.0 172.18.11.30:0/10931 5280 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:38.624922) v2 ==== 47+0+0 (350205583 0 0) 0x4acfdc0 con 0x4a09340
-3> 2013-06-25 16:00:38.640606 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10931 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:38.624922) v2 -- ?+0 0x4efa540 con 0x4a09340
-2> 2013-06-25 16:00:39.304307 7f0751f1b700 1 -- 172.18.11.32:6802/1594 <== osd.1 172.18.11.30:0/10964 5301 ==== osd_ping(ping e2200 stamp 2013-06-25 16:00:39.288581) v2 ==== 47+0+0 (4084422642 0 0) 0x93b8c40 con 0x4a094a0
-1> 2013-06-25 16:00:39.304354 7f0751f1b700 1 -- 172.18.11.32:6802/1594 --> 172.18.11.30:0/10964 -- osd_ping(ping_reply e2200 stamp 2013-06-25 16:00:39.288581) v2 -- ?+0 0x4acfdc0 con 0x4a094a0
0> 2013-06-25 16:00:39.829601 7f074e512700 -1 os/FileStore.cc: In function 'int FileStore::lfn_find(coll_t, const hobject_t&, IndexedPath*)' thread 7f074e512700 time 2013-06-25 16:00:39.792543
os/FileStore.cc: 166: FAILED assert(!m_filestore_fail_eio || r != -5)
ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
1: (FileStore::lfn_find(coll_t, hobject_t const&, std::tr1::shared_ptr<CollectionIndex::Path>*)+0x109) [0x7df319]
2: (FileStore::lfn_stat(coll_t, hobject_t const&, stat*)+0x55) [0x7e1005]
3: (FileStore::stat(coll_t, hobject_t const&, stat*, bool)+0x51) [0x7ef001]
4: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&, bool, ThreadPool::TPHandle&)+0x3d1) [0x76e391]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x174) [0x771344]
6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x8a6) [0x772076]
7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbd) [0x70f00d]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x8e384c]
9: (ThreadPool::WorkThread::entry()+0x10) [0x8e4af0]
10: (()+0x7f8e) [0x7f0761dc5f8e]
11: (clone()+0x6d) [0x7f0760077e1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
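For what it's worth, resolving those frame addresses needs the matching ceph-osd binary (and its debug symbols, if installed). Assuming the stock package path, something along these lines should produce a readable dump; 0x7df319 is just the first frame from the trace above:

    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm
    # or resolve a single frame address from the backtrace:
    addr2line -Cfe /usr/bin/ceph-osd 0x7df319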
------------------ Original ------------------
From: "Mike Dawson"<mike.dawson@xxxxxxxxxxxx>;
Date: Wed, Jun 26, 2013 10:50 AM
To: "Darryl Bond"<dbond@xxxxxxxxxxxxx>;
Cc: "ceph-users@xxxxxxxxxxxxxx"<ceph-users@xxxxxxxxxxxxxx>;
Subject: Re: [ceph-users] One monitor won't start after upgrade from 0.61.3 to 0.61.4
I've seen this issue a few times recently. I believe Joao was looking into it
at one point, but I don't know whether it has been resolved (any news, Joao?).
Others have run into it too. Look closely at:
http://tracker.ceph.com/issues/4999
http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21
http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15
I'd recommend you submit this as a bug on the tracker.
It sounds like you have a reliable quorum between mon.a and mon.b; that's good.
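You can confirm that from either surviving monitor; roughly (from memory, cuttlefish syntax):

    ceph quorum_status
    # or just check the quorum shown on the monmap line of:
    ceph -s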
The workaround that has worked for me is to remove mon.c, then re-add it.
Assuming your monitor leveldb stores aren't too large, the process is rather
quick (a rough command sketch follows the links). Follow the instructions at:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors
then
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors
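From memory, and assuming the default mon data directory (/var/lib/ceph/mon/ceph-c) plus the address from your monmap, the sequence is roughly the following; double-check it against the docs above before running anything:

    # on ceph3
    service ceph stop mon.c            # in case it is still trying to start
    ceph mon remove c                  # run from a node that has quorum
    mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c.old   # keep the old store around

    # rebuild and re-add mon.c
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    ceph-mon -i c --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add c 192.168.6.103:6789
    service ceph start mon.c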
- Mike
On 6/25/2013 10:34 PM, Darryl Bond wrote:
> Upgrading a cluster from 0.61.3 to 0.61.4 with 3 monitors. The cluster had
> been successfully upgraded from bobtail to cuttlefish and then from
> 0.61.2 to 0.61.3. There have been no changes to ceph.conf.
>
> Node mon.a: upgraded; monitors a, b and c OK after the upgrade
> Node mon.b: upgraded; monitors a and b OK after the upgrade (note that c
> was not available, even though I hadn't touched it)
> Node mon.c: very slow to install the upgrade; RAM was tight for some
> reason and the mon process was using half the RAM
> Node mon.c: shut down mon.c
> Node mon.c: performed the upgrade
> Node mon.c: restarted ceph - mon.c will not start
>
>
> service ceph start mon.c
>
> === mon.c ===
> Starting Ceph mon.c on ceph3...
> [23992]: (33) Numerical argument out of domain
> failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file
> /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf '
> Starting ceph-create-keys on ceph3...
>
> health HEALTH_WARN 1 mons down, quorum 0,1 a,b
> monmap e1: 3 mons at
> {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
> election epoch 14224, quorum 0,1 a,b
> osdmap e1342: 18 osds: 18 up, 18 in
> pgmap v4058788: 5448 pgs: 5447 active+clean, 1
> active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB /
> 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s
> mdsmap e1: 0/0/1 up
>
> Set debug mon = 20
> Nothing going into the logs other than the assertion:
>
> --- begin dump of recent events ---
> 0> 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal
> (Aborted) **
> in thread 7fd5e81b57c0
>
> ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
> 1: /usr/bin/ceph-mon() [0x596fe2]
> 2: (()+0xf000) [0x7fd5e7820000]
> 3: (gsignal()+0x35) [0x7fd5e619fba5]
> 4: (abort()+0x148) [0x7fd5e61a1358]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d]
> 6: (()+0x5eeb6) [0x7fd5e6a97eb6]
> 7: (()+0x5eee3) [0x7fd5e6a97ee3]
> 8: (()+0x5f10e) [0x7fd5e6a9810e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x64a6aa]
> 10: /usr/bin/ceph-mon() [0x65f916]
> 11: /usr/bin/ceph-mon() [0x6960e9]
> 12: (pick_addresses(CephContext*)+0x8d) [0x69624d]
> 13: (main()+0x1a8a) [0x49786a]
> 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05]
> 15: /usr/bin/ceph-mon() [0x499a69]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 20/20 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-mon.c.log
> --- end dump of recent events ---
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com