Re: ceph-mgr fails to restart after upgrade to mimic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I upgraded to 13.2.4 and, unsurprisingly, it did not solve the problem. ceph-mgr still fails. What else do I need to look at to try to solve this?

Thanks.

On Fri, Jan 4, 2019 at 3:20 PM Randall Smith <rbsmith@xxxxxxxxx> wrote:
Some more info that may or may not matter. :-) First off, I am running 13.2.3 on Ubuntu Xenial (ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)).

Next, when I try running ceph-mgr with --no-mon-config, the app core dumps.

     0> 2019-01-04 14:56:56.416 7fbcc71db380 -1 /build/ceph-13.2.3/src/common/Timer.cc: In function 'virtual SafeTimer::~SafeTimer()' thread 7fbcc71db380 time 2019-01-04 14:56:56.419012
/build/ceph-13.2.3/src/common/Timer.cc: 50: FAILED assert(thread == __null)
                                                
 ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fbcbe5093c2]
 2: (()+0x2e5587) [0x7fbcbe509587]
 3: (()+0x2e12de) [0x7fbcbe5052de]                                    
 4: (MgrClient::~MgrClient()+0xc4) [0x5594f4]
 5: (MgrStandby::~MgrStandby()+0x14d) [0x55063d]
 6: (main()+0x24b) [0x49446b]                                                 
 7: (__libc_start_main()+0xf0) [0x7fbcbcf51830]
 8: (_start()+0x29) [0x497dc9]   
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---                                                                           
   0/ 5 none                      
   0/ 1 lockdep                              
   0/ 1 context                                 
   1/ 1 crush                                   
   1/ 5 mds                                     
   1/ 5 mds_balancer                                                                                                                                                                                
   1/ 5 mds_locker                                                                                                       
   1/ 5 mds_log                                                                             
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
  10/10 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file
--- end dump of recent events ---
*** Caught signal (Aborted) **
 in thread 7fbcc71db380 thread_name:ceph-mgr
 ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
 1: /usr/bin/ceph-mgr() [0x63ebd0]
 2: (()+0x11390) [0x7fbcbd819390]
 3: (gsignal()+0x38) [0x7fbcbcf66428]
 4: (abort()+0x16a) [0x7fbcbcf6802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7fbcbe509510]
 6: (()+0x2e5587) [0x7fbcbe509587]
 7: (()+0x2e12de) [0x7fbcbe5052de]
 8: (MgrClient::~MgrClient()+0xc4) [0x5594f4]
 9: (MgrStandby::~MgrStandby()+0x14d) [0x55063d]
 10: (main()+0x24b) [0x49446b]
 11: (__libc_start_main()+0xf0) [0x7fbcbcf51830]
 12: (_start()+0x29) [0x497dc9]
2019-01-04 14:56:56.420 7fbcc71db380 -1 *** Caught signal (Aborted) **
 in thread 7fbcc71db380 thread_name:ceph-mgr

 ceph version 13.2.3 (9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279) mimic (stable)
 1: /usr/bin/ceph-mgr() [0x63ebd0]
 2: (()+0x11390) [0x7fbcbd819390]
 3: (gsignal()+0x38) [0x7fbcbcf66428]
 4: (abort()+0x16a) [0x7fbcbcf6802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7fbcbe509510]
 6: (()+0x2e5587) [0x7fbcbe509587]
 7: (()+0x2e12de) [0x7fbcbe5052de]
 8: (MgrClient::~MgrClient()+0xc4) [0x5594f4]
 9: (MgrStandby::~MgrStandby()+0x14d) [0x55063d]
 10: (main()+0x24b) [0x49446b]
 11: (__libc_start_main()+0xf0) [0x7fbcbcf51830]
 12: (_start()+0x29) [0x497dc9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


On Fri, Jan 4, 2019 at 1:53 PM Randall Smith <rbsmith@xxxxxxxxx> wrote:
I think this is the relevant section of the debug log. There's no AUTH_NONE error which would make things easy. You can see the same "Invalid argument" error that I'm seeing in the mgr debug output. The malformed request feels like a compatibility or protocol communication issue. 

2019-01-04 13:41:58.972 7f88950f5700 10 mon.07@1(peon) e27 ms_verify_authorizer 192.168.253.148:0/3301807723 client protocol 0                                                                      
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon) e27 _ms_dispatch new session 0x40a58c0 MonSession(client.? 192.168.253.148:0/3301807723 is open , features 0x3ffddff8ffa4fffb (luminous)) fea$ures 0x3ffddff8ffa4fffb
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 preprocess_query auth(proto 0 26 bytes epoch 0) v1 from client.? 192.168.253.148:0/3301807723                                    
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 prep_auth() blob_size=26
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 AuthMonitor::assign_global_id m=auth(proto 0 26 bytes epoch 0) v1 mon=1/3 last_allocated=12307825 max_global_id=12353896         
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 next_global_id should be 12307828                                                                                                
2019-01-04 13:41:58.972 7f8890143700  2 mon.07@1(peon) e27 send_reply 0x5449180 0x4ee1c00 auth_reply(proto 2 0 (0) Success) v1                                                                      
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 preprocess_query auth(proto 2 2 bytes epoch 0) v1 from client.? 192.168.253.148:0/3301807723                                     
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon).auth v87697 prep_auth() blob_size=2
2019-01-04 13:41:58.972 7f8890143700  0 mon.07@1(peon).auth v87697 caught error when trying to handle auth request, probably malformed request                                                      
2019-01-04 13:41:58.972 7f8890143700  2 mon.07@1(peon) e27 send_reply 0x30dc500 0x5caa280 auth_reply(proto 2 -22 (22) Invalid argument) v1                                                          
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon) e27 ms_handle_reset 0x4102a00 192.168.253.148:0/3301807723                                                                                   
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon) e27 reset/close on session client.? 192.168.253.148:0/3301807723                                                                             
2019-01-04 13:41:58.972 7f8890143700 10 mon.07@1(peon) e27 remove_session 0x40a58c0 client.? 192.168.253.148:0/3301807723 features 0x3ffddff8ffa4fffb

On Fri, Jan 4, 2019 at 12:32 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
You can also get more data by checking what the monitor logs for that
manager on the connect attempt (if you turn up its debug mon or debug
ms settings). If one of your managers is behaving, I'd examine its
configuration file and compare to the others. For instance, that
"Invalid argument" might mean the manager is trying to use "AUTH_NONE"
(no CephX) and the monitors aren't allowing that.
-Greg

On Fri, Jan 4, 2019 at 6:26 AM Randall Smith <rbsmith@xxxxxxxxx> wrote:
>
> Greetings,
>
> I'm upgrading my cluster from luminous to mimic. I've upgraded my monitors and am attempting to upgrade the mgrs. Unfortunately, after an upgrade the mgr daemon exits immediately with error code 1.
>
> I've tried running ceph-mgr in debug mode to try to see what's happening but the output (below) is a bit cryptic for me. It looks like authentication might be failing but it was working prior to the upgrade.
>
> I do have "auth supported = cephx" in the global section of ceph.conf.
>
> What do I need to do to fix this?
>
> Thanks.
>
> /usr/bin/ceph-mgr -f --cluster ceph --id 8 --setuser ceph --setgroup ceph -d --debug_ms 5
> 2019-01-04 07:01:38.457 7f808f83f700  2 Event(0x30c42c0 nevent=5000 time_id=1).set_owner idx=0 owner=140190140331776
> 2019-01-04 07:01:38.457 7f808f03e700  2 Event(0x30c4500 nevent=5000 time_id=1).set_owner idx=1 owner=140190131939072
> 2019-01-04 07:01:38.457 7f808e83d700  2 Event(0x30c4e00 nevent=5000 time_id=1).set_owner idx=2 owner=140190123546368
> 2019-01-04 07:01:38.457 7f809dd5b380  1  Processor -- start
> 2019-01-04 07:01:38.477 7f809dd5b380  1 -- - start start
> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.147:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6780 con 0
> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.148:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6a00 con 0
> 2019-01-04 07:01:38.481 7f808e83d700  1 -- 192.168.253.148:0/1359135487 learned_addr learned my addr 192.168.253.148:0/1359135487
> 2019-01-04 07:01:38.481 7f808e83d700  2 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got newly_a$
> ked_seq 0 vs out_seq 0
> 2019-01-04 07:01:38.481 7f808f03e700  2 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got newly_a$
> ked_seq 0 vs out_seq 0
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 seq
> 1 0x30c5440 mon_map magic: 0 v1
> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 seq
> 1 0x30c5680 mon_map magic: 0 v1
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 seq
> 2 0x32a6780 auth_reply(proto 2 0 (0) Success) v1
> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 seq
> 2 0x32a6a00 auth_reply(proto 2 0 (0) Success) v1
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.1 192.168.253.147:6789/0 1 ==== mon_map magic: 0 v1 ==== 370+0+0 (3034216899 0 0) 0x30c5440 con 0x332ce00
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.2 192.168.253.148:6789/0 1 ==== mon_map magic: 0 v1 ==== 370+0+0 (3034216899 0 0) 0x30c5680 con 0x332d500
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.1 192.168.253.147:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3430158761 0 0) 0x32a6780 con 0x33$
> ce00
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 --> 192.168.253.147:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x32a6f00 con 0
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.2 192.168.253.148:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3242503871 0 0) 0x32a6a00 con 0x33$
> d500
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 --> 192.168.253.148:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x32a6780 con 0
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 seq
> 3 0x32a6f00 auth_reply(proto 2 -22 (22) Invalid argument) v1
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.1 192.168.253.147:6789/0 3 ==== auth_reply(proto 2 -22 (22) Invalid argument) v1 ==== 24+0+0 (882932531 0 0) 0x32a6f$
> 0 con 0x332ce00
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN pgs=74172 cs=1 l=1).mark_down
> 2019-01-04 07:01:38.481 7f808e03c700  2 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN pgs=74172 cs=1 l=1)._stop
> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 seq
> 3 0x32a6780 auth_reply(proto 2 -22 (22) Invalid argument) v1
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== mon.2 192.168.253.148:6789/0 3 ==== auth_reply(proto 2 -22 (22) Invalid argument) v1 ==== 24+0+0 (1359424806 0 0) 0x32a6$
> 80 con 0x332d500
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN pgs=74275 cs=1 l=1).mark_down
> 2019-01-04 07:01:38.481 7f808e03c700  2 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN pgs=74275 cs=1 l=1)._stop
>
> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- 192.168.253.148:0/1359135487 shutdown_connections
> 2019-01-04 07:01:38.481 7f809dd5b380  5 -- 192.168.253.148:0/1359135487 shutdown_connections mark down 192.168.253.148:6789/0 0x332d500
> 2019-01-04 07:01:38.481 7f809dd5b380  5 -- 192.168.253.148:0/1359135487 shutdown_connections mark down 192.168.253.147:6789/0 0x332ce00
> 2019-01-04 07:01:38.481 7f809dd5b380  5 -- 192.168.253.148:0/1359135487 shutdown_connections delete 0x332ce00
> 2019-01-04 07:01:38.481 7f809dd5b380  5 -- 192.168.253.148:0/1359135487 shutdown_connections delete 0x332d500
> 2019-01-04 07:01:38.485 7f809dd5b380  1 -- 192.168.253.148:0/1359135487 shutdown_connections
> 2019-01-04 07:01:38.485 7f809dd5b380  1 -- 192.168.253.148:0/1359135487 wait complete.
> 2019-01-04 07:01:38.485 7f809dd5b380  1 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:0/1359135487 conn(0x332c000 :-1 s=STATE_NONE pgs=0 cs=0 l=0).mark_down
> 2019-01-04 07:01:38.485 7f809dd5b380  2 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:0/1359135487 conn(0x332c000 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._stop
> failed to fetch mon config (--no-mon-config to skip)
>
> --
> Randall Smith
> Computing Services
> Adams State University
> http://www.adams.edu/
> 719-587-7741
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741


--
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741


--
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux