Re: Monitor segfault

Eino Tuominen <eino@xxxxxx> · Mon, 14 Sep 2015 10:21:03 +0000

Hello,

I'm pretty sure I did it just like you were trying to do. The cluster has since been upgraded a couple of times. Unfortunately I can't remember when I created that particular faulty rule.

-- 
  Eino Tuominen

> Kefu Chai <kchai@xxxxxxxxxx> wrote on 14.9.2015 at 11:57:
> 
> Eino,
> 
> ----- Original Message -----
>> From: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
>> To: "Eino Tuominen" <eino@xxxxxx>
>> Cc: ceph-users@xxxxxxxx, "Kefu Chai" <kchai@xxxxxxxxxx>, joao@xxxxxxx
>> Sent: Monday, August 31, 2015 4:45:40 PM
>> Subject: Re:  Monitor segfault
>> 
>>> On Mon, Aug 31, 2015 at 9:33 AM, Eino Tuominen <eino@xxxxxx> wrote:
>>> Hello,
>>> 
>>> I'm getting a segmentation fault error from the monitor of our test
>>> cluster. The cluster was in a bad state because I have recently removed
>>> three hosts from it. Now I started cleaning it up and first marked the
>>> removed osd's as lost (ceph osd lost), and then I tried to remove the
>>> osd's from the crush map (ceph osd crush remove). After a few successful
>>> commands the cluster ceased to respond. One monitor seemed to stay up (it
> 
> Eino, I was looking at your issue at http://tracker.ceph.com/issues/12876.
> It seems to be due to a faulty crush rule; see http://tracker.ceph.com/issues/12876#note-5.
> May I know how you managed to inject it into the monitor? I tried using
> 
> $ ceph osd setcrushmap -i new-crush-map
> Error EINVAL: Failed to parse crushmap: *** Caught signal (Segmentation fault) **
> 
> but no luck.
> 
>>> was responding through the admin socket), so I stopped it and used
>>> monmaptool to remove the failed monitor from the monmap. But, now also the
>>> second monitor segfaults when I try to start it.
>>> 
>>> The cluster does not have any important data, but I'd like to get the
>>> monitors up as practice. How do I debug this further?
>>> 
>>> Linux cephmon-test-02 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00
>>> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> The output:
>>> 
>>> -2> 2015-08-31 10:28:52.606894 7f8ab493c8c0  0 log_channel(cluster) log
>>> [INF] : pgmap v1845959: 6288 pgs: 55 inactive, 153 active, 473
>>> active+clean, 1 stale+active+undersized+degraded+remapped, 455
>>> stale+incomplete, 272 peering, 145 stale+down+peering, 6
>>> degraded+remapped, 1 active+recovery_wait+degraded, 70
>>> undersized+degraded+remapped, 504 incomplete, 206
>>> active+undersized+degraded+remapped, 2 stale+active+clean+inconsistent,
>>> 101 down+peering, 59 active+undersized+degraded+remapped+backfilling, 294
>>> remapped, 11 active+undersized+degraded+remapped+wait_backfill, 1264
>>> active+remapped, 5 stale+undersized+degraded, 1
>>> active+undersized+remapped, 1 stale+active+undersized+degraded, 23
>>> stale+remapped+incomplete, 297 remapped+peering, 1
>>> active+remapped+wait_backfill, 1 degraded, 32 undersized+degraded, 454
>>> active+undersized+degraded, 7 active+recovery_wait+degraded+remapped,
>>> 1134 stale+active+clean, 142 remapped+incomplete, 115 stale+peering, 3
>>> active+recovering+degraded+remapped;
>>>  10014 GB data, 5508 GB used, 41981 GB / 47489 GB avail; 33343/19990223
>>>  objects degraded (0.167%); 45721/19990223 objects misplaced (0.229%)
>>>    -1> 2015-08-31 10:28:52.606969 7f8ab493c8c0  0 log_channel(cluster) log
>>>    [INF] : mdsmap e1: 0/0/1 up
>>>     0> 2015-08-31 10:28:52.617974 7f8ab493c8c0 -1 *** Caught signal
>>>     (Segmentation fault) **
>>> in thread 7f8ab493c8c0
>>> 
>>> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>> 1: /usr/bin/ceph-mon() [0x9a98aa]
>>> 2: (()+0x10340) [0x7f8ab3a3d340]
>>> 3: (crush_do_rule()+0x292) [0x85ada2]
>>> 4: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int,
>>> std::allocator<int> >*, int*, unsigned int*) const+0xeb) [0x7a85cb]
>>> 5: (OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >*,
>>> int*) const+0x94) [0x7a8a64]
>>> 6: (OSDMap::remove_redundant_temporaries(CephContext*, OSDMap const&,
>>> OSDMap::Incremental*)+0x317) [0x7ab8f7]
>>> 7: (OSDMonitor::create_pending()+0xf69) [0x60fdb9]
>>> 8: (PaxosService::_active()+0x709) [0x6047b9]
>>> 9: (PaxosService::election_finished()+0x67) [0x604ad7]
>>> 10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>,
>>> std::allocator<int> >&, unsigned long, MonCommand const*, int,
>>> std::set<int, std::less<int>, std::allocator<int> > const*)
>>> +0x236) [0x5c34a6]
>>> 11: (Monitor::win_standalone_election()+0x1cc) [0x5c388c]
>>> 12: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
>>> 13: (Monitor::init()+0xd5) [0x5c4645]
>>> 14: (main()+0x2470) [0x5769c0]
>>> 15: (__libc_start_main()+0xf5) [0x7f8ab1ec7ec5]
>>> 16: /usr/bin/ceph-mon() [0x5984f7]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>> to interpret this.
>> 
>> Can you get a core dump, open it in gdb, and provide the output of the
>> "backtrace" command?
>> 
>> The cluster is for some reason trying to create new PGs and something
>> is going wrong; I suspect the monitors aren't handling the loss of PGs
>> properly. :/
>> -Greg
>> 
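
For anyone following the thread, the backtrace Greg asks for could be collected roughly as sketched below. This is only a minimal sketch, not an official Ceph procedure: the function name mon_backtrace is hypothetical, the binary path is taken from the crash log above, and the core file location is an assumption (it depends on the system's core_pattern setting).

```shell
# Hypothetical helper (not part of Ceph): print backtraces for every thread
# from a ceph-mon core file using gdb in batch mode.
mon_backtrace() {
    mon_binary="${1:-/usr/bin/ceph-mon}"   # binary path as shown in the crash log
    mon_core="${2:-core}"                  # assumed core file location
    if [ ! -r "$mon_core" ]; then
        echo "core file not readable: $mon_core" >&2
        return 1
    fi
    # --batch exits after running the -ex commands; "bt" is short for "backtrace".
    gdb --batch -ex "thread apply all bt" "$mon_binary" "$mon_core"
}
```

Core dumps usually have to be enabled first in the shell that starts the monitor (ulimit -c unlimited) before reproducing the crash. After that, something like "mon_backtrace /usr/bin/ceph-mon ./core > backtrace.txt" would produce output that can be attached to the tracker issue. Frames will be more readable if matching debug symbols are installed.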