Hi Phil,

It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons). I see the same behaviour on a new cluster when creating the mons:
http://tracker.ceph.com/issues/20529

What is your CPU model? (See the quick check sketched further down in this mail.)

In your log:

sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]
sh[1869]: 2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]: 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]: 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]: 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]: 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std:
sh[1869]: 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb:
sh[1869]: 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]: 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]: 10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]: 11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]: 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]: 13: (main()+0xe58) [0x558560b1da78]
sh[1869]: 14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]: 15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]

----- Original Message -----
From: "Phil Schwarz" <infolist@xxxxxxxxxxxxxx>
To: "Udo Lembke" <ulembke@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

On 15/07/2017 at 23:09, Udo Lembke wrote:
> Hi,
>
> On 15.07.2017 16:01, Phil Schwarz wrote:
>> Hi,
>> ...
>>
>> While investigating, I wondered about my config.
>> A question about the /etc/hosts file: should I use the private
>> replication LAN IPs or the public ones?
> The private replication LAN! And the pve-cluster should use a separate
> network (NICs) if possible.
>
> Udo

OK, thanks Udo.

After investigating, I did the following:
- set noout on the OSDs
- stopped the CPU-pegging LXC containers
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> deleted the partition table
parted /dev/sdc --> mklabel msdos (the disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=10M count=1000

and then recreated the OSD via the web GUI. Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this bluestore OSD; have a look at the attached file.

I'm going to give recreating the OSD with filestore a try.
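For reference, a minimal sketch of what that could look like with the Luminous ceph-disk CLI instead of the Proxmox GUI (option names vary between releases, so treat this as an outline rather than a recipe):

# wipe leftover partition tables and filesystem signatures from the earlier attempts
ceph-disk zap /dev/sdc
# prepare the disk as a filestore OSD (journal colocated on the same disk here)
ceph-disk prepare --filestore /dev/sdc
# activate the data partition that prepare just created
ceph-disk activate /dev/sdc1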
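Coming back to the CPU model question near the top of this mail, a minimal sketch of how to check the model and the instruction-set extensions the CPU advertises (which flags actually matter depends on how the ceph/rocksdb binaries were built; sse4_2 and avx are only the usual suspects for "Illegal instruction" crashes on old Xeons/Opterons):

# CPU model of the affected node
grep -m1 'model name' /proc/cpuinfo
# instruction-set extensions the CPU advertises
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -Ex 'sse4_1|sse4_2|avx|avx2|pclmulqdq'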
Thanks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com