Hi Phil,

It's possible that rocksdb currently has a bug with some old CPUs (old Xeons and some Opterons). I see the same behaviour on a new cluster when creating the mons:
http://tracker.ceph.com/issues/20529

What is your CPU model? (See the quick check sketched further down in this mail.)

In your log:

sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]
sh[1869]: 2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]: 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]: 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]: 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]: 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std:
sh[1869]: 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb:
sh[1869]: 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]: 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]: 10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]: 11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]: 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]: 13: (main()+0xe58) [0x558560b1da78]
sh[1869]: 14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]: 15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]: in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]: ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]: 1: (()+0x9bc562) [0x558561169562]

----- Original Message -----
From: "Phil Schwarz" <infolist@xxxxxxxxxxxxxx>
To: "Udo Lembke" <ulembke@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

On 15/07/2017 at 23:09, Udo Lembke wrote:
> Hi,
>
> On 15.07.2017 16:01, Phil Schwarz wrote:
>> Hi,
>> ...
>>
>> While investigating, I wondered about my config.
>> A question about the /etc/hosts file: should I use the private
>> replication LAN IPs or the public ones?
> The private replication LAN! And the pve-cluster should use a separate
> network (NICs) if possible.
>
> Udo

OK, thanks Udo.

After investigating, I did the following:
- set noout on the OSDs
- stopped the CPU-pegging LXC containers
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> deleted the partition table
parted /dev/sdc --> mklabel msdos (the disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc
ceph-disk zap /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=10M count=1000

and then recreated the OSD via the web GUI. Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this bluestore OSD; have a look at the attached file.

I'm going to give recreating the OSD with filestore a try.
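For reference, a minimal sketch of what that could look like with the Luminous ceph-disk CLI instead of the Proxmox GUI (option names vary between releases, so treat this as an outline rather than a recipe):

# wipe leftover partition tables and filesystem signatures from the earlier attempts
ceph-disk zap /dev/sdc
# prepare the disk as a filestore OSD (journal colocated on the same disk here)
ceph-disk prepare --filestore /dev/sdc
# activate the data partition that prepare just created
ceph-disk activate /dev/sdc1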
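Coming back to the CPU model question near the top of this mail, a minimal sketch of how to check the model and the instruction-set extensions the CPU advertises (which flags actually matter depends on how the ceph/rocksdb binaries were built; sse4_2 and avx are only the usual suspects for "Illegal instruction" crashes on old Xeons/Opterons):

# CPU model of the affected node
grep -m1 'model name' /proc/cpuinfo
# instruction-set extensions the CPU advertises
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -Ex 'sse4_1|sse4_2|avx|avx2|pclmulqdq'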
Thanks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com