Re: Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

Hi, now that I'm back at work, I'm facing my problem again.

@Alexandre: the CPU is an AMD Turion (HP MicroServer N54L).
This server runs only OSDs and LXC containers; no mon runs on it.
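
For reference, here is how the CPU model and flags can be checked on that node (just a quick sketch; sse4_2 is only one example of an extension these old Turions may lack):

grep -m1 'model name' /proc/cpuinfo                                    # CPU model string
grep -q sse4_2 /proc/cpuinfo && echo "sse4_2 present" || echo "sse4_2 missing"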

After rebooting the whole cluster and attempting to add the same disk for a third time:

ceph osd tree
ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.47226 root default
-2 3.65898     host jon
 1 2.29999         osd.1          up  1.00000          1.00000
 3 1.35899         osd.3          up  1.00000          1.00000
-3 0.34999     host daenerys
 0 0.34999         osd.0          up  1.00000          1.00000
-4 1.64969     host tyrion
 2 0.44969         osd.2          up  1.00000          1.00000
 4 1.20000         osd.4          up  1.00000          1.00000
-5 1.81360     host jaime
 5 1.81360         osd.5          up  1.00000          1.00000
 6       0 osd.6                down        0          1.00000
 7       0 osd.7                down        0          1.00000
 8       0 osd.8                down        0          1.00000

osd.6, osd.7 and osd.8 are three attempts with the same disk, all hitting the same issue (the disk itself isn't faulty).


Any clue?
I'm going to try creating the OSD on this disk in another server soon.
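
Before retrying I will probably also clean up the leftover down/weight-0 entries; a rough sketch, assuming Luminous's purge command and the IDs shown above:

for id in 6 7 8; do
    ceph osd purge "$id" --yes-i-really-mean-it    # removes the OSD from the CRUSH map, auth keys and osdmap
done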

Thanks.

Best regards
On 26/07/2017 at 15:53, Alexandre DERUMIER wrote:
Hi Phil,


It's currently possible that rocksdb has a bug with some old CPUs (old Xeons and some Opterons).
I have the same behaviour with a new cluster when creating mons:
http://tracker.ceph.com/issues/20529

What is your CPU model?

in your log:

sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]
sh[1869]:  2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]:  3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5585615788b1]
sh[1869]:  4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x55856145ca4c]
sh[1869]:  5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x558561423e6f]
sh[1869]:  6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std:
sh[1869]:  7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb:
sh[1869]:  8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5585610af76e]
sh[1869]:  9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5585610b0d27]
sh[1869]:  10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]:  11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]:  12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]:  13: (main()+0xe58) [0x558560b1da78]
sh[1869]:  14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]:  15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal instruction) **
sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]

----- Original Message -----
From: "Phil Schwarz" <infolist@xxxxxxxxxxxxxx>
To: "Udo Lembke" <ulembke@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Sunday, 16 July 2017 15:04:16
Subject: Re: Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

On 15/07/2017 at 23:09, Udo Lembke wrote:
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
Hi,
...

While investigating, I wondered about my config.
A question about the /etc/hosts file:
should I use the private_replication_LAN IPs or the public ones?
The private_replication_LAN! And the pve-cluster should use another network
(NICs) if possible.

Udo

OK, thanks Udo.
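
If I understand correctly, that maps to something like this in ceph.conf (only a sketch, with made-up subnets for illustration):

[global]
    public network  = 192.168.0.0/24   # client/pve-facing LAN
    cluster network = 10.10.10.0/24    # private replication LAN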

After investigation, I did the following:
- Set the noout flag on the OSDs (commands below)
- Stopped the CPU-pegging LXC containers
- Checked the cabling
- Restarted the whole cluster
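
(The noout flag itself is just the standard command, cleared again once everything was back up:)

ceph osd set noout      # prevent OSDs from being marked out during the restart
ceph osd unset noout    # restore normal behaviour afterwards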

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> deleted the partition table
parted /dev/sdc --> mklabel msdos (disk came from a ZFS FreeBSD system)
dd if=/dev/null of=/dev/sdc --> (note: /dev/null supplies no data, so this writes nothing)
ceph-disk zap /dev/sdc --> zapped the disk's partition table
dd if=/dev/zero of=/dev/sdc bs=10M count=1000 --> zeroed the first ~10 GB

Then I recreated the OSD via the web GUI.
Same result: the OSD is known by the node, but not by the cluster.
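
Next time I may prepare the disk by hand instead of via the GUI, to catch the full error on the console; a rough sketch, assuming ceph-disk (which, as far as I know, is what the GUI wraps):

ceph-disk --verbose prepare --bluestore /dev/sdc   # prepare the disk as a BlueStore OSD, with verbose output
journalctl -u 'ceph-osd@*' -n 100                  # then check the latest OSD unit messages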

The logs seem to show an issue with this BlueStore OSD; have a look at the attached file.

I'm going to try recreating the OSD using FileStore.
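
For the FileStore attempt, something like this should do it (again only a sketch, not yet tested on this box):

ceph-disk --verbose prepare --filestore --fs-type xfs /dev/sdc   # FileStore OSD on XFS instead of BlueStore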

Thanks


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

