Hi all,

we have a ceph cluster with 7 OSD nodes (Debian Jessie, chosen for the patched tcmalloc, running ceph 0.94), which we are expanding with one further node. For the new node we use puppet with Debian 7.8, because ceph 0.92.2 doesn't install on Jessie (the upgrade to 0.94.1 worked on the other nodes, but 0.94.2 doesn't look clean because the ceph package is still at 0.94.1).

The ceph.conf is the same cluster-wide, and the OSDs on all nodes were initialized with ceph-deploy (with only a few exceptions). All OSDs use ext4 (switched from xfs while the cluster ran ceph 0.80.7), and "filestore xattr use omap = true" is set in ceph.conf.

I'm wondering why the omap format differs between the nodes. The new wheezy node uses .sst files:

ls -lsa /var/lib/ceph/osd/ceph-92/current/omap/
...
2084 -rw-r--r-- 1 root root 2131113 Jul 20 17:45 000098.sst
2084 -rw-r--r-- 1 root root 2131913 Jul 20 17:45 000099.sst
2084 -rw-r--r-- 1 root root 2130623 Jul 20 17:45 000111.sst
...

whereas the jessie nodes use .ldb (LevelDB) files:

ls -lsa /var/lib/ceph/osd/ceph-1/current/omap/
...
2084 -rw-r--r-- 1 root root 2130468 Jul 20 22:33 000080.ldb
2084 -rw-r--r-- 1 root root 2130827 Jul 20 22:33 000081.ldb
2084 -rw-r--r-- 1 root root 2130171 Jul 20 22:33 000088.ldb
...
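For comparing the omap formats across nodes, a small sketch like this may help; the function name is my own and the default base path is only an assumption about your layout:

```shell
#!/bin/sh
# Sketch: count .sst vs .ldb omap files per OSD data dir, to spot
# OSDs with mixed or old LevelDB table formats.
# count_omap_types is a hypothetical helper, not a ceph tool.
count_omap_types() {
    base=${1:-/var/lib/ceph/osd}    # assumed default OSD data path
    for d in "$base"/ceph-*/current/omap; do
        [ -d "$d" ] || continue     # skip if nothing matches the glob
        sst=$(find "$d" -maxdepth 1 -name '*.sst' | wc -l | tr -d ' ')
        ldb=$(find "$d" -maxdepth 1 -name '*.ldb' | wc -l | tr -d ' ')
        echo "$d: $sst sst, $ldb ldb"
    done
}
count_omap_types "$@"
```

Run once per node (or via your config management) and compare the per-OSD counts; an OSD showing both extensions still carries old-format tables.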
On some OSDs I also found old .sst files dating from the wheezy/ceph 0.87 days:

ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.sst
2096 -rw-r--r-- 1 root root 2142558 Apr  3 15:59 /var/lib/ceph/osd/ceph-23/current/omap/016722.sst
2092 -rw-r--r-- 1 root root 2141968 Apr  3 15:59 /var/lib/ceph/osd/ceph-23/current/omap/016723.sst
2092 -rw-r--r-- 1 root root 2141679 Apr  3 15:59 /var/lib/ceph/osd/ceph-23/current/omap/016724.sst
2096 -rw-r--r-- 1 root root 2142376 Apr  3 15:59 /var/lib/ceph/osd/ceph-23/current/omap/016725.sst
2096 -rw-r--r-- 1 root root 2142227 Apr  3 15:59 /var/lib/ceph/osd/ceph-23/current/omap/016726.sst
2092 -rw-r--r-- 1 root root 2141369 Apr 20 21:23 /var/lib/ceph/osd/ceph-23/current/omap/019470.sst

but many more .ldb files:

ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.ldb | wc -l
128

The config shows leveldb as the omap backend for OSDs on both kinds of node (old, and new with .sst files):

ceph --admin-daemon /var/run/ceph/ceph-osd.92.asok config show | grep -i omap
  "filestore_omap_backend": "leveldb",
  "filestore_debug_omap_check": "false",
  "filestore_omap_header_cache_size": "1024",

Normally I wouldn't care about this, but when I tried to switch the first OSD node to a clean puppet install, I found that none of its OSDs would start. The error message looks a bit like http://tracker.ceph.com/issues/11429, but that shouldn't happen here, because the puppet install has ceph 0.94.2.
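The single-socket query above can be extended to all OSDs on a host; a hedged sketch, assuming the default admin-socket path:

```shell
#!/bin/sh
# Sketch: print the configured omap backend of every running OSD on
# this host via its admin socket (default socket location assumed).
for sock in /var/run/ceph/ceph-osd.*.asok; do
    [ -S "$sock" ] || continue    # no running OSDs: loop is a no-op
    printf '%s: ' "$sock"
    ceph --admin-daemon "$sock" config show | grep filestore_omap_backend
done
```

Note that this only reports the configured backend; as seen above, it can say "leveldb" on every node even when the on-disk file extensions differ.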
Error message during start:

cat ceph-osd.0.log
2015-07-20 16:51:29.435081 7fb47b126840  0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 9803
2015-07-20 16:51:29.457776 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) backend generic (magic 0xef53)
2015-07-20 16:51:29.460470 7fb47b126840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-07-20 16:51:29.460479 7fb47b126840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-07-20 16:51:29.485120 7fb47b126840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-07-20 16:51:29.572670 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) limited size xattrs
2015-07-20 16:51:29.889599 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-07-20 16:51:31.517179 7fb47b126840  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-07-20 16:51:31.552366 7fb47b126840  0 osd.0 151644 crush map has features 2303210029056, adjusting msgr requires for clients
2015-07-20 16:51:31.552375 7fb47b126840  0 osd.0 151644 crush map has features 2578087936000 was 8705, adjusting msgr requires for mons
2015-07-20 16:51:31.552382 7fb47b126840  0 osd.0 151644 crush map has features 2578087936000, adjusting msgr requires for osds
2015-07-20 16:51:31.552394 7fb47b126840  0 osd.0 151644 load_pgs
2015-07-20 16:51:42.682678 7fb47b126840 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7fb47b126840 time 2015-07-20 16:51:42.680036
osd/PG.cc: 2825: FAILED assert(values.size() == 2)
 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcdb572]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0x7b2) [0x908742]
 3: (OSD::load_pgs()+0x734) [0x7e9064]
 4: (OSD::init()+0xdac) [0x7ed8fc]
 5: (main()+0x253e) [0x79069e]
 6: (__libc_start_main()+0xfd) [0x7fb47898fead]
 7: /usr/bin/ceph-osd() [0x7966b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
...

Normally I would say that if one OSD node dies, I can simply reinstall the OS and ceph and I'm back in business... but this looks bad to me. Unfortunately, after I switched back to the old system disk, the system also fails to start 9 of the OSDs (only three of the big OSDs are running well).

What is the best solution here? Empty one node (crush weight 0), do a fresh reinstall of OS/ceph, and reinitialise all of its OSDs? That will take a long, long time, because we have 173 TB in this cluster...

I'd be happy about any hints.

Udo

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com