Hi,
I created this paste: http://paste.debian.net/999172/ But its
expiration date is too short, so I also posted it here:
https://pastebin.com/QfrE71Dg
What I want to mention is that there is no known cause for what is
happening. It is true that the clocks go out of sync on reboot because
of a few milliseconds of skew, but ntp corrects that quickly. There are
no network issues, and the log of the OSD is in the output.
On the other OSDs I only see these errors, which are becoming more and
more frequent:
2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster)
log [ERR] : 10.7a shard 2: soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head
data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304
uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster)
log [ERR] : 10.7a shard 6: soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head
data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304
uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637777 7f0feff7f700 -1 log_channel(cluster)
log [ERR] : 10.7a soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head:
failed to pick suitable auth object
Basically, the digests do not match. Someone told me that this can be
caused by a faulty disk, so I replaced the offending drive, and
now I see the same thing happening on the new disk. OK, but this
thread is not about finding the source of that problem. That will be
done later.
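To compare the shards more easily, the scrub errors quoted above can be pulled apart with a small script. This is only a sketch: the regular expression is fitted to the exact lines in this mail, the function and field names are made up for illustration, and I am assuming the first digest is the one read from the shard and the second is the one recorded in the authoritative object info.

```python
import re

# Fitted to the shard-level "[ERR]" lines quoted in this mail only.
# Field names (shard_digest, oi_digest) are my own labels, not Ceph's.
SHARD_ERR = re.compile(
    r"\[ERR\] : (?P<pg>\S+) shard (?P<shard>\d+): soid (?P<soid>\S+) "
    r"data_digest (?P<shard_digest>0x[0-9a-f]+) != "
    r"data_digest (?P<oi_digest>0x[0-9a-f]+) from auth oi"
)


def parse_shard_errors(text):
    """Return one dict per shard-level digest mismatch found in text."""
    # Undo the mail client's line wrapping so each entry is contiguous.
    flat = " ".join(text.split())
    return [m.groupdict() for m in SHARD_ERR.finditer(flat)]


SAMPLE = """
2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster)
log [ERR] : 10.7a shard 2: soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head
data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi
"""

for err in parse_shard_errors(SAMPLE):
    print(err["pg"], err["shard"], err["shard_digest"], err["oi_digest"])
```

In the log above, shards 2 and 6 report the same digest as each other, and both differ from the recorded one.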
This thread is about trying to recover an OSD that looks fine to the
object store tool. The question is:
Why does it break here?
starting osd.4
at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal
osd/PG.cc: In function 'static int
PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*,
ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03
13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x80)
[0x5556eab28790] <--------- HERE
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5556ea4e6601]
3: (OSD::load_pgs()+0x75a) [0x5556ea43a8aa]
4: (OSD::init()+0x2026) [0x5556ea445ca6]
5: (main()+0x2ef1) [0x5556ea3b7301]
6: (__libc_start_main()+0xf0) [0x7f467886b830]
7: (_start()+0x29) [0x5556ea3f8b09]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In
function 'static int PG::peek_map_epoch(ObjectStore*, spg_t,
epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time
2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
So it looks like the offending code is this one:
  int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
  if (r == 0) {
    assert(values.size() == 2);   <------ Here
    // sanity check version
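A minimal sketch of how I read that check, assuming `keys` holds exactly two entries: the lookup returns success, but fewer than two of the requested keys come back, so the assert fires. The dict-based store, the function name and the key names below are placeholders for illustration, not Ceph's actual API or on-disk key names.

```python
def peek_keys(omap, wanted):
    """Look up the wanted keys in omap and insist that all of them exist."""
    # Like a keyed lookup: keys that are absent are silently left out.
    values = {k: omap[k] for k in wanted if k in omap}
    # Same shape as assert(values.size() == 2): every requested key must be there.
    if len(values) != len(wanted):
        missing = sorted(set(wanted) - set(values))
        raise AssertionError("missing keys: %s" % missing)
    return values


WANTED = ("infover", "epoch")  # placeholder key names

healthy = {"infover": b"\x08", "epoch": b"\x21\x0f", "other": b""}
damaged = {"infover": b"\x08"}  # one of the two keys is gone

print(sorted(peek_keys(healthy, WANTED)))
try:
    peek_keys(damaged, WANTED)
except AssertionError as exc:
    print("assert fired:", exc)
```

If that reading is right, the PG metadata object for at least one PG on this OSD is missing a key, even though the lookup itself succeeds.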
The object store tool, however, runs against the same OSD without any
problem, as you can see here:
ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3
2017-12-05 09:18:25.885258 7f5dd8b94a40 0
filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic
0x58465342)
2017-12-05 09:18:25.885715 7f5dd8b94a40 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
config option
2017-12-05 09:18:25.885734 7f5dd8b94a40 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4)
detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore
seek data hole' config option
2017-12-05 09:18:25.885755 7f5dd8b94a40 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4)
detect_features: splice is supported
2017-12-05 09:18:25.910484 7f5dd8b94a40 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4)
detect_features: syncfs(2) syscall fully supported (by glibc and
kernel)
2017-12-05 09:18:25.910545 7f5dd8b94a40 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature:
extsize is disabled by conf
2017-12-05 09:18:26.639796 7f5dd8b94a40 0
filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD
journal mode: checkpoint is not enabled
2017-12-05 09:18:26.650560 7f5dd8b94a40 1 journal _open
/dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes,
directio = 1, aio = 1
2017-12-05 09:18:26.662606 7f5dd8b94a40 1 journal _open
/dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes,
directio = 1, aio = 1
2017-12-05 09:18:26.664869 7f5dd8b94a40 1
filestore(/var/lib/ceph/osd/ceph-4) upgrade
Cluster fsid=9028f4da-0d77-462b-be9b-dbdf7fa57771
Supported features: compat={},rocompat={},incompat={1=initial
feature set(~v.18),2=pginfo object,3=object
locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded
objects,12=transaction hints,13=pg meta object}
On-disk features: compat={},rocompat={},incompat={1=initial
feature set(~v.18),2=pginfo object,3=object
locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded
objects,12=transaction hints,13=pg meta object}
Performing list-pgs operation
....
On 04/12/17 12:21, Ronny Aasen wrote:
ceph health detail