Re: Another OSD broken today. How can I recover it?

Hi,

I created this paste: http://paste.debian.net/999172/. Its expiration date was too short, so I also posted it here: https://pastebin.com/QfrE71Dg.

I want to stress that there is no known cause for what is happening. It is true that the clock desynchronizes on reboot by a few milliseconds, but NTP corrects it quickly. There are no network issues, and the OSD's log is included in the pastes above.

On the other OSDs I only see the errors that are becoming more and more common:

2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 2: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 6: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637777 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head: failed to pick suitable auth object

Basically, the data digests do not match. Someone told me this can be caused by a faulty disk, so I replaced the offending drive, but the new disk is now showing the same errors. In any case, this thread is not about tracking down the source of the problem; that will be done later.
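
As an aside (since diagnosis is for later): on jewel, scrub inconsistencies like these can be inspected from the command line. A minimal sketch, assuming PG 10.7a from the log above:

rados list-inconsistent-obj 10.7a --format=json-pretty   # show which shards disagree and their digests
ceph pg repair 10.7a                                     # ask the primary to rewrite the bad copy

Note that the "failed to pick suitable auth object" line above means no authoritative copy could be chosen, so repair may refuse to act on that particular object.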

This thread is about trying to recover an OSD that looks fine to the object store tool. Here is the situation:


Why does it break here?



starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5556eab28790]                                 <--------- HERE
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5556ea4e6601]
 3: (OSD::load_pgs()+0x75a) [0x5556ea43a8aa]
 4: (OSD::init()+0x2026) [0x5556ea445ca6]
 5: (main()+0x2ef1) [0x5556ea3b7301]
 6: (__libc_start_main()+0xf0) [0x7f467886b830]
 7: (_start()+0x29) [0x5556ea3f8b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)


So it looks like the offending code is this (osd/PG.cc, in PG::peek_map_epoch):

  // peek_map_epoch requests two omap keys from the pgmeta object
  // (the version and epoch keys), so exactly two values must come back:
  int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
  if (r == 0) {
    assert(values.size() == 2);     <------ Here

    // sanity check version
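
That assert fires when the pgmeta object returns fewer omap values than the two keys requested. As a rough way to narrow down which PG has damaged metadata (a sketch only; paths as above, and whether --op info trips over the same keys as the OSD's startup path is an assumption on my part):

for pg in $(ceph-objectstore-tool --op list-pgs --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3); do
  ceph-objectstore-tool --op info --pgid "$pg" \
      --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3 >/dev/null \
    || echo "suspect pgmeta: $pg"     # a PG whose info the tool cannot read is a prime suspect
done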


Meanwhile, the object store tool runs against it without any problem, as you can see here:


ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3
2017-12-05 09:18:25.885258 7f5dd8b94a40  0 filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic 0x58465342)
2017-12-05 09:18:25.885715 7f5dd8b94a40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-12-05 09:18:25.885734 7f5dd8b94a40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-12-05 09:18:25.885755 7f5dd8b94a40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: splice is supported
2017-12-05 09:18:25.910484 7f5dd8b94a40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-12-05 09:18:25.910545 7f5dd8b94a40  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: extsize is disabled by conf
2017-12-05 09:18:26.639796 7f5dd8b94a40  0 filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-12-05 09:18:26.650560 7f5dd8b94a40  1 journal _open /dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.662606 7f5dd8b94a40  1 journal _open /dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.664869 7f5dd8b94a40  1 filestore(/var/lib/ceph/osd/ceph-4) upgrade
Cluster fsid=9028f4da-0d77-462b-be9b-dbdf7fa57771
Supported features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints,13=pg meta object}
On-disk features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints,13=pg meta object}
Performing list-pgs operation
....
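
For the actual recovery, since the tool can still read the OSD, one commonly used route is to export every PG and import it into a fresh OSD, then let backfill handle the rest. A sketch under assumptions: the rescue path /mnt/rescue, the target osd.5, and its journal /dev/sdg3 are made-up examples, not values from this cluster:

# with the broken osd.4 stopped
for pg in $(ceph-objectstore-tool --op list-pgs --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3); do
  ceph-objectstore-tool --op export --pgid "$pg" \
      --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3 \
      --file /mnt/rescue/$pg.export           # /mnt/rescue is a hypothetical destination
done

# then, per exported PG, into a freshly created (and stopped) OSD, e.g. osd.5:
ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /dev/sdg3 --file /mnt/rescue/10.7a.export

The point of going through export/import is that it relies only on code paths the tool already exercises above, instead of the OSD startup path that hits the assert.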





On 04/12/17 12:21, Ronny Aasen wrote:
ceph health detail
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
