Re: Ceph OSD crash starting up

Hello,

I was on an old version of Ceph, and it showed a warning saying:

crush map has straw_calc_version=0

I read that adjusting it would only trigger a full rebalance, so the admin should choose when to do it. So I went straight ahead and ran:


ceph osd crush tunables optimal


It rebalanced as expected, but then I started to see lots of PGs in a bad state. I discovered it was because of my OSD1. I thought it was a disk failure, so I added a new OSD6 and the system started to rebalance. In any case, the OSD was not starting.
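For what it's worth, I later learned you can inspect the tunables before touching them, and apparently set only the straw fix instead of jumping to the whole "optimal" profile (I have not verified the second command myself, so take it as a pointer, not a recipe):

# Show the current CRUSH tunables, including straw_calc_version
ceph osd crush show-tunables

# Reportedly a less disruptive alternative: change only the straw tunable
ceph osd crush set-tunable straw_calc_version 1
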

I considered wiping it completely, but I preferred to leave the disk as it was, with the journal intact, in case I can still recover data from it. (See the thread: "Scrub failing all the time, new inconsistencies keep appearing".)
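If it turns out the data is needed, my plan (untested, based on what I've read about ceph-objectstore-tool on Jewel/FileStore) would be to export PGs from the stopped OSD, for example:

# With osd.1 stopped, export one PG from its data directory
# (paths are my defaults; adjust to your layout)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --journal-path /var/lib/ceph/osd/ceph-1/journal \
    --op export --pgid 10.8c --file /root/pg-10.8c.export
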


So here's the information, but note that it already shows OSD1 replaced by OSD3, sorry.

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 1.00000  1.00000  926G  271G  654G 29.34 1.10 369
 2 1.00000  1.00000  460G  284G  176G 61.67 2.32 395
 4 1.00000  1.00000  465G  151G  313G 32.64 1.23 214
 3 1.36380  1.00000 1396G  239G 1157G 17.13 0.64 340
 6 0.90919  1.00000  931G  164G  766G 17.70 0.67 210
              TOTAL 4179G 1111G 3067G 26.60         
MIN/MAX VAR: 0.64/2.32  STDDEV: 16.99

As I said, I still have OSD1 intact, so I can do whatever you need except re-adding it to the cluster, since I don't know what that would do; it might cause havoc.
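For now I have only made sure it cannot take data again; as far as I understand these are the relevant commands (note the noin flag is cluster-wide, so I would unset it afterwards):

# Mark osd.1 out so no PGs map to it (it is down anyway)
ceph osd out 1

# Optionally stop any booting OSD from being marked in automatically
ceph osd set noin
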
Best regards,

On 14/09/17 17:12, David Turner wrote:
What do you mean by "updated crush map to 1"?  Can you please provide a copy of your crush map and `ceph osd df`?
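If you haven't extracted it before, the map can be pulled and decompiled to text like this (assuming crushtool is installed, which it normally is with the ceph packages):

# Dump the binary crush map and decompile it into readable text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
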

On Wed, Sep 13, 2017 at 6:39 AM Gonzalo Aguilar Delgado <gaguilar@xxxxxxxxxxxxxxxxxx> wrote:

Hi,

I recently updated crush map to 1 and did all the relocation of the PGs. At the end I found that one of the OSDs is not starting.

This is what it shows:


2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
 in thread 7f49cbe12700 thread_name:filestore_sync

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9616ee) [0xa93c6ef6ee]
 2: (()+0x11390) [0x7f49d9937390]
 3: (gsignal()+0x38) [0x7f49d78d3428]
 4: (abort()+0x16a) [0x7f49d78d502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0xa93c7ef43b]
 6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
 7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
 8: (()+0x76ba) [0x7f49d992d6ba]
 9: (clone()+0x6d) [0x7f49d79a53dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -3> 2017-09-13 10:37:34.253808 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282 ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0 pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.029683 0 0.000000
    -2> 2017-09-13 10:37:34.253848 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282 ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0 pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -1> 2017-09-13 10:37:34.255018 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.90(unlocked)] enter Initial
     0> 2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
 in thread 7f49cbe12700 thread_name:filestore_sync

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9616ee) [0xa93c6ef6ee]
 2: (()+0x11390) [0x7f49d9937390]
 3: (gsignal()+0x38) [0x7f49d78d3428]
 4: (abort()+0x16a) [0x7f49d78d502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0xa93c7ef43b]
 6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
 7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
 8: (()+0x76ba) [0x7f49d992d6ba]
 9: (clone()+0x6d) [0x7f49d79a53dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.1.log
--- end dump of recent events ---



Is there any way to recover it or should I open a bug?
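If more detail is needed for a bug report, I can run it in the foreground with verbose logging, something like this (my guess at the right debug levels for a FileStore crash):

# Run osd.1 in the foreground with verbose filestore/journal logging
ceph-osd -f -i 1 --debug-filestore 20 --debug-journal 20
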


Best regards


Attachment: crushmap-9028f4da-0d77-462b-be9b-dbdf7fa57771-20171014.crushmap

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
