Re: Fwd: OSD crashes after upgrade to 0.80.10

An update:

It seems that I am running out of memory. Even with 32 GB for 20 OSDs and 2 GB of swap, ceph-osd uses all available memory.
I created another 10 GB swap device and managed to get the failed OSD running without crashing, but it consumes an extra 5 GB.
Are there known memory issues with ceph-osd?
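
For a rough sense of scale, 32 GB shared by 20 OSDs leaves about 1.6 GB per daemon before swap. A minimal shell sketch for comparing that budget with actual usage on an OSD host follows. The function names per_osd_budget_gib and sum_rss_gib are made up for illustration, and the ps invocation in the comment assumes procps on Linux.

```shell
#!/bin/sh
# Rough per-daemon budget: total RAM in GiB divided by the number of OSDs.
per_osd_budget_gib() {
    awk -v total="$1" -v osds="$2" 'BEGIN { printf "%.2f\n", total / osds }'
}

# Sum resident set sizes given in KiB on stdin (one per line) and print GiB.
sum_rss_gib() {
    awk '{ sum += $1 } END { printf "%.2f\n", sum / 1048576 }'
}

# Figures taken from the message above: 32 GB of RAM, 20 OSDs.
per_osd_budget_gib 32 20

# On the OSD host itself (assumption: procps ps is available):
#   ps -C ceph-osd -o rss= | sum_rss_gib
```

If the summed resident memory is far above the budget while OSDs are recovering, that would be consistent with the swap pressure described above, although it does not by itself say where the memory goes.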

But I still have the problem of the incomplete+inactive PG.

Regards.

Gerd

On 12-08-2015 10:11, Gerd Jakobovitsch wrote:
I tried it; the error propagates to whichever OSD gets the errored PG.

For the moment, this is my worst problem. I have one PG that is incomplete+inactive, and the highest-priority OSD in it accumulates 100 blocked requests (I guess that is the maximum). Although it is running, it does not respond to other requests, for example ceph tell osd.21 injectargs '--osd-max-backfills 1'. After some time it crashes, and the blocked requests move to the second OSD for the errored PG. I can't get rid of these slow requests.
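
To see which PG is stuck and which OSDs it maps to, something like the following may help. It is only a sketch: incomplete_pgs is a hypothetical helper, and it assumes that the plain-text output of ceph pg dump_stuck puts the PG id in the first column, which has not been checked against 0.80.10. The <pgid> below is a placeholder, not a real id.

```shell
#!/bin/sh
# Print the first column (assumed to be the PG id) of every input line
# whose text mentions the "incomplete" state.
incomplete_pgs() {
    awk '/incomplete/ { print $1 }'
}

# On a node with an admin keyring:
#   ceph health detail
#   ceph pg dump_stuck inactive | incomplete_pgs
#   ceph pg <pgid> query     # replace <pgid> with an id printed above
```

The query output lists the OSDs the PG is waiting on, which is usually the quickest way to tell which crashed OSD still holds the data it needs.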

Suspecting a problem with leveldb, I checked and found the default version for Debian wheezy (0+20120530.gitdd0d562-1). I updated it to the wheezy-backports version (1.17-1~bpo70+1), but the error was the same.

I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:
It seems like a leveldb problem. Could you just kick it out and add a
new OSD to make the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch <gerd@xxxxxxxxxxxxx> wrote:
Dear all,

I run a Ceph cluster with 4 nodes and ~80 OSDs using xfs, currently at 75%
usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
since then several OSDs have crashed and never recovered: trying to start
one ends in a crash, as shown below.

Is this problem known? Is there any configuration that should be checked?
Any way to try to recover these OSDs without losing all data?

After setting the OSD to lost, I got one incomplete, inactive PG. Is
there any way to recover it? The data still exists on the crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location
{host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
/var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
 in thread 7f200fa8f780

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7f200efcd0a0]
 3: (gsignal()+0x35) [0x7f200db3f165]
 4: (abort()+0x180) [0x7f200db423e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
 6: (()+0x63996) [0x7f200e393996]
 7: (()+0x639c3) [0x7f200e3939c3]
 8: (()+0x63bee) [0x7f200e393bee]
 9: (tc_new()+0x48e) [0x7f200f213aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocator<char> const&)+0x59) [0x7f200e3ef999]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned
long)+0x28) [0x7f200e3f0708]
 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*,
unsigned long*)+0x180) [0x7f200f468360]
 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2)
[0x7f200f46adf2]
 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&,
leveldb::DB**)+0xff) [0x7f200f46b11f]
 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
 19: (FileStore::mount()+0x18e0) [0x9b7080]
 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
 21: (main()+0x2234) [0x7331c4]
 22: (__libc_start_main()+0xfd) [0x7f200db2bead]
 23: /usr/bin/ceph-osd() [0x736e99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

--- begin dump of recent events ---
   -66> 2015-08-11 12:58:21.277524 7f200fa8f780  5 asok(0x2800230)
register_command perfcounters_dump hook 0x27f0010
   -65> 2015-08-11 12:58:21.277552 7f200fa8f780  5 asok(0x2800230)
register_command 1 hook 0x27f0010
   -64> 2015-08-11 12:58:21.277556 7f200fa8f780  5 asok(0x2800230)
register_command perf dump hook 0x27f0010
   -63> 2015-08-11 12:58:21.277561 7f200fa8f780  5 asok(0x2800230)
register_command perfcounters_schema hook 0x27f0010
   -62> 2015-08-11 12:58:21.277564 7f200fa8f780  5 asok(0x2800230)
register_command 2 hook 0x27f0010
   -61> 2015-08-11 12:58:21.277566 7f200fa8f780  5 asok(0x2800230)
register_command perf schema hook 0x27f0010
   -60> 2015-08-11 12:58:21.277569 7f200fa8f780  5 asok(0x2800230)
register_command config show hook 0x27f0010
   -59> 2015-08-11 12:58:21.277573 7f200fa8f780  5 asok(0x2800230)
register_command config set hook 0x27f0010
   -58> 2015-08-11 12:58:21.277575 7f200fa8f780  5 asok(0x2800230)
register_command config get hook 0x27f0010
   -57> 2015-08-11 12:58:21.277578 7f200fa8f780  5 asok(0x2800230)
register_command log flush hook 0x27f0010
   -56> 2015-08-11 12:58:21.277581 7f200fa8f780  5 asok(0x2800230)
register_command log dump hook 0x27f0010
   -55> 2015-08-11 12:58:21.277583 7f200fa8f780  5 asok(0x2800230)
register_command log reopen hook 0x27f0010
   -54> 2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
   -53> 2015-08-11 12:58:21.345764 7f200fa8f780  1 -- 10.17.0.7:0/0 learned
my addr 10.17.0.7:0/0
   -52> 2015-08-11 12:58:21.345778 7f200fa8f780  1 accepter.accepter.bind
my_inst.addr is 10.17.0.7:6813/31918 need_addr=0
   -51> 2015-08-11 12:58:21.345792 7f200fa8f780  1 -- 10.18.0.7:0/0 learned
my addr 10.18.0.7:0/0
   -50> 2015-08-11 12:58:21.345795 7f200fa8f780  1 accepter.accepter.bind
my_inst.addr is 10.18.0.7:6808/31918 need_addr=0
   -49> 2015-08-11 12:58:21.345805 7f200fa8f780  1 -- 10.18.0.7:0/0 learned
my addr 10.18.0.7:0/0
   -48> 2015-08-11 12:58:21.345809 7f200fa8f780  1 accepter.accepter.bind
my_inst.addr is 10.18.0.7:6809/31918 need_addr=0
   -47> 2015-08-11 12:58:21.345827 7f200fa8f780  1 -- 10.17.0.7:0/0 learned
my addr 10.17.0.7:0/0
   -46> 2015-08-11 12:58:21.345830 7f200fa8f780  1 accepter.accepter.bind
my_inst.addr is 10.17.0.7:6824/31918 need_addr=0
   -45> 2015-08-11 12:58:21.345847 7f200fa8f780  1 -- 10.17.0.7:0/0 learned
my addr 10.17.0.7:0/0
   -44> 2015-08-11 12:58:21.345851 7f200fa8f780  1 accepter.accepter.bind
my_inst.addr is 10.17.0.7:6825/31918 need_addr=0
   -43> 2015-08-11 12:58:21.346156 7f200fa8f780  1 finished
global_init_daemonize
   -42> 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
   -41> 2015-08-11 12:58:21.348119 7f200fa8f780  5 asok(0x2800230) init
/var/run/ceph/ceph-osd.7.asok
   -40> 2015-08-11 12:58:21.348134 7f200fa8f780  5 asok(0x2800230)
bind_and_listen /var/run/ceph/ceph-osd.7.asok
   -39> 2015-08-11 12:58:21.348232 7f200fa8f780  5 asok(0x2800230)
register_command 0 hook 0x27ee0b0
   -38> 2015-08-11 12:58:21.348242 7f200fa8f780  5 asok(0x2800230)
register_command version hook 0x27ee0b0
   -37> 2015-08-11 12:58:21.348246 7f200fa8f780  5 asok(0x2800230)
register_command git_version hook 0x27ee0b0
   -36> 2015-08-11 12:58:21.348250 7f200fa8f780  5 asok(0x2800230)
register_command help hook 0x27f00b0
   -35> 2015-08-11 12:58:21.348254 7f200fa8f780  5 asok(0x2800230)
register_command get_command_descriptions hook 0x27f0150
   -34> 2015-08-11 12:58:21.348278 7f200b749700  5 asok(0x2800230) entry
start
   -33> 2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
   -32> 2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
   -31> 2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
   -30> 2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
   -29> 2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
   -28> 2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
   -27> 2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
   -26> 2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
   -25> 2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
   -24> 2015-08-11 12:58:24.458593 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
   -23> 2015-08-11 12:58:24.462824 7f200b749700  1 do_command 'config get'
'format:json var:fsid
   -22> 2015-08-11 12:58:24.462850 7f200b749700  1 do_command 'config get'
'format:json var:fsid result is 47 bytes
   -21> 2015-08-11 12:58:24.462853 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
   -20> 2015-08-11 12:58:24.463194 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
   -19> 2015-08-11 12:58:24.467886 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
   -18> 2015-08-11 12:58:34.118231 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
   -17> 2015-08-11 12:58:34.122484 7f200b749700  1 do_command 'config get'
'format:json var:fsid
   -16> 2015-08-11 12:58:34.122503 7f200b749700  1 do_command 'config get'
'format:json var:fsid result is 47 bytes
   -15> 2015-08-11 12:58:34.122506 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
   -14> 2015-08-11 12:58:34.122739 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
   -13> 2015-08-11 12:58:34.125503 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
   -12> 2015-08-11 12:58:44.136424 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
   -11> 2015-08-11 12:58:44.140286 7f200b749700  1 do_command 'config get'
'format:json var:fsid
   -10> 2015-08-11 12:58:44.140304 7f200b749700  1 do_command 'config get'
'format:json var:fsid result is 47 bytes
    -9> 2015-08-11 12:58:44.140309 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
    -8> 2015-08-11 12:58:44.140530 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
    -7> 2015-08-11 12:58:44.143236 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
    -6> 2015-08-11 12:58:54.493800 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
    -5> 2015-08-11 12:58:54.497564 7f200b749700  1 do_command 'config get'
'format:json var:fsid
    -4> 2015-08-11 12:58:54.497586 7f200b749700  1 do_command 'config get'
'format:json var:fsid result is 47 bytes
    -3> 2015-08-11 12:58:54.497591 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
    -2> 2015-08-11 12:58:54.497905 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned
1164 bytes
    -1> 2015-08-11 12:58:54.500762 7f200b749700  5 asok(0x2800230)
AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
     0> 2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal
(Aborted) **
 in thread 7f200fa8f780

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7f200efcd0a0]
 3: (gsignal()+0x35) [0x7f200db3f165]
 4: (abort()+0x180) [0x7f200db423e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
 6: (()+0x63996) [0x7f200e393996]
 7: (()+0x639c3) [0x7f200e3939c3]
 8: (()+0x63bee) [0x7f200e393bee]
 9: (tc_new()+0x48e) [0x7f200f213aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocator<char> const&)+0x59) [0x7f200e3ef999]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned
long)+0x28) [0x7f200e3f0708]
 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*,
unsigned long*)+0x180) [0x7f200f468360]
 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2)
[0x7f200f46adf2]
 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&,
leveldb::DB**)+0xff) [0x7f200f46b11f]
 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
 19: (FileStore::mount()+0x18e0) [0x9b7080]
 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
 21: (main()+0x2234) [0x7331c4]
 22: (__libc_start_main()+0xfd) [0x7f200db2bead]
 23: /usr/bin/ceph-osd() [0x736e99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
  20/20 filestore
   1/ 3 keyvaluestore
  20/20 journal
   0/ 5 ms
   1/ 5 mon
   5/20 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
  20/20 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.7.log
--- end dump of recent events ---
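
The backtrace above ends in tc_new() followed by the C++ terminate handler and abort(), which looks like an uncaught allocation failure, reached while leveldb replays its log during FileStore::mount(). A small sketch for pulling the leveldb frames out of an OSD log is shown here; leveldb_frames is a hypothetical helper name, and the log path in the comment is the one reported in the dump.

```shell
#!/bin/sh
# Print backtrace frames (lines of the form " N: ...") that mention a
# symbol in the leveldb:: namespace. Reads the log on stdin.
leveldb_frames() {
    awk '/^ *[0-9]+: .*leveldb::/ { print }'
}

# On the OSD host:
#   leveldb_frames < /var/log/ceph/ceph-osd.7.log
```

Frames that wrap onto a second line in the log are matched by their first line only, so the addresses may be cut off.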

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


