Re: osd stops

Hi Martin,

On Tue, 2011-04-12 at 18:26 +0200, Martin Wilderoth wrote:
> I have done some tests and it seems I always get the same problem.
> I have been transferring data and suddenly I get an I/O error and a superblock problem.
> This occurs when the filesystem is filled to approx. 80%.
> 
> ceph health reports no error. I restart the system (-a stop, -a start);
> after that the system is degraded and the osd stops.
> 
> The log of the first failing osd shows:
> 
> 2011-04-12 17:51:07.716513 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).fault first fault
> 2011-04-12 17:51:07.716868 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6802/15976 not 10.0.6.12:6802/13633 - wrong node!
> os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700'
> os/FileStore.cc: 2674: FAILED assert(r == 0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
> *** Caught signal (Aborted) **
>  in thread 0x7f023f9ce700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f0246033f60]
>  3: (gsignal()+0x35) [0x7f0244c23165]
>  4: (abort()+0x180) [0x7f0244c25f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f02454b6dc5]
>  6: (()+0xcb166) [0x7f02454b5166]
>  7: (()+0xcb193) [0x7f02454b5193]
>  8: (()+0xcb28e) [0x7f02454b528e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (FileStore::sync_entry()+0x1975) [0x59f165]
>  11: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  12: (()+0x68ba) [0x7f024602b8ba]
>  13: (clone()+0x6d) [0x7f0244cc002d]
> 
> the second failing osd
> 
> 2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds.
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 2011-04-12 18:03:36.036494 1: (SafeTimer::timer_thread()+0x36b) [0x601afb]
> 2011-04-12 18:03:36.036509 2: (SafeTimerThread::entry()+0xd) [0x6042cd]
> 2011-04-12 18:03:36.036528 3: (()+0x68ba) [0x7f39d034a8ba]
> 2011-04-12 18:03:36.036541 4: (clone()+0x6d) [0x7f39cefdf02d]
> 2011-04-12 18:03:36.036551 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7f39c6ce7700'
> os/FileStore.cc: 2573: FAILED assert(0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
> *** Caught signal (Aborted) **
>  in thread 0x7f39c6ce7700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f39d0352f60]
>  3: (gsignal()+0x35) [0x7f39cef42165]
>  4: (abort()+0x180) [0x7f39cef44f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f39cf7d5dc5]
>  6: (()+0xcb166) [0x7f39cf7d4166]
>  7: (()+0xcb193) [0x7f39cf7d4193]
>  8: (()+0xcb28e) [0x7f39cf7d428e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  11: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  12: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  13: (()+0x68ba) [0x7f39d034a8ba]
>  14: (clone()+0x6d) [0x7f39cefdf02d]

This looks to me like a disk I/O problem: the OSD can't commit its data
fast enough (note the sync_entry timeout after 600 seconds) and exits.

Does "dmesg" show any disk errors?
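
If the dmesg output is long, a small filter can help narrow it down. The
following is only a minimal sketch, assuming Python is available on the OSD
host. The name find_disk_errors and the pattern list are my own illustration
(nothing from Ceph), and the patterns are certainly not exhaustive:

```python
import re

# Assumed, non-exhaustive patterns for kernel messages that often
# accompany disk or filesystem trouble. Adjust for your hardware.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"medium error",
    r"ata\d+(\.\d+)?: (exception|failed command)",
    r"(EXT[234]|btrfs|XFS).*error",
]
_DISK_ERROR_RE = re.compile("|".join(DISK_ERROR_PATTERNS), re.IGNORECASE)


def find_disk_errors(lines):
    """Return the lines that match any of the assumed disk-error patterns."""
    return [line for line in lines if _DISK_ERROR_RE.search(line)]
```

You could save the output with "dmesg > dmesg.txt" and pass the lines of that
file to find_disk_errors(). An empty result does not prove the disk is
healthy; it only means none of these particular patterns showed up.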

Wido

> 
> regards Martin
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

