OSD service won't stay running - pg incomplete


 



After restarting several OSD daemons in our Ceph cluster a couple of days ago, a couple of our OSDs won't come online. The services start and then crash with the error below. One PG is marked incomplete and will not peer. The pool is erasure coded (2+1), currently set to size=3, min_size=2. The incomplete PG reports that it is not peering due to:

 

"comment": "not enough complete instances of this PG"

and:

"down_osds_we_would_probe": [
    7,
    16
],

OSD 7 is completely lost (the drive is dead), and OSD 16 will not come online (see the log output below).
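
For anyone who wants to pull the same two fields out of a saved PG query rather than reading the JSON by eye, here is a minimal Python sketch. It assumes the query output is JSON with a top-level `recovery_state` list whose entries may carry `comment` and `down_osds_we_would_probe`; the function name `peering_blockers` and the trimmed `SAMPLE` document (including its `name` values) are hypothetical stand-ins, not the full output from this cluster.

```python
import json


def peering_blockers(pg_query):
    """Collect comments and down OSDs listed in a PG query's recovery_state.

    Assumes pg_query is the parsed JSON of a PG query; any missing keys
    are simply treated as empty.
    """
    comments = []
    down_osds = set()
    for state in pg_query.get("recovery_state", []):
        if "comment" in state:
            comments.append(state["comment"])
        down_osds.update(state.get("down_osds_we_would_probe", []))
    return comments, sorted(down_osds)


# Hypothetical, heavily trimmed sample that only mirrors the fields quoted
# in this post; a real query contains many more keys.
SAMPLE = json.loads("""
{
  "state": "incomplete",
  "recovery_state": [
    {"name": "Started/Primary/Peering/Incomplete",
     "comment": "not enough complete instances of this PG"},
    {"name": "Started/Primary/Peering",
     "down_osds_we_would_probe": [7, 16]}
  ]
}
""")

comments, down_osds = peering_blockers(SAMPLE)
print("comments:", comments)
print("down OSDs the PG would probe:", down_osds)
```

The sketch only reads and summarizes; it does not touch the cluster.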

 

We've spent several days searching the ceph-users list archives and tweaking OSD conf settings, to no avail. We're reaching out here as a last-ditch effort before we have to give up on the PG.

 

tcmalloc: large alloc 1073741824 bytes == 0x560ada35c000 @  0x7f5c1081e4ef 0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

tcmalloc: large alloc 2147483648 bytes == 0x560b1a35c000 @  0x7f5c1081e4ef 0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

tcmalloc: large alloc 4294967296 bytes == 0x560b9a35c000 @  0x7f5c1081e4ef 0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

tcmalloc: large alloc 3840745472 bytes == 0x560a9a334000 @  0x7f5c1081e4ef 0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e945c76 0x7f5c0e94623e 0x560a8fdea280 0x560a8fda8f36 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

tcmalloc: large alloc 2728992768 bytes == 0x560e779ee000 @  0x7f5c1081e4ef 0x7f5c1083f010 0x560a8faa5674 0x560a8faa7125 0x560a8fa835a7 0x560a8fa5aa3c 0x560a8fa5c238 0x560a8fa77dcc 0x560a8fe439ef 0x560a8fe43c03 0x560a8fe5acd4 0x560a8fda75ec 0x560a8fda9260 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED assert(0 == "unexpected aio error")

2019-03-13 12:46:39.632132 7f5c0a749700 -1 bdev(0x560a99c05000 /var/lib/ceph/osd/ceph-16/block) aio to 4817558700032~2728988672 but returned: 2147479552

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2019-03-13 12:46:39.633822 7f5c0a749700 -1 /builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED assert(0 == "unexpected aio error")

 

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

    -1> 2019-03-13 12:46:39.632132 7f5c0a749700 -1 bdev(0x560a99c05000 /var/lib/ceph/osd/ceph-16/block) aio to 4817558700032~2728988672 but returned: 2147479552

     0> 2019-03-13 12:46:39.633822 7f5c0a749700 -1 /builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED assert(0 == "unexpected aio error")

 

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

*** Caught signal (Aborted) **

 in thread 7f5c0a749700 thread_name:bstore_aio

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

2019-03-13 12:46:39.635955 7f5c0a749700 -1 *** Caught signal (Aborted) **

 in thread 7f5c0a749700 thread_name:bstore_aio

 

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

     0> 2019-03-13 12:46:39.635955 7f5c0a749700 -1 *** Caught signal (Aborted) **

 in thread 7f5c0a749700 thread_name:bstore_aio

 

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

Aborted
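
One detail in the bdev line above can be checked with simple arithmetic. The returned byte count, 2147479552, equals 0x7ffff000, which the Linux read(2) man page gives as the most a single read or write call will transfer. The small Python sketch below parses that log line and compares the numbers. It assumes the first two numbers are `offset~length`; the helper name `parse_short_aio` and the result keys are hypothetical. This is only an observation about the numbers, not a confirmed diagnosis of why the OSD issues such a large read.

```python
import re

# Most bytes a single Linux read/write style call transfers (0x7ffff000),
# per the read(2) man page.
LINUX_MAX_RW_COUNT = 0x7FFFF000

# Matches the "aio to <offset>~<length> but returned: <n>" part of the line.
BDEV_LINE = re.compile(r"aio to (\d+)~(\d+) but returned: (-?\d+)")


def parse_short_aio(line):
    """Return the offset, length and returned size of a short aio, or None."""
    match = BDEV_LINE.search(line)
    if match is None:
        return None
    offset, length, returned = (int(group) for group in match.groups())
    return {
        "offset": offset,
        "length": length,
        "returned": returned,
        "missing": length - returned,
        "length_over_cap": length > LINUX_MAX_RW_COUNT,
        "returned_is_cap": returned == LINUX_MAX_RW_COUNT,
    }


line = ("2019-03-13 12:46:39.632132 7f5c0a749700 -1 bdev(0x560a99c05000 "
        "/var/lib/ceph/osd/ceph-16/block) aio to 4817558700032~2728988672 "
        "but returned: 2147479552")

info = parse_short_aio(line)
for key, value in info.items():
    print(f"{key}: {value}")
```

For the line in this post, the requested length (2728988672 bytes, roughly 2.5 GiB) is above that cap and the returned count matches it exactly, leaving 581509120 bytes unread. That is consistent with a single oversized read being cut short by the kernel.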

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
