Re: Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

nokia ceph <nokiacephusers@xxxxxxxxx> · Thu, 8 Jun 2017 16:38:36 +0530

Hello Mark,
I raised a tracker for the issue: http://tracker.ceph.com/issues/20222

Jake, can you share the restart_OSD_and_log-this.sh script?

Thanks
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
Hi Mark & List,

Unfortunately, even when using yesterday's master version of ceph,
I'm still seeing OSDs go down, with the same error as before:

OSD log shows lots of entries like this:

(osd38)

2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60

(osd3)

2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check: no reply from 10.1.0.86:6811 osd.2 since back 2017-06-07 17:00:19.640002 front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
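
For anyone wanting to quantify how often these entries appear, here is a minimal Python sketch that tallies "had timed out" heartbeat_map messages in an OSD log. It is only an illustration and not part of any Ceph tooling: the function name count_timeouts and the regular expression are my own assumptions, based solely on the message format shown above.

```python
import re
from collections import Counter

# Matches the quoted thread description and the timeout value, e.g.
#   'tp_osd_tp thread tp_osd_tp' had timed out after 60
# The pattern is an assumption derived from the log lines in this thread.
TIMEOUT_RE = re.compile(r"'(?P<thread>[^']+)' had timed out after (?P<secs>\d+)")


def count_timeouts(lines):
    """Count heartbeat timeout messages per thread description."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("thread")] += 1
    return counts


# Example use (the log path is an assumption; adjust for your cluster):
#   with open("/var/log/ceph/ceph-osd.38.log") as log:
#       print(count_timeouts(log))
```

Running this against each OSD's log before and after an upgrade would give a rough way to compare how frequently the timeout fires.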

[root@ceph4 ceph]# ceph -v
ceph version 12.0.2-2399-ge38ca14 (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)

I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
workaround...
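
Until the actual script is shared, the following is a rough Python sketch of what a watchdog of this kind might look like. It is not the restart_OSD_and_log-this.sh script itself. The function names are hypothetical, and it assumes that `ceph osd dump --format json` is available on the host and that OSDs run as systemd units named ceph-osd@<id>. Since systemctl only acts on local units, it would have to run on each OSD host, and restart attempts for OSDs that live on other hosts will simply fail.

```python
import json
import subprocess
import time


def down_osd_ids(osd_dump_json):
    """Return the ids of OSDs that the OSD map reports as down."""
    dump = json.loads(osd_dump_json)
    return sorted(osd["osd"] for osd in dump["osds"] if osd["up"] == 0)


def restart_down_osds(interval=1.0):
    """Poll the OSD map every `interval` seconds and restart down OSDs.

    Assumption: each OSD is managed locally as the systemd unit
    ceph-osd@<id>; restarts for OSDs on other hosts will fail harmlessly.
    """
    while True:
        dump = subprocess.run(
            ["ceph", "osd", "dump", "--format", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        for osd_id in down_osd_ids(dump):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} restarting osd.{osd_id}", flush=True)
            subprocess.run(
                ["systemctl", "restart", f"ceph-osd@{osd_id}"], check=False
            )
        time.sleep(interval)


# To run the watchdog, call restart_down_osds() and redirect stdout to a file
# to keep a record of which OSDs were restarted and when.
```

A workaround like this only masks the problem, so the timestamps it logs are mainly useful for correlating restarts with the OSD log entries.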

thanks again for your help,

Jake

On 06/06/17 15:52, Jake Grimmett wrote:

> Hi Mark,

>

> OK, I'll upgrade to the current master and retest...

>

> best,

>

> Jake

>

> On 06/06/17 15:46, Mark Nelson wrote:

>> Hi Jake,

>>

>> I just happened to notice this was on 12.0.3.  Would it be possible to

>> test this out with current master and see if it still is a problem?

>>

>> Mark

>>

>> On 06/06/2017 09:10 AM, Mark Nelson wrote:

>>> Hi Jake,

>>>

>>> Thanks much.  I'm guessing at this point this is probably a bug.  Would

>>> you (or nokiauser) mind creating a bug in the tracker with a short

>>> description of what's going on and the collectl sample showing this is

>>> not IOs backing up on the disk?

>>>

>>> If you want to try it, we have a gdb based wallclock profiler that might

>>> be interesting to run while it's in the process of timing out.  It tries

>>> to grab 2000 samples from the osd process which typically takes about 10

>>> minutes or so.  You'll need to either change the number of samples to be

>>> lower in the python code (maybe like 50-100), or change the timeout to

>>> be something longer.

>>>

>>> You can find the code here:

>>>

>>> https://github.com/markhpc/gdbprof

>>>

>>> and invoke it like:

>>>

>>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source ./gdbprof.py' -ex 'profile begin' -ex 'quit'

>>>

>>> where 27962 in this case is the PID of the ceph-osd process.  You'll

>>> need gdb with the python bindings and the ceph debug symbols for it to

>>> work.

>>>

>>> This might tell us over time if the tp_osd_tp processes are just sitting

>>> on pg::locks.

>>>

>>> Mark

>>>

>>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:

>>>> Hi Mark,

>>>>

>>>> Thanks again for looking into this problem.

>>>>

>>>> I ran the cluster overnight, with a script checking for dead OSDs every

>>>> second, and restarting them.

>>>>

>>>> 40 OSD failures occurred in 12 hours, some OSDs failed multiple times,

>>>> (there are 50 OSDs in the EC tier).

>>>>

>>>> Unfortunately, the output of collectl doesn't appear to show any

>>>> increase in disk queue depth and service times before the OSDs die.

>>>>

>>>> I've put a couple of examples of collectl output for the disks

>>>> associated with the OSDs here:

>>>>

>>>> https://hastebin.com/icuvotemot.scala

>>>>

>>>> please let me know if you need more info...

>>>>

>>>> best regards,

>>>>

>>>> Jake

>>>>

>>>>

>

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
