Re: Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

Thanks Jake, can you confirm which ceph version you were testing when you noticed the out-of-memory errors? There is already a memory leak issue reported in kraken v11.2.0, which is addressed in this tracker: http://tracker.ceph.com/issues/18924

#ceph -v 

OK, so you are mounting/mapping Ceph as an RBD and writing into it.

We are discussing the luminous v12.0.3 issue here, so I think we are all on the same page.

Thanks
Jayaram


On Thu, Jun 8, 2017 at 8:13 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
Hi Mark / Jayaram,

After running the cluster last night, I noticed lots of
"Out Of Memory" errors in /var/log/messages, many of which correlate with
dead OSDs. If this is the problem, it may be another case of the high
memory use issues reported in Kraken.

e.g. my script logs:
Thu 8 Jun 08:26:37 BST 2017  restart OSD  1

and /var/log/messages states...

Jun  8 08:26:35 ceph1 kernel: Out of memory: Kill process 7899
(ceph-osd) score 113 or sacrifice child
Jun  8 08:26:35 ceph1 kernel: Killed process 7899 (ceph-osd)
total-vm:8569516kB, anon-rss:7518836kB, file-rss:0kB, shmem-rss:0kB
Jun  8 08:26:36 ceph1 systemd: ceph-osd@1.service: main process exited,
code=killed, status=9/KILL
Jun  8 08:26:36 ceph1 systemd: Unit ceph-osd@1.service entered failed state.

The OSD nodes have 64GB RAM, which is presumably enough for 10 OSDs
doing 4+1 EC?
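(As a rough sanity check against the OOM log above: each ceph-osd was
holding ~7GB of anon-rss when it was killed, and ten of those is ~70GB,
which would indeed not fit in 64GB.)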

I've added "bluestore_cache_size = 104857600" to ceph.conf and am
retesting; I will see if the OSD problems recur, and report back.
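
For reference, the setting sits in my ceph.conf roughly like this
(104857600 bytes = 100MB; whether it lives under [osd] or [global] is
just a placement choice):

[osd]
# cap the bluestore cache at ~100MB per OSD while testing
bluestore_cache_size = 104857600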

As for loading the cluster, I run an rsync job on each node, pulling data
from an NFS-mounted Isilon. A single node pulls ~200MB/s; with all 7
nodes running, ceph -w reports between 700 and 1500MB/s of writes.
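
Each node's copy job is just a plain rsync along these lines (the paths
below are placeholders, not the real share names):

# pull one NFS-mounted Isilon share into the ceph-backed mount on this node
rsync -av /mnt/isilon/share1/ /mnt/ceph/share1/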

As requested, here is my "restart_OSD_and_log-this.sh" script:

************************************************************************
#!/bin/bash
# Catch failed OSDs: log each failure, then restart the OSD on its host.
while : ; do
        # IDs of any OSDs currently marked "down" in the CRUSH tree
        DOWN=$(ceph osd tree 2> /dev/null | grep down | \
                awk '{ print $3 }' | awk -F "." '{ print $2 }')
        if [ -n "$DOWN" ] ; then
                for OSD in $DOWN ; do
                        DATE=$(date)
                        echo "$DATE  restart OSD  $OSD" >> /root/osd_restart_log
                        echo "OSD $OSD is down, restarting.."
                        # find which node hosts this OSD, then restart it over ssh
                        OSDHOST=$(ceph osd find "$OSD" | grep host | awk -F '"' '{ print $4 }')
                        ssh "$OSDHOST" systemctl restart "ceph-osd@$OSD"
                done
                sleep 30
        else
                # \033[K (capital K) erases to the end of the line
                echo -ne "\r\033[K"
                echo -ne "all OSDs OK"
        fi
        sleep 1
done
************************************************************************

thanks again,

Jake

On 08/06/17 12:08, nokia ceph wrote:
> Hello Mark,
>
> Raised a tracker for the issue -- http://tracker.ceph.com/issues/20222
>
> Jake, can you share the restart_OSD_and_log-this.sh script?
>
> Thanks
> Jayaram
>
> On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>
>     Hi Mark & List,
>
>     Unfortunately, even when using yesterday's master version of ceph,
>     I'm still seeing OSDs go down, with the same error as before:
>
>     OSD log shows lots of entries like this:
>
>     (osd38)
>     2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
>     'tp_osd_tp thread tp_osd_tp' had timed out after 60
>
>     (osd3)
>     2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
>     'tp_osd_tp thread tp_osd_tp' had timed out after 60
>     2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
>     no reply from 10.1.0.86:6811 osd.2 since
>     back 2017-06-07 17:00:19.640002
>     front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
>
>
>     [root@ceph4 ceph]# ceph -v
>     ceph version 12.0.2-2399-ge38ca14
>     (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
>
>
>     I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
>     workaround...
>
>     thanks again for your help,
>
>     Jake
>
>     On 06/06/17 15:52, Jake Grimmett wrote:
>     > Hi Mark,
>     >
>     > OK, I'll upgrade to the current master and retest...
>     >
>     > best,
>     >
>     > Jake
>     >
>     > On 06/06/17 15:46, Mark Nelson wrote:
>     >> Hi Jake,
>     >>
>     >> I just happened to notice this was on 12.0.3.  Would it be
>     possible to
>     >> test this out with current master and see if it still is a problem?
>     >>
>     >> Mark
>     >>
>     >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
>     >>> Hi Jake,
>     >>>
>     >>> Thanks much.  I'm guessing at this point this is probably a
>     bug.  Would
>     >>> you (or nokiauser) mind creating a bug in the tracker with a short
>     >>> description of what's going on and the collectl sample showing
>     this is
>     >>> not IOs backing up on the disk?
>     >>>
>     >>> If you want to try it, we have a gdb based wallclock profiler
>     that might
>     >>> be interesting to run while it's in the process of timing out.
>     It tries
>     >>> to grab 2000 samples from the osd process which typically takes
>     about 10
>     >>> minutes or so.  You'll need to either change the number of
>     samples to be
>     >>> lower in the python code (maybe like 50-100), or change the
>     timeout to
>     >>> be something longer.
>     >>>
>     >>> You can find the code here:
>     >>>
>     >>> https://github.com/markhpc/gdbprof
>     >>>
>     >>> and invoke it like:
>     >>>
>     >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
>     >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
>     >>>
>     >>> where 27962 in this case is the PID of the ceph-osd process.  You'll
>     >>> need gdb with the python bindings and the ceph debug symbols for
>     it to
>     >>> work.
>     >>>
>     >>> This might tell us over time if the tp_osd_tp processes are just
>     sitting
>     >>> on pg::locks.
>     >>>
>     >>> Mark
>     >>>
>     >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
>     >>>> Hi Mark,
>     >>>>
>     >>>> Thanks again for looking into this problem.
>     >>>>
>     >>>> I ran the cluster overnight, with a script checking for dead
>     OSDs every
>     >>>> second, and restarting them.
>     >>>>
>     >>>> 40 OSD failures occurred in 12 hours, some OSDs failed multiple
>     times,
>     >>>> (there are 50 OSDs in the EC tier).
>     >>>>
>     >>>> Unfortunately, the output of collectl doesn't appear to show any
>     >>>> increase in disk queue depth and service times before the OSDs die.
>     >>>>
>     >>>> I've put a couple of examples of collectl output for the disks
>     >>>> associated with the OSDs here:
>     >>>>
>     >>>> https://hastebin.com/icuvotemot.scala
>     >>>>
>     >>>> please let me know if you need more info...
>     >>>>
>     >>>> best regards,
>     >>>>
>     >>>> Jake
>     >>>>
>     >>>>
>     >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
