Greg,
I didn't do anything in that logging period, and no clients were connected to the cluster. That's the log generated from the "starting deep scrub" moment to the "wrongly marked out" one.
Yesterday I tried to upgrade the osd to 0.57, but nothing changed.
So I deleted the whole osd.0 to force a rebuild from scratch: I waited for the cluster to reach a full "active+clean" state (I mean just before the "starting deep scrub" moment), stopped all the osds, ran mkfs.xfs on the disk and did all the steps to add the osd back to the cluster, then restarted all the osds (a sketch of the procedure is below).
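For reference, this is roughly the procedure I followed (from memory, so take it as a sketch; the crush weight and host name below are placeholders and the exact syntax may differ a bit between 0.56 and 0.57):

# stop the daemon and remove osd.0 from the cluster
service ceph stop osd.0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# wipe and remount the data disk
umount /var/lib/ceph/osd/ceph-0
mkfs.xfs -f /dev/sda1
mount /dev/sda1 /var/lib/ceph/osd/ceph-0

# recreate the osd and add it back
ceph osd create                      # should hand back id 0
ceph-osd -i 0 --mkfs --mkkey
ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
ceph osd crush set 0 osd.0 1.0 pool=default host=myhost   # placeholder weight/host; syntax varies by version
service ceph start osd.0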
The cluster started to repopulate the device and I thought I was on the right track, but... this morning I found the ceph-osd process crashed and the device *full* (note that the other devices are only about 70% full):
/dev/sda1 1.9T 1.8T 11G 100% /var/lib/ceph/osd/ceph-0
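If it helps, I was thinking of checking which PG directories are actually eating the space with something like this (just a sketch of what I plan to run):

# show the largest directories under the osd's data store
du -sh /var/lib/ceph/osd/ceph-0/current/* | sort -h | tail -20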
The log of the crash is here: https://docs.google.com/file/d/0B1lZcgrNMBAJaEpWT2hLemRwNEE/edit?usp=sharing
Then, I increased the log verbosity and restarted the osd again. The log is here: https://docs.google.com/file/d/0B1lZcgrNMBAJSU1Nc0NSMjdnYU0/edit?usp=sharing
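(For the record, these are the debug settings I used, the same ones as in my previous mail, in the [osd] section of ceph.conf:)

debug osd = 20
debug ms = 1
debug filestore = 20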
I immediately noticed that the used space had dropped to 64%:
/dev/sda1 1.9T 1.2T 673G 64% /var/lib/ceph/osd/ceph-0
So the osd is still getting full [of what?] and after 40 minutes it starts to log only this line every 5 seconds:
2013-03-06 15:15:47.311447 7f0d55185700 0 -- 192.168.21.134:6808/16648 >> 192.168.21.134:6828/20837 pipe(0x18252780 sd=24 :46533 s=1 pgs=4821 cs=4 l=0).connect claims to be 192.168.21.134:6828/20733 not 192.168.21.134:6828/20837 - wrong node!
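In case it helps, I can also check which daemon instance the cluster map currently lists for that address, and which ceph-osd processes are actually running on the node (just a sketch of what I plan to run; I'm not sure this is the right way to diagnose the "wrong node!" messages):

# which osd does the osdmap report at that address?
ceph osd dump | grep '192.168.21.134:6828'

# which ceph-osd processes are actually running, with their pids
ps ax | grep [c]eph-osd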
2013/3/6 Greg Farnum <greg@xxxxxxxxxxx>
Ah, this is interesting — the ceph-osd processes are using up the time, not the filesystem or something. However, I don't see any reason for that in a brief look at the OSD log here — can you describe what you did to the OSD during that logging period? (In particular I see a lot of pg_log messages, but not the sub op messages that would be associated with this OSD doing a deep scrub, nor the internal heartbeat timeouts that the other OSDs were generating.)

On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> Hi,
> I've collected a osd log with these parameters:
>
> debug osd = 20
> debug ms = 1
> debug filestore = 20
>
> You can download it from here:
> https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
>
> I have also captured a video to show the behavior in realtime: http://youtu.be/708AI8PGy7k
>
-Greg