Re: OSD goes up and down - osd suicide?

Greg,
I didn't do anything during that logging period, and no clients were connected to the cluster. That's the log generated from the "starting deep scrub" moment up to the "wrongly marked out" one.

Yesterday I tried upgrading the osd to 0.57, but nothing changed.

So I deleted the whole osd.0 to force a rebuild from scratch: I waited for the cluster to reach a full "active+clean" state (that is, just before the deep scrub starts), stopped all the osds, ran mkfs.xfs, went through all the steps to add the osd back to the cluster (sketched below), and then restarted all the osds.
The cluster started to repopulate the device and I thought I was on the right track, but... this morning I found the ceph-osd process crashed and the device *full* (note that the other devices are only about 70% full):
/dev/sda1       1.9T  1.8T   11G 100% /var/lib/ceph/osd/ceph-0
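
For the record, the rebuild was essentially the standard remove-and-recreate sequence, roughly this (from memory, so the exact syntax may differ between releases):

# remove the old osd.0 completely
ceph osd out 0
service ceph stop osd.0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# recreate it from scratch
ceph osd create                  # hands back the freed id, here 0
mkfs.xfs -f /dev/sda1
mount /dev/sda1 /var/lib/ceph/osd/ceph-0
ceph-osd -i 0 --mkfs --mkkey
ceph auth add osd.0 osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-0/keyring
# re-add it to the CRUSH map with the old weight/location, then:
service ceph start osd.0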
Then I increased the log verbosity (more on that below) and restarted the osd again. The log is here: https://docs.google.com/file/d/0B1lZcgrNMBAJSU1Nc0NSMjdnYU0/edit?usp=sharing
I immediately noticed that the used space had dropped to 64%:
/dev/sda1       1.9T  1.2T  673G  64% /var/lib/ceph/osd/ceph-0
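
By the way, the same debug levels can also be injected at runtime, without a restart (syntax from memory, so double-check it):

ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'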

So the osd is still getting full (of what?), and after 40 minutes it starts to log only this line every 5 seconds (the two addresses differ only in the trailing nonce, which looks like a pid, so one side still seems to be talking to a pre-restart instance of that peer):
2013-03-06 15:15:47.311447 7f0d55185700  0 -- 192.168.21.134:6808/16648 >> 192.168.21.134:6828/20837 pipe(0x18252780 sd=24 :46533 s=1 pgs=4821 cs=4 l=0).connect claims to be 192.168.21.134:6828/20733 not 192.168.21.134:6828/20837 - wrong node!
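
To figure out what is actually eating the space, my next step is something like this on the osd's data directory (plain du on the FileStore layout, paths from my setup):

# biggest PG directories under the osd data dir
du -sh /var/lib/ceph/osd/ceph-0/current/* | sort -h | tail -20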

Hope this helps


2013/3/6 Greg Farnum <greg@xxxxxxxxxxx>
On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> Hi,
> I've collected a osd log with these parameters:
>
> debug osd = 20
> debug ms = 1
> debug filestore = 20
>
> You can download it from here:
> https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
>
> I have also captured a video to show the behavior in realtime: http://youtu.be/708AI8PGy7k
>
Ah, this is interesting — the ceph-osd processes are using up the time, not the filesystem or something. However, I don't see any reason for that in a brief look at the OSD log here — can you describe what you did to the OSD during that logging period? (In particular I see a lot of pg_log messages, but not the sub op messages that would be associated with this OSD doing a deep scrub, nor the internal heartbeat timeouts that the other OSDs were generating.)
-Greg


