Greg,
I didn't do anything in that logging period, and no clients were connected to the cluster. That's the log generated from the "starting deep scrub" moment to the "wrongly marked out" one.
Yesterday I tried to upgrade the osd to 0.57, but nothing changed.
So I deleted the whole osd.0 to force a rebuild from scratch: I waited for the cluster to reach a full "active+clean" state (I mean just before the "starting deep scrub" moment), stopped all the osds, ran mkfs.xfs on the disk and did all the steps to add the osd back to the cluster, then restarted all the osds (a sketch of the procedure is below).
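For reference, this is roughly the procedure I followed (from memory, so take it as a sketch; the crush weight and host name below are placeholders and the exact syntax may differ a bit between 0.56 and 0.57):

# stop the daemon and remove osd.0 from the cluster
service ceph stop osd.0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# wipe and remount the data disk
umount /var/lib/ceph/osd/ceph-0
mkfs.xfs -f /dev/sda1
mount /dev/sda1 /var/lib/ceph/osd/ceph-0

# recreate the osd and add it back
ceph osd create                      # should hand back id 0
ceph-osd -i 0 --mkfs --mkkey
ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
ceph osd crush set 0 osd.0 1.0 pool=default host=myhost   # placeholder weight/host; syntax varies by version
service ceph start osd.0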
The cluster started to repopulate the device and I thought I was on the right track, but... this morning I found the ceph-osd process crashed and the device *full* (note that the other devices are only about 70% full):
/dev/sda1 1.9T 1.8T 11G 100% /var/lib/ceph/osd/ceph-0
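If it helps, I was thinking of checking which PG directories are actually eating the space with something like this (just a sketch of what I plan to run):

# show the largest directories under the osd's data store
du -sh /var/lib/ceph/osd/ceph-0/current/* | sort -h | tail -20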
The log of the crash is here: https://docs.google.com/file/d/0B1lZcgrNMBAJaEpWT2hLemRwNEE/edit?usp=sharing
Then, I increased the log verbosity and restarted the osd again. The log is here: https://docs.google.com/file/d/0B1lZcgrNMBAJSU1Nc0NSMjdnYU0/edit?usp=sharing
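(For the record, these are the debug settings I used, the same ones as in my previous mail, in the [osd] section of ceph.conf:)

debug osd = 20
debug ms = 1
debug filestore = 20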
I immediately noticed that the used space had dropped to 64%:
/dev/sda1 1.9T 1.2T 673G 64% /var/lib/ceph/osd/ceph-0
So the osd is still getting full [of what?] and after 40 minutes it starts to log only this line every 5 seconds:
2013-03-06 15:15:47.311447 7f0d55185700 0 -- 192.168.21.134:6808/16648 >> 192.168.21.134:6828/20837 pipe(0x18252780 sd=24 :46533 s=1 pgs=4821 cs=4 l=0).connect claims to be 192.168.21.134:6828/20733 not 192.168.21.134:6828/20837 - wrong node!
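In case it helps, I can also check which daemon instance the cluster map currently lists for that address, and which ceph-osd processes are actually running on the node (just a sketch of what I plan to run; I'm not sure this is the right way to diagnose the "wrong node!" messages):

# which osd does the osdmap report at that address?
ceph osd dump | grep '192.168.21.134:6828'

# which ceph-osd processes are actually running, with their pids
ps ax | grep [c]eph-osd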
2013/3/6 Greg Farnum <greg@xxxxxxxxxxx>
Ah, this is interesting — the ceph-osd processes are using up the time, not the filesystem or something. However, I don't see any reason for that in a brief look at the OSD log here — can you describe what you did to the OSD during that logging period? (In particular I see a lot of pg_log messages, but not the sub op messages that would be associated with this OSD doing a deep scrub, nor the internal heartbeat timeouts that the other OSDs were generating.)

On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> Hi,
> I've collected a osd log with these parameters:
>
> debug osd = 20
> debug ms = 1
> debug filestore = 20
>
> You can download it from here:
> https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
>
> I have also captured a video to show the behavior in realtime: http://youtu.be/708AI8PGy7k
>
-Greg