Re: OSD trashed by simple reboot (Debian Jessie, systemd?)

Christian Balzer <chibi@xxxxxxx> · Thu, 4 Jun 2015 15:57:17 +0900

Hello,

Actually, after going through the changelogs with a fine-tooth comb and
the ole Mark I eyeball, I think I might be seeing this:
---
osd: fix journal direct-io shutdown (#9073 Mark Kirkwood, Ma Jianpeng, Somnath Roy)
---

The details in the various related bug reports certainly make it look
like the same problem.
Funny that nobody involved in those bug reports noticed the similarity.

Now I wouldn't have installed 0.80.8 due to the speed regression bug
anyway, but now that 0.80.9 has made it into Jessie backports I shall
install that tomorrow and hopefully never see that problem again.

Christian

On Thu, 28 May 2015 07:01:15 -0700 Gregory Farnum wrote:

> On Thu, May 28, 2015 at 12:22 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello Greg,
> >
> > On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
> >
> >> The description of the logging abruptly ending and the journal being
> >> bad really sounds like part of the disk is going back in time. I'm not
> >> sure if XFS internally is set up in such a way that something like
> >> losing part of its journal would allow that?
> >>
> > I'm special. ^o^
> > No XFS, EXT4. As stated in the original thread, below.
> > And the (OSD) journal is a raw partition on a DC S3700.
> >
> > And since there was at least a 30-second pause between the completion
> > of the "/etc/init.d/ceph stop" and issuing of the shutdown command, the
> > logging abruptly ending seems unlikely to be related to the shutdown at
> > all.
> 
> Oh, sorry...
> I happened to read this article last night:
> http://lwn.net/SubscriberLink/645720/01149aa7c58954eb/
> 
> Depending on configuration (I think you'd need to have a
> journal-as-file) you could be experiencing that. And again, not many
> people use ext4 so who knows what other ways there are of things being
> broken that nobody else has seen yet.
> 
> >
> >> If any of the OSD developers have the time it's conceivable a copy of
> >> the OSD journal would be enlightening (if e.g. the header offsets are
> >> wrong but there are a bunch of valid journal entries), but this is two
> >> reports of this issue from you and none very similar from anybody
> >> else. I'm still betting on something in the software or hardware stack
> >> misbehaving. (There aren't that many people running Debian; there are
> >> lots of people running Ubuntu and we find bad XFS kernels there not
> >> infrequently; I think you're hitting something like that.)
> >>
> > There should be no file system involved with the raw partition SSD
> > journal, n'est-ce pas?
> 
> ...and I guess probably you aren't since you are using partitions.
> 
> >
> > The hardware is vastly different, the previous case was on an AMD
> > system with onboard SATA (SP5100), this one is a SM storage goat with
> > LSI 3008.
> >
> > The only thing they have in common is the Ceph version 0.80.7 (via the
> > Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
> > (though there were minor updates on that between those incidents,
> > backported fixes)
> >
> > A copy of the journal would consist of the entire 10GB partition,
> > since we don't know where in the loop it was at the time, right?
> 
> Yeah.
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/