Hello,

Actually after going through the changelogs with a fine comb and the ole
Mark I eyeball I think I might be seeing this:
---
osd: fix journal direct-io shutdown (#9073 Mark Kirkwood, Ma Jianpeng, Somnath Roy)
---

The details in the various related bug reports certainly make it look
related.
Funny that nobody involved in those bug reports noticed the similarity.

Now I wouldn't have installed 0.80.8 due to the regression speed bug
anyway, but now that 0.80.9 has made it into Jessie backports I shall
install that tomorrow and hopefully never see that problem again.

Christian

On Thu, 28 May 2015 07:01:15 -0700 Gregory Farnum wrote:

> On Thu, May 28, 2015 at 12:22 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello Greg,
> >
> > On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
> >
> >> The description of the logging abruptly ending and the journal being
> >> bad really sounds like part of the disk is going back in time. I'm not
> >> sure if XFS internally is set up in such a way that something like
> >> losing part of its journal would allow that?
> >>
> > I'm special. ^o^
> > No XFS, EXT4. As stated in the original thread, below.
> > And the (OSD) journal is a raw partition on a DC S3700.
> >
> > And since there was at least a 30 seconds pause between the completion
> > of the "/etc/init.d/ceph stop" and issuing of the shutdown command, the
> > logging abruptly ending seems to be unlikely related to the shutdown at
> > all.
>
> Oh, sorry...
> I happened to read this article last night:
> http://lwn.net/SubscriberLink/645720/01149aa7c58954eb/
>
> Depending on configuration (I think you'd need to have a
> journal-as-file) you could be experiencing that. And again, not many
> people use ext4 so who knows what other ways there are of things being
> broken that nobody else has seen yet.
>
> >
> >> If any of the OSD developers have the time it's conceivable a copy of
> >> the OSD journal would be enlightening (if e.g. the header offsets are
> >> wrong but there are a bunch of valid journal entries), but this is two
> >> reports of this issue from you and none very similar from anybody
> >> else. I'm still betting on something in the software or hardware stack
> >> misbehaving. (There aren't that many people running Debian; there are
> >> lots of people running Ubuntu and we find bad XFS kernels there not
> >> infrequently; I think you're hitting something like that.)
> >>
> > There should be no file system involved with the raw partition SSD
> > journal, n'est-ce pas?
>
> ...and I guess probably you aren't since you are using partitions.
>
> >
> > The hardware is vastly different, the previous case was on an AMD
> > system with onboard SATA (SP5100), this one is a SM storage goat with
> > LSI 3008.
> >
> > The only thing they have in common is the Ceph version 0.80.7 (via the
> > Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
> > (though there were minor updates on that between those incidents,
> > backported fixes)
> >
> > A copy of the journal would consist of the entire 10GB partition,
> > since we don't know where in loop it was at the time, right?
>
> Yeah.
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/