Re: Possible corruption with MSSQL on RBD

On Sat, 22 Oct 2016, Dan van der Ster wrote:
> On Sat, Oct 22, 2016 at 5:34 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 22 October 2016 at 17:30, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >>
> >> On Fri, Oct 14, 2016 at 12:04 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >
> >> >> On 12 October 2016 at 17:57, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >>
> >> >>
> >> >>
> >> >> > On 12 October 2016 at 17:01, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >> >> >
> >> >> >
> >> >> > On Wed, Oct 12, 2016 at 7:57 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >> > > Hi,
> >> >> > >
> >> >> > > I filed an issue in the tracker, but I'm looking for some feedback to diagnose this a bit further: http://tracker.ceph.com/issues/17545
> >> >> > >
> >> >> > > The situation is that with Firefly or Hammer (haven't tested Jewel yet), an MSSQL server running on RBD will sometimes complain about corruption.
> >> >> > >
> >> >> > > Using SQLioSim we can reproduce the issue on a small Proxmox + Ceph cluster and after an hour or so it will yield:
> >> >> > >
> >> >> > > Expected FileId: 0x0
> >> >> > > Received FileId: 0x0
> >> >> > > Expected PageId: 0xCB19C
> >> >> > > Received PageId: 0xCB19A (does not match expected)
> >> >> > > Received CheckSum: 0x9F444071
> >> >> > > Calculated CheckSum: 0x89603EC9 (does not match expected)
> >> >> > > Received Buffer Length: 0x2000
> >> >> > >
> >> >> > > The issue only seems to happen with RBD caching enabled. When disabling the RBD cache or using cache=directsync we were not able to reproduce the issue.
> >> >> > >
> >> >> > > When using LVM/file based backends for Qemu the problem also didn't pop up.
> >> >> > >
> >> >> > > So this seems to be either a librbd issue or the RBD driver inside Qemu.
> >> >> > >
> >> >> > > Any hints on how to debug this further to find the root cause?
> >> >> >
> >> >> > If you've got control over the clients, try building with commit
> >> >> > 9ec6e7f608608088d51e449c9d375844631dcdde backported to them (I believe
> >> >> > it's also in the latest Hammer release, but maybe there hasn't been
> >> >> > one cut since the backport?); tracked at
> >> >> > http://tracker.ceph.com/issues/16002 but of course the web site is
> >> >> > dead so you can't look at that right now. :(
> >> >>
> >> >> I verified: 9ec6e7 is not in v0.94.9, but it is in the Jewel release.
> >> >>
> >> >> Tests are running with Jewel now and I will probably have the results tomorrow. If Jewel doesn't break, the commit you sent might indeed resolve it.
> >> >>
> >> >
> >> > The tests have been running for over 24 hours and all still looks good. We will let them run over the weekend to make sure it has been fixed.
> >>
> >> Is there still some follow up needed on this? It seems that latest
> >> hammer is still at risk for an RBD corruption bug, right?
> >>
> >> I've understood that you don't see the problem any more with jewel,
> >> but I didn't see any confirmation that 9ec6e7 is indeed the fix. Did
> >> you run any more testing of hammer + 9ec6e7?
> >>
> >
> > Oh, I updated the issues and not the ML. Yes, with hammer + that fix we no longer see the issue.
> >
> > The fix is now pending for 0.94.10
> 
> Excellent.
> 
> So, I don't really understand how this corruption is triggered. Is it
> something unique about the MSSQL workload that triggers it or is this
> a general problem that's probably going on undetected by other RBD
> use-cases?

It could affect other workloads as well.  You'll definitely want to 
upgrade once 0.94.10 is out.

sage
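
As a rough aid for the upgrade advice above, here is a minimal sketch of a check for whether a Hammer client predates 0.94.10. It assumes the version text looks like the output of `ceph --version` (for example "ceph version 0.94.9 (...)"). The function names and the FIXED_HAMMER constant are made up for illustration, and the sketch only covers the Hammer (0.94.x) series, since the thread names no fixed Firefly release.

```python
import re

# Assumption taken from this thread: the fix is pending for Hammer 0.94.10.
FIXED_HAMMER = (0, 94, 10)


def parse_ceph_version(text):
    """Extract (major, minor, patch) from text such as 'ceph version 0.94.9 (...)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def hammer_needs_upgrade(text):
    """Return True only for a Hammer (0.94.x) version older than 0.94.10."""
    version = parse_ceph_version(text)
    return version[:2] == (0, 94) and version < FIXED_HAMMER
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would sort "0.94.10" before "0.94.9".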
