Re: Possible corruption with MSSQL on RBD

On Fri, Oct 14, 2016 at 12:04 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> Op 12 oktober 2016 om 17:57 schreef Wido den Hollander <wido@xxxxxxxx>:
>>
>>
>>
>> > Op 12 oktober 2016 om 17:01 schreef Gregory Farnum <gfarnum@xxxxxxxxxx>:
>> >
>> >
>> > On Wed, Oct 12, 2016 at 7:57 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> > > Hi,
>> > >
>> > > I filed an issue in the tracker, but I'm looking for some feedback to diagnose this a bit further: http://tracker.ceph.com/issues/17545
>> > >
>> > > The situation is that on Firefly or Hammer (we haven't tested Jewel yet) an MSSQL server running on RBD will sometimes complain about corruption.
>> > >
>> > > Using SQLioSim we can reproduce the issue on a small Proxmox + Ceph cluster and after an hour or so it will yield:
>> > >
>> > > Expected FileId: 0x0
>> > > Received FileId: 0x0
>> > > Expected PageId: 0xCB19C
>> > > Received PageId: 0xCB19A (does not match expected)
>> > > Received CheckSum: 0x9F444071
>> > > Calculated CheckSum: 0x89603EC9 (does not match expected)
>> > > Received Buffer Length: 0x2000
>> > >
>> > > The issue only seems to happen with RBD caching enabled. When disabling the RBD cache or using cache=directsync we were not able to reproduce the issue.
>> > >
>> > > When using LVM or file-based backends for Qemu, the problem also didn't pop up.
>> > >
>> > > So this seems to be either a librbd issue or a bug in Qemu's RBD driver.
>> > >
>> > > Any hints on how to debug this further to find the root cause?
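For reference, the caching behaviour compared above is controlled by two knobs: the Qemu drive cache mode and librbd's own client-side cache setting. A sketch of both (option names are from standard Qemu/Ceph usage; pool/image names and values here are illustrative):

```
# Qemu drive cache modes, as compared in the thread:
#   cache=writeback    -> librbd write-back cache active (corruption reproduced)
#   cache=directsync   -> cache bypassed (no corruption seen)
-drive file=rbd:pool/image,format=raw,cache=directsync

# Equivalent client-side setting in ceph.conf, disabling the
# librbd write-back cache entirely regardless of Qemu's mode:
[client]
rbd cache = false
```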
>> >
>> > If you've got control over the clients, try building with commit
>> > 9ec6e7f608608088d51e449c9d375844631dcdde backported to them (I believe
>> > it's also in the latest Hammer release, but maybe there hasn't been
>> > one cut since the backport?); tracked at
>> > http://tracker.ceph.com/issues/16002 but of course the web site is
>> > dead so you can't look at that right now. :(
>>
>> I verified: 9ec6e7 is not in v0.94.9, but it is in the Jewel release.
>>
>> Tests are running with Jewel now and I will probably have the results tomorrow. If Jewel doesn't break, the commit you sent might indeed resolve it.
>>
>
> The tests have been running for over 24 hours and all still looks good. We will let them run over the weekend to make sure it has been fixed.

Is there still some follow-up needed on this? It seems that the latest
Hammer is still at risk of an RBD corruption bug, right?

I understand that you no longer see the problem with Jewel, but I
didn't see any confirmation that 9ec6e7 is indeed the fix. Did you run
any more testing of Hammer + 9ec6e7?

Cheers, Dan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
