Hi, Nick
I switched the cache-mode between forward and writeback (forward -> writeback).
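
(For reference, the flip between the two modes is done with the standard tier
commands; a rough sketch only, with "ssd-cache" standing in for the real cache
pool name:)

   # put the cache tier into writeback so hot objects get promoted
   ceph osd tier cache-mode ssd-cache writeback
   # ...and later back to forward so IO is proxied to the backing pool again
   ceph osd tier cache-mode ssd-cache forward
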
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757
2016-03-17 16:10 GMT+03:00 Nick Fisk <nick@xxxxxxxxxx>:
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Irek Fasikhov
> Sent: 17 March 2016 13:00
> To: Sage Weil <sweil@xxxxxxxxxx>
> Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; ceph-users <ceph-
> users@xxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>; William Perkins
> <william.perkins@xxxxxxxxxxxxx>
> Subject: Re: data corruption with hammer
>
> Hi, all.
>
> I can confirm the problem: when min_read_recency_for_promote > 1, we see data
> corruption.
> But what scenario is this? Are you switching between forward and writeback, or just running in writeback?
>
>
> Best regards, Фасихов Ирек Нургаязович
> Mobile: +79229045757
>
> 2016-03-17 15:26 GMT+03:00 Sage Weil <sweil@xxxxxxxxxx>:
> On Thu, 17 Mar 2016, Nick Fisk wrote:
> > There's got to be something else going on here. All that PR does is
> > potentially delay the promotion to hit_set_period*recency instead of
> > just doing it on the 2nd read regardless; it's got to be uncovering
> > another bug.
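
(To put rough numbers on that: with, say, hit_set_period = 600 and
min_read_recency_for_promote = 6, a promotion can be deferred for up to about
600 * 6 = 3600 seconds of proxied reads, whereas with recency = 1 it still
happens on the second read; the figures are only illustrative.)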
> >
> > Do you see the same problem if the cache is in writeback mode before you
> > start the unpacking? I.e. is it the switching mid-operation which causes
> > the problem? If it only happens mid-operation, does it still occur if
> > you pause IO when you make the switch?
> >
> > Do you also see this if you perform the test on an RBD mount, to rule out any
> > librbd/qemu weirdness?
> >
> > Do you know if it’s the actual data that is getting corrupted or if it's
> > the FS metadata? I'm only wondering as unpacking should really only be
> > writing to each object a couple of times, whereas FS metadata could
> > potentially be being updated+read back lots of times for the same group
> > of objects and ordering is very important.
> >
> > Thinking through it logically the only difference is that with recency=1
> > the object will be copied up to the cache tier, where recency=6 it will
> > be proxy read for a long time. If I had to guess I would say the issue
> > would lie somewhere in the proxy read + writeback<->forward logic.
>
> That seems reasonable. Was switching from writeback -> forward always
> part of the sequence that resulted in corruption? Note that there is a
> known ordering issue when switching to forward mode. I wouldn't really
> expect it to bite real users but it's possible..
>
> http://tracker.ceph.com/issues/12814
>
> I've opened a ticket to track this:
>
> http://tracker.ceph.com/issues/15171
>
> What would be *really* great is if you could reproduce this with a
> ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados
> running, and then find the sequence of operations that are sufficient to
> trigger a failure.
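
(For anyone trying this, an invocation along these lines is a reasonable
starting point; the option names may vary between versions, and the pool name
is just a placeholder for the base pool that has the cache tier in front of it:)

   # mixed read/write/delete workload, weights similar to what the qa suites use
   ceph_test_rados --pool rbd \
       --max-ops 4000 --objects 500 --max-in-flight 16 \
       --op read 100 --op write 100 --op delete 50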
>
> sage
>
>
>
> >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > > Mike Lovell
> > > Sent: 16 March 2016 23:23
> > > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>; sweil@xxxxxxxxxx
> > > Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; William Perkins
> > > <william.perkins@xxxxxxxxxxxxx>
> > > Subject: Re: data corruption with hammer
> > >
> > > just got done with a test against a build of 0.94.6 minus the two commits
> > > that were backported in PR 7207. everything worked as it should with the
> > > cache-mode set to writeback and the min_read_recency_for_promote set to 2.
> > > assuming it works properly on master, there must be a commit that we're
> > > missing on the backport to support this properly.
> > >
> > > sage,
> > > i'm adding you to the recipients on this so hopefully you see it. the tl;dr
> > > version is that the backport of the cache recency fix to hammer doesn't
> > > work right and potentially corrupts data when
> > > the min_read_recency_for_promote is set to greater than 1.
> > >
> > > mike
> > >
> > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > robert and i have done some further investigation the past couple days on
> > > this. we have a test environment with a hard drive tier and an ssd tier as a
> > > cache. several vms were created with volumes from the ceph cluster. i did a
> > > test in each guest where i un-tarred the linux kernel source multiple times
> > > and then did a md5sum check against all of the files in the resulting source
> > > tree. i started off with the monitors and osds running 0.94.5 and never saw
> > > any problems.
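
(A guest-side check along these lines should reproduce what is described above;
the mount point and tarball name are placeholders, not the exact ones used:)

   # unpack the kernel source a few times, then checksum everything
   for i in 1 2 3; do
       mkdir -p /mnt/test/run$i && tar -xf linux-4.4.tar.xz -C /mnt/test/run$i
   done
   # corruption shows up as checksum mismatches between the runs
   ( cd /mnt/test/run1 && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/run1.md5
   ( cd /mnt/test/run2 && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/run2.md5
   diff /tmp/run1.md5 /tmp/run2.md5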
> > >
> > > a single node was then upgraded to 0.94.6 which has osds in both the ssd and
> > > hard drive tier. i then proceeded to run the same test and, while the untar
> > > and md5sum operations were running, i changed the ssd tier cache-mode
> > > from forward to writeback. almost immediately the vms started reporting io
> > > errors and odd data corruption. the remainder of the cluster was updated to
> > > 0.94.6, including the monitors, and the same thing happened.
> > >
> > > things were cleaned up and reset and then a test was run
> > > where min_read_recency_for_promote for the ssd cache pool was set to 1.
> > > we previously had it set to 6. there was never an error with the recency
> > > setting set to 1. i then tested with it set to 2 and it immediately caused
> > > failures. we are currently thinking that it is related to the backport of the fix
> > > for the recency promotion and are in the process of making a .6 build without
> > > that backport to see if we can cause corruption. is anyone using a version
> > > from after the original recency fix (PR 6702) with a cache tier in writeback
> > > mode? anyone have a similar problem?
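
(The recency value being toggled here is a per-pool setting; roughly, with
"ssd-cache" as a placeholder for the cache pool name:)

   # 1 behaved correctly in the tests above
   ceph osd pool set ssd-cache min_read_recency_for_promote 1
   # 2 (or higher) reproduced the corruption
   ceph osd pool set ssd-cache min_read_recency_for_promote 2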
> > >
> > > mike
> > >
> > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
> > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > something weird happened on one of the ceph clusters that i administer
> > > tonight which resulted in virtual machines using rbd volumes seeing
> > > corruption in multiple forms.
> > >
> > > when everything was fine earlier in the day, the cluster consisted of a number
> > > of storage nodes spread across 3 different roots in the crush map. the first
> > > bunch of storage nodes have both hard drives and ssds in them with the hard
> > > drives in one root and the ssds in another. there is a pool for each and the
> > > pool for the ssds is a cache tier for the hard drives. the last set of storage
> > > nodes were in a separate root with their own pool that is being used for
> > > burn-in testing.
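
(For context, a tier arranged as described above is built with commands roughly
like the following; "rbd" and "ssd-cache" are placeholder pool names:)

   # attach the ssd pool as a cache tier in front of the hard drive pool
   ceph osd tier add rbd ssd-cache
   ceph osd tier cache-mode ssd-cache writeback
   ceph osd tier set-overlay rbd ssd-cache
   # hit set settings feed the recency-based promotion decisions
   ceph osd pool set ssd-cache hit_set_type bloom
   ceph osd pool set ssd-cache hit_set_count 6
   ceph osd pool set ssd-cache hit_set_period 600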
> > >
> > > these nodes had run for a while with test traffic and we decided to move
> > > them to the main root and pools. the main cluster is running 0.94.5 and the
> > > new nodes got 0.94.6 due to them getting configured after that was
> > > released. i removed the test pool and did a ceph osd crush move to move
> > > the first node into the main cluster, the hard drives into the root for that tier
> > > of storage and the ssds into the root and pool for the cache tier. each set was
> > > done about 45 minutes apart and they ran for a couple hours while
> > > performing backfill without any issue other than high load on the cluster.
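
(The crush moves mentioned above would look something like this; host and root
names are made up for illustration:)

   # move the new node's hard drive host entry into the main hdd root
   ceph osd crush move node01 root=hdd
   # move its ssd host entry into the root backing the cache tier
   ceph osd crush move node01-ssd root=ssd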
> > >
> > > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > > have not being able to keep up with the io of writeback. this results in io on
> > > the hard drives slowly going up and performance of the cluster starting to
> > > suffer. about once a week, i change the cache-mode between writeback and
> > > forward for short periods of time to promote actively used data to the cache
> > > tier. this moves io load from the hard drive tier to the ssd tier and has been
> > > done multiple times without issue. i normally don't do this while there are
> > > backfills or recoveries happening on the cluster but decided to go ahead
> > > while backfill was happening due to the high load.
> > >
> > > i tried this procedure to change the ssd cache-tier between writeback and
> > > forward cache-mode and things seemed okay from the ceph cluster. about
> > > 10 minutes after the first attempt at changing the mode, vms using the ceph
> > > cluster for their storage started seeing corruption in multiple forms. the
> > > mode was flipped back and forth multiple times in that time frame and it's
> > > unknown if the corruption was noticed with the first change or subsequent
> > > changes. the vms were having issues of filesystems having errors and getting
> > > remounted RO and mysql databases seeing corruption (both myisam and
> > > innodb). some of this was recoverable but on some filesystems there was
> > > corruption that led to things like lots of data ending up in the lost+found and
> > > some of the databases were un-recoverable (backups are helping there).
> > >
> > > i'm not sure what would have happened to cause this corruption. the libvirt
> > > logs for the qemu processes for the vms did not provide any output of
> > > problems from the ceph client code. it doesn't look like any of the qemu
> > > processes had crashed. also, it has now been several hours since this
> > > happened with no additional corruption noticed by the vms. it doesn't
> > > appear that we had any corruption happen before i attempted the flipping of
> > > the ssd tier cache-mode.
> > >
> > > the only thing i can think of that is different between this time doing this
> > > procedure vs previous attempts was that there was the one storage node
> > > running 0.94.6 while the remainder were running 0.94.5. is it possible that
> > > something changed between these two releases that would have caused
> > > problems with data consistency related to the cache tier? or otherwise? any
> > > other thoughts or suggestions?
> > >
> > > thanks in advance for any help you can provide.
> > >
> > > mike
> > >
> >
> >
> >
> >
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com