On Thu, 17 Mar 2016, Nick Fisk wrote:
> There's got to be something else going on here. All that PR does is
> potentially delay the promotion to hit_set_period*recency instead of just
> doing it on the 2nd read regardless, so it's got to be uncovering another
> bug.
>
> Do you see the same problem if the cache is in writeback mode before you
> start the unpacking? I.e. is it the switching mid-operation which causes
> the problem? If it only happens mid-operation, does it still occur if you
> pause IO when you make the switch?
>
> Do you also see this if you perform it on an RBD mount, to rule out any
> librbd/qemu weirdness?
>
> Do you know if it's the actual data that is getting corrupted or if it's
> the FS metadata? I'm only wondering as unpacking should really only be
> writing to each object a couple of times, whereas FS metadata could
> potentially be updated+read back lots of times for the same group of
> objects, and ordering is very important.
>
> Thinking through it logically, the only difference is that with recency=1
> the object will be copied up to the cache tier, whereas with recency=6 it
> will be proxy read for a long time. If I had to guess I would say the
> issue would lie somewhere in the proxy read + writeback<->forward logic.

That seems reasonable. Was switching from writeback -> forward always part
of the sequence that resulted in corruption? Note that there is a known
ordering issue when switching to forward mode. I wouldn't really expect it
to bite real users but it's possible..

    http://tracker.ceph.com/issues/12814

I've opened a ticket to track this:

    http://tracker.ceph.com/issues/15171

What would be *really* great is if you could reproduce this with a
ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados
running, and then find the sequence of operations that are sufficient to
trigger a failure. (There's a rough example invocation at the end of this
mail.)

sage


> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > Mike Lovell
> > Sent: 16 March 2016 23:23
> > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>; sweil@xxxxxxxxxx
> > Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; William Perkins
> > <william.perkins@xxxxxxxxxxxxx>
> > Subject: Re: data corruption with hammer
> >
> > just got done with a test against a build of 0.94.6 minus the two
> > commits that were backported in PR 7207. everything worked as it should
> > with the cache-mode set to writeback and the
> > min_read_recency_for_promote set to 2. assuming it works properly on
> > master, there must be a commit that we're missing on the backport to
> > support this properly.
> >
> > sage,
> > i'm adding you to the recipients on this so hopefully you see it. the
> > tl;dr version is that the backport of the cache recency fix to hammer
> > doesn't work right and potentially corrupts data when
> > min_read_recency_for_promote is set to greater than 1.
> >
> > mike
> >
> > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > robert and i have done some further investigation the past couple days
> > on this. we have a test environment with a hard drive tier and an ssd
> > tier as a cache. several vms were created with volumes from the ceph
> > cluster. i did a test in each guest where i un-tarred the linux kernel
> > source multiple times and then did an md5sum check against all of the
> > files in the resulting source tree. i started off with the monitors and
> > osds running 0.94.5 and never saw any problems.
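> >
> > for reference, the tiering setup in this test environment is just the
> > standard cache tier commands, roughly like the following (the pool names
> > and hit_set numbers here are placeholders rather than the exact values
> > we use):
> >
> >     ceph osd tier add hdd-pool ssd-cache
> >     ceph osd tier cache-mode ssd-cache writeback
> >     ceph osd tier set-overlay hdd-pool ssd-cache
> >     ceph osd pool set ssd-cache hit_set_type bloom
> >     ceph osd pool set ssd-cache hit_set_count 6
> >     ceph osd pool set ssd-cache hit_set_period 600
> >     ceph osd pool set ssd-cache min_read_recency_for_promote 2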
> >
> > a single node was then upgraded to 0.94.6 which has osds in both the ssd
> > and hard drive tier. i then proceeded to run the same test and, while
> > the untar and md5sum operations were running, i changed the ssd tier
> > cache-mode from forward to writeback. almost immediately the vms started
> > reporting io errors and odd data corruption. the remainder of the
> > cluster was updated to 0.94.6, including the monitors, and the same
> > thing happened.
> >
> > things were cleaned up and reset and then a test was run where
> > min_read_recency_for_promote for the ssd cache pool was set to 1. we
> > previously had it set to 6. there was never an error with the recency
> > setting set to 1. i then tested with it set to 2 and it immediately
> > caused failures. we are currently thinking that it is related to the
> > backport of the fix for the recency promotion and are in the process of
> > making a .6 build without that backport to see if we can cause
> > corruption. is anyone using a version from after the original recency
> > fix (PR 6702) with a cache tier in writeback mode? anyone have a similar
> > problem?
> >
> > mike
> >
> > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
> > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the
> > first bunch of storage nodes have both hard drives and ssds in them with
> > the hard drives in one root and the ssds in another. there is a pool for
> > each and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn-in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster, the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart and they ran for a couple
> > hours while performing backfill without any issue other than high load
> > on the cluster.
> >
> > we normally run the ssd tier in the forward cache-mode due to the ssds
> > we have not being able to keep up with the io of writeback. this results
> > in io on the hard drives slowly going up and performance of the cluster
> > starting to suffer. about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
> >
> > i tried this procedure to change the ssd cache-tier between writeback
> > and forward cache-mode and things seemed okay from the ceph cluster.
> > about 10 minutes after the first attempt at changing the mode, vms using
> > the ceph cluster for their storage started seeing corruption in multiple
> > forms.
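> >
> > the procedure itself is nothing more than the tier cache-mode command
> > run in each direction, something roughly like this (ssd-cache is a
> > placeholder for our actual cache pool name):
> >
> >     ceph osd tier cache-mode ssd-cache writeback
> >     # ... leave it for a while so hot objects get promoted ...
> >     ceph osd tier cache-mode ssd-cache forward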
> >
> > the mode was flipped back and forth multiple times in that time frame
> > and it's unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues of filesystems having
> > errors and getting remounted RO and mysql databases seeing corruption
> > (both myisam and innodb). some of this was recoverable but on some
> > filesystems there was corruption that led to things like lots of data
> > ending up in the lost+found and some of the databases were
> > un-recoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption. the
> > libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only thing i can think of that is different between this time doing
> > this procedure vs previous attempts was that there was the one storage
> > node running 0.94.6 where the remainder were running 0.94.5. is it
> > possible that something changed between these two releases that would
> > have caused problems with data consistency related to the cache tier? or
> > otherwise? any other thoughts or suggestions?
> >
> > thanks in advance for any help you can provide.
> >
> > mike
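
For anyone wanting to try the ceph_test_rados reproduction, a mixed
read/write run against the base pool would look something like the
following (the op weights and counts are only an example, and the exact
options may differ between builds -- check ceph_test_rados --help):

    ceph_test_rados --pool <base-pool> \
        --max-ops 10000 --objects 500 --max-in-flight 16 \
        --op read 100 --op write 100 --op delete 50

Then flip the cache tier between writeback and forward while it is running
and see whether it reports an inconsistency.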