Re: data corruption with hammer

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Thu, 17 Mar 2016 10:19:01 -0600

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I'm having trouble finding documentation about using ceph_test_rados. Can I run this on the existing cluster and will that provide useful info? It seems running it in the build will not have the caching set up (vstart.sh).

I have accepted a job with another company and only have until Wednesday to help with getting information about this bug. My new job will not be using Ceph, so I won't be able to provide any additional info after Tuesday. I want to leave the company on a good trajectory for upgrading, so any input you can provide will be helpful.

I've found:

./ceph_test_rados --op read 100 --op write 100 --op delete 50
- --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000
- --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600
- --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op
rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0

Is that enough if I change --pool to the cached pool and do the toggling while ceph_test_rados is running? I think this will run for 10 minutes.

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW6tjwCRDmVDuy+mK58QAANKgP/ia5TA/7kTUpmciVR2BW
t0MrilXAIvdikHlaWTVIxEmb4S8X+57hziEZUd6hLBMnKnuUQxsDb3yyuZX4
iqaE8KBXDjMFjHnhTOFf7eB2JIjM1WkZxmlA23yBRMNtvlBArbwxYYnAyTXt
/fW1QmgLZIvuql1y01TdRot/owqJ3B2Ah896lySrltWj626R+1rhTLVDWYr6
EKa1mf8BiRBeGpjEVhN6Vihb7T1IzHtCi1E6+mlSqhWGNf8AeZh8IKUT0tbm
C/JiUVGmG8/t7WFzCiQWd1w8UdkdCzms7k662CsSLIpbjNo4ouwEkpb5sZLP
ELgWxo8hvad7USqSXvXqJNzmoenUwQwdUvSjYbNk+4D+8eHqptlNXDmDfpiE
pN7dp8wbJ+yICxMPLuUe/Iqzp6rRnjPwam/CiDZu52N1ncH3X1X4u0cuAD0Z
dFjEfdAZJAJ+fqvts2zVvtOwq/q41eTuV3ZRSn5ubA6iAeKnxMtPoEcuozEp
Su1Iud2fYdma5w8MFStjp1BAV3osg1WgIM6KYzsSZI1BkCQAqU58ROZ0ZsMb
D05/AEK/A6fp0ROXUczhXDcXlXcGEWyJm1QEtg7cSu3C+9qu5qvQQxyrrwbZ
MK8C5lhVb44sqSVcSIZ+KCrPC+x8UKodDQZCz6O6NrJjZLn2g06583cMFWK8
qLo+
=qgB7
-----END PGP SIGNATURE-----

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
On Thu, 17 Mar 2016, Robert LeBlanc wrote:

> We are trying to figure out how to use rados bench to reproduce. Ceph

> itself doesn't seem to think there is any corruption, but when you do a

> verify inside the RBD, there is. Can rados bench verify the objects after

> they are written? It also seems to be primarily the filesystem metadata

> that is corrupted. If we fsck the volume, there is missing data (put into

> lost+found), but if it is there it is primarily OK. There only seems to be

> a few cases where a file's contents are corrupted. I would suspect on an

> object boundary. We would have to look at blockinfo to map that out and see

> if that is what is happening.

'rados bench' doesn't do validation.  ceph_test_rados does, though--if you

can reproduce with that workload then it should be pretty easy to track

down.

Thanks!

sage

> We stopped all the IO and did put the tier in writeback mode with recency

> 1,  set the recency to 2 and started the test and there was corruption, so

> it doesn't seem to be limited to changing the mode. I don't know how that

> patch could cause the issue either. Unless there is a bug that reads from

> the back tier, but writes to cache tier, then the object gets promoted

> wiping that last write, but then it seems like it should not be as much

> corruption since the metadata should be in the cache pretty quick. We

> usually evited the cache before each try so we should not be evicting on

> writeback.

>

> Sent from a mobile device, please excuse any typos.

> On Mar 17, 2016 6:26 AM, "Sage Weil" <sweil@xxxxxxxxxx> wrote:

>

> > On Thu, 17 Mar 2016, Nick Fisk wrote:

> > > There is got to be something else going on here. All that PR does is to

> > > potentially delay the promotion to hit_set_period*recency instead of

> > > just doing it on the 2nd read regardless, it's got to be uncovering

> > > another bug.

> > >

> > > Do you see the same problem if the cache is in writeback mode before you

> > > start the unpacking. Ie is it the switching mid operation which causes

> > > the problem? If it only happens mid operation, does it still occur if

> > > you pause IO when you make the switch?

> > >

> > > Do you also see this if you perform on a RBD mount, to rule out any

> > > librbd/qemu weirdness?

> > >

> > > Do you know if it’s the actual data that is getting corrupted or if it's

> > > the FS metadata? I'm only wondering as unpacking should really only be

> > > writing to each object a couple of times, whereas FS metadata could

> > > potentially be being updated+read back lots of times for the same group

> > > of objects and ordering is very important.

> > >

> > > Thinking through it logically the only difference is that with recency=1

> > > the object will be copied up to the cache tier, where recency=6 it will

> > > be proxy read for a long time. If I had to guess I would say the issue

> > > would lie somewhere in the proxy read + writeback<->forward logic.

> >

> > That seems reasonable.  Was switching from writeback -> forward always

> > part of the sequence that resulted in corruption?  Not that there is a

> > known ordering issue when switching to forward mode.  I wouldn't really

> > expect it to bite real users but it's possible..

> >

> >         http://tracker.ceph.com/issues/12814

> >

> > I've opened a ticket to track this:

> >

> >         http://tracker.ceph.com/issues/15171

> >

> > What would be *really* great is if you could reproduce this with a

> > ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados

> > running, and then find the sequence of operations that are sufficient to

> > trigger a failure.

> >

> > sage

> >

> >

> >

> >  >

> > >

> > >

> > > > -----Original Message-----

> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf

> > Of

> > > > Mike Lovell

> > > > Sent: 16 March 2016 23:23

> > > > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>; sweil@xxxxxxxxxx

> > > > Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; William Perkins

> > > > <william.perkins@xxxxxxxxxxxxx>

> > > > Subject: Re:  data corruption with hammer

> > > >

> > > > just got done with a test against a build of 0.94.6 minus the two

> > commits that

> > > > were backported in PR 7207. everything worked as it should with the

> > cache-

> > > > mode set to writeback and the min_read_recency_for_promote set to 2.

> > > > assuming it works properly on master, there must be a commit that we're

> > > > missing on the backport to support this properly.

> > > >

> > > > sage,

> > > > i'm adding you to the recipients on this so hopefully you see it. the

> > tl;dr

> > > > version is that the backport of the cache recency fix to hammer

> > doesn't work

> > > > right and potentially corrupts data when

> > > > the min_read_recency_for_promote is set to greater than 1.

> > > >

> > > > mike

> > > >

> > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell

> > > > <mike.lovell@xxxxxxxxxxxxx> wrote:

> > > > robert and i have done some further investigation the past couple days

> > on

> > > > this. we have a test environment with a hard drive tier and an ssd

> > tier as a

> > > > cache. several vms were created with volumes from the ceph cluster. i

> > did a

> > > > test in each guest where i un-tarred the linux kernel source multiple

> > times

> > > > and then did a md5sum check against all of the files in the resulting

> > source

> > > > tree. i started off with the monitors and osds running 0.94.5 and

> > never saw

> > > > any problems.

> > > >

> > > > a single node was then upgraded to 0.94.6 which has osds in both the

> > ssd and

> > > > hard drive tier. i then proceeded to run the same test and, while the

> > untar

> > > > and md5sum operations were running, i changed the ssd tier cache-mode

> > > > from forward to writeback. almost immediately the vms started

> > reporting io

> > > > errors and odd data corruption. the remainder of the cluster was

> > updated to

> > > > 0.94.6, including the monitors, and the same thing happened.

> > > >

> > > > things were cleaned up and reset and then a test was run

> > > > where min_read_recency_for_promote for the ssd cache pool was set to 1.

> > > > we previously had it set to 6. there was never an error with the

> > recency

> > > > setting set to 1. i then tested with it set to 2 and it immediately

> > caused

> > > > failures. we are currently thinking that it is related to the backport

> > of the fix

> > > > for the recency promotion and are in progress of making a .6 build

> > without

> > > > that backport to see if we can cause corruption. is anyone using a

> > version

> > > > from after the original recency fix (PR 6702) with a cache tier in

> > writeback

> > > > mode? anyone have a similar problem?

> > > >

> > > > mike

> > > >

> > > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell

> > > > <mike.lovell@xxxxxxxxxxxxx> wrote:

> > > > something weird happened on one of the ceph clusters that i administer

> > > > tonight which resulted in virtual machines using rbd volumes seeing

> > > > corruption in multiple forms.

> > > >

> > > > when everything was fine earlier in the day, the cluster was a number

> > of

> > > > storage nodes spread across 3 different roots in the crush map. the

> > first

> > > > bunch of storage nodes have both hard drives and ssds in them with the

> > hard

> > > > drives in one root and the ssds in another. there is a pool for each

> > and the

> > > > pool for the ssds is a cache tier for the hard drives. the last set of

> > storage

> > > > nodes were in a separate root with their own pool that is being used

> > for burn

> > > > in testing.

> > > >

> > > > these nodes had run for a while with test traffic and we decided to

> > move

> > > > them to the main root and pools. the main cluster is running 0.94.5

> > and the

> > > > new nodes got 0.94.6 due to them getting configured after that was

> > > > released. i removed the test pool and did a ceph osd crush move to move

> > > > the first node into the main cluster, the hard drives into the root

> > for that tier

> > > > of storage and the ssds into the root and pool for the cache tier.

> > each set was

> > > > done about 45 minutes apart and they ran for a couple hours while

> > > > performing backfill without any issue other than high load on the

> > cluster.

> > > >

> > > > we normally run the ssd tier in the forward cache-mode due to the ssds

> > we

> > > > have not being able to keep up with the io of writeback. this results

> > in io on

> > > > the hard drives slowing going up and performance of the cluster

> > starting to

> > > > suffer. about once a week, i change the cache-mode between writeback

> > and

> > > > forward for short periods of time to promote actively used data to the

> > cache

> > > > tier. this moves io load from the hard drive tier to the ssd tier and

> > has been

> > > > done multiple times without issue. i normally don't do this while

> > there are

> > > > backfills or recoveries happening on the cluster but decided to go

> > ahead

> > > > while backfill was happening due to the high load.

> > > >

> > > > i tried this procedure to change the ssd cache-tier between writeback

> > and

> > > > forward cache-mode and things seemed okay from the ceph cluster. about

> > > > 10 minutes after the first attempt a changing the mode, vms using the

> > ceph

> > > > cluster for their storage started seeing corruption in multiple forms.

> > the

> > > > mode was flipped back and forth multiple times in that time frame and

> > its

> > > > unknown if the corruption was noticed with the first change or

> > subsequent

> > > > changes. the vms were having issues of filesystems having errors and

> > getting

> > > > remounted RO and mysql databases seeing corruption (both myisam and

> > > > innodb). some of this was recoverable but on some filesystems there was

> > > > corruption that lead to things like lots of data ending up in the

> > lost+found and

> > > > some of the databases were un-recoverable (backups are helping there).

> > > >

> > > > i'm not sure what would have happened to cause this corruption. the

> > libvirt

> > > > logs for the qemu processes for the vms did not provide any output of

> > > > problems from the ceph client code. it doesn't look like any of the

> > qemu

> > > > processes had crashed. also, it has now been several hours since this

> > > > happened with no additional corruption noticed by the vms. it doesn't

> > > > appear that we had any corruption happen before i attempted the

> > flipping of

> > > > the ssd tier cache-mode.

> > > >

> > > > the only think i can think of that is different between this time

> > doing this

> > > > procedure vs previous attempts was that there was the one storage node

> > > > running 0.94.6 where the remainder were running 0.94.5. is is possible

> > that

> > > > something changed between these two releases that would have caused

> > > > problems with data consistency related to the cache tier? or

> > otherwise? any

> > > > other thoughts or suggestions?

> > > >

> > > > thanks in advance for any help you can provide.

> > > >

> > > > mike

> > > >

> > >

> > >

> > >

> > >

> > _______________________________________________

> > ceph-users mailing list

> > ceph-users@xxxxxxxxxxxxxx

> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> >

> >

> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com