I'm having trouble finding documentation about using ceph_test_rados. Can
I run this on the existing cluster, and will that provide useful info? It
seems that running it in the build tree (vstart.sh) will not have the
caching set up.

I have accepted a job with another company and only have until Wednesday
to help with getting information about this bug. My new job will not be
using Ceph, so I won't be able to provide any additional info after
Tuesday. I want to leave the company on a good trajectory for upgrading,
so any input you can provide will be helpful.

I've found:

./ceph_test_rados --op read 100 --op write 100 --op delete 50 \
  --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000 \
  --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600 \
  --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op rollback 50 \
  --op setattr 25 --op rmattr 25 --pool unique_pool_0

Is that enough if I change --pool to the cached pool and do the toggling
while ceph_test_rados is running? I think this will run for 10 minutes.

Thanks,
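P.S. For whoever picks this up after Tuesday, here is the run I have in
mind, as a sketch only: the pool name below assumes our base pool (the one
the ssd cache tier is overlaid on) is called "rbd-data", which is a
placeholder, not its real name. Since clients always address the base pool
and the overlay redirects them to the tier, I would point ceph_test_rados
at the base pool and do the cache-mode toggling from a second shell while
it runs:

    ./ceph_test_rados --op read 100 --op write 100 --op delete 50 \
      --op copy_from 50 --op snap_create 50 --op snap_remove 50 \
      --op rollback 50 --op setattr 25 --op rmattr 25 \
      --max-ops 400000 --objects 1024 --max-in-flight 64 \
      --size 4000000 --min-stride-size 400000 --max-stride-size 800000 \
      --max-seconds 600 --pool rbd-data

--max-seconds 600 caps the run at 10 minutes even if the 400000 ops
haven't completed.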
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> We are trying to figure out how to use rados bench to reproduce. Ceph
> itself doesn't seem to think there is any corruption, but when you do a
> verify inside the RBD, there is. Can rados bench verify the objects after
> they are written? It also seems to be primarily the filesystem metadata
> that is corrupted. If we fsck the volume, there is missing data (put into
> lost+found), but the data that is still there is mostly OK. There only seem
> to be a few cases where a file's contents are corrupted, and I would suspect
> those fall on an object boundary. We would have to look at blockinfo to map
> that out and see if that is what is happening.
'rados bench' doesn't do validation. ceph_test_rados does, though--if you
can reproduce with that workload then it should be pretty easy to track
down.
Thanks!
sage
> We stopped all the IO, put the tier in writeback mode with recency 1, set
> the recency to 2, and started the test, and there was corruption, so it
> doesn't seem to be limited to changing the mode. I don't know how that
> patch could cause the issue either. Unless there is a bug that reads from
> the back tier but writes to the cache tier, and then the object gets
> promoted, wiping that last write; but then it seems like there should not
> be as much corruption, since the metadata should be in the cache pretty
> quickly. We usually evicted the cache before each try, so we should not be
> evicting on writeback.
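> For reference, the prep we do before each try looks roughly like this; the
> pool name is only a placeholder for our ssd cache pool, so treat it as a
> sketch:
>
>     # flush and evict everything from the cache tier so each run starts cold
>     rados -p ssd-cache cache-flush-evict-all
>
>     # set the recency under test and make sure the tier is in writeback
>     ceph osd pool set ssd-cache min_read_recency_for_promote 2
>     ceph osd tier cache-mode ssd-cache writeback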
>
> Sent from a mobile device, please excuse any typos.
> On Mar 17, 2016 6:26 AM, "Sage Weil" <sweil@xxxxxxxxxx> wrote:
>
> > On Thu, 17 Mar 2016, Nick Fisk wrote:
> > > There has got to be something else going on here. All that PR does is
> > > potentially delay the promotion to hit_set_period*recency instead of
> > > just doing it on the 2nd read regardless; it's got to be uncovering
> > > another bug.
> > >
> > > Do you see the same problem if the cache is in writeback mode before you
> > > start the unpacking? I.e. is it the switching mid-operation which causes
> > > the problem? If it only happens mid-operation, does it still occur if
> > > you pause IO when you make the switch?
> > >
> > > Do you also see this if you perform the test on a kernel RBD mount, to
> > > rule out any librbd/qemu weirdness?
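> > > Something like this would take librbd/qemu out of the picture entirely
> > > (pool and image names here are only placeholders):
> > >
> > >     rbd create --size 20480 rbd-data/krbd-test
> > >     rbd map rbd-data/krbd-test        # kernel rbd client instead of librbd
> > >     mkfs.xfs /dev/rbd0
> > >     mount /dev/rbd0 /mnt/krbd-test
> > >     # then repeat the untar + checksum workload against /mnt/krbd-test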
> > >
> > > Do you know if it's the actual data that is getting corrupted or if it's
> > > the FS metadata? I'm only wondering because unpacking should really only
> > > write to each object a couple of times, whereas FS metadata could
> > > potentially be updated and read back lots of times for the same group
> > > of objects, and ordering is very important.
> > >
> > > Thinking through it logically, the only difference is that with recency=1
> > > the object will be copied up to the cache tier, whereas with recency=6 it
> > > will be proxy-read for a long time. If I had to guess, I would say the
> > > issue lies somewhere in the proxy read + writeback<->forward logic.
> >
> > That seems reasonable. Was switching from writeback -> forward always
> > part of the sequence that resulted in corruption? Note that there is a
> > known ordering issue when switching to forward mode. I wouldn't really
> > expect it to bite real users, but it's possible:
> >
> > http://tracker.ceph.com/issues/12814
> >
> > I've opened a ticket to track this:
> >
> > http://tracker.ceph.com/issues/15171
> >
> > What would be *really* great is if you could reproduce this with a
> > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados
> > running, and then find the sequence of operations that are sufficient to
> > trigger a failure.
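> > Something along these lines is what I have in mind; the pool name is a
> > placeholder for your cache pool, so treat this as a sketch:
> >
> >     # leave ceph_test_rados running against the base pool, and in a
> >     # second shell flip the tier back and forth until something trips:
> >     while true; do
> >         ceph osd tier cache-mode ssd-cache forward
> >         sleep 60
> >         ceph osd tier cache-mode ssd-cache writeback
> >         sleep 60
> >     done
> >
> > ceph_test_rados verifies what it reads back against what it wrote, so the
> > last few ops it prints before failing should help narrow down the sequence.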
> >
> > sage
> >
> >
> >
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > > > Of Mike Lovell
> > > > Sent: 16 March 2016 23:23
> > > > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>; sweil@xxxxxxxxxx
> > > > Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; William Perkins
> > > > <william.perkins@xxxxxxxxxxxxx>
> > > > Subject: Re: data corruption with hammer
> > > >
> > > > just got done with a test against a build of 0.94.6 minus the two
> > > > commits that were backported in PR 7207. everything worked as it should
> > > > with the cache-mode set to writeback and the
> > > > min_read_recency_for_promote set to 2. assuming it works properly on
> > > > master, there must be a commit that we're missing on the backport to
> > > > support this properly.
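> > > > the way we produced that build was roughly the following; the commit
> > > > ids are placeholders for the two cherry-picks from PR 7207, so this is
> > > > a sketch of the procedure rather than the exact commands:
> > > >
> > > >     git checkout v0.94.6
> > > >     git revert --no-edit <backport-commit-1> <backport-commit-2>
> > > >     ./autogen.sh && ./configure && make   # hammer-era autotools build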
> > > >
> > > > sage,
> > > > i'm adding you to the recipients on this so hopefully you see it. the
> > > > tl;dr version is that the backport of the cache recency fix to hammer
> > > > doesn't work right and potentially corrupts data when
> > > > min_read_recency_for_promote is set to greater than 1.
> > > >
> > > > mike
> > > >
> > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > > robert and i have done some further investigation the past couple days
> > > > on this. we have a test environment with a hard drive tier and an ssd
> > > > tier as a cache. several vms were created with volumes from the ceph
> > > > cluster. i did a test in each guest where i un-tarred the linux kernel
> > > > source multiple times and then did a md5sum check against all of the
> > > > files in the resulting source tree. i started off with the monitors and
> > > > osds running 0.94.5 and never saw any problems.
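> > > > the in-guest workload was roughly the following; the paths and kernel
> > > > tarball version are placeholders for what we actually used:
> > > >
> > > >     # untar the kernel source a few times, build a checksum manifest
> > > >     # from the first copy, and verify the other copies against it
> > > >     for i in 1 2 3; do
> > > >         mkdir -p /mnt/test/run$i
> > > >         tar xf linux-4.4.tar.xz -C /mnt/test/run$i
> > > >     done
> > > >     (cd /mnt/test/run1/linux-4.4 && find . -type f -exec md5sum {} + > /tmp/manifest)
> > > >     (cd /mnt/test/run2/linux-4.4 && md5sum --quiet -c /tmp/manifest)
> > > >     (cd /mnt/test/run3/linux-4.4 && md5sum --quiet -c /tmp/manifest)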
> > > >
> > > > a single node was then upgraded to 0.94.6 which has osds in both the
> > > > ssd and hard drive tiers. i then proceeded to run the same test and,
> > > > while the untar and md5sum operations were running, i changed the ssd
> > > > tier cache-mode from forward to writeback. almost immediately the vms
> > > > started reporting io errors and odd data corruption. the remainder of
> > > > the cluster was updated to 0.94.6, including the monitors, and the same
> > > > thing happened.
> > > >
> > > > things were cleaned up and reset and then a test was run where
> > > > min_read_recency_for_promote for the ssd cache pool was set to 1. we
> > > > previously had it set to 6. there was never an error with the recency
> > > > setting set to 1. i then tested with it set to 2 and it immediately
> > > > caused failures. we are currently thinking that it is related to the
> > > > backport of the fix for the recency promotion and are in the process of
> > > > making a .6 build without that backport to see if we can cause
> > > > corruption. is anyone using a version from after the original recency
> > > > fix (PR 6702) with a cache tier in writeback mode? anyone have a
> > > > similar problem?
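> > > > for context, min_read_recency_for_promote > 1 only means anything in
> > > > relation to the hit set configuration on the cache pool, so these are
> > > > the knobs involved; the values below are only examples, not what we run:
> > > >
> > > >     ceph osd pool get ssd-cache min_read_recency_for_promote
> > > >     ceph osd pool set ssd-cache hit_set_type bloom
> > > >     ceph osd pool set ssd-cache hit_set_count 6
> > > >     ceph osd pool set ssd-cache hit_set_period 600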
> > > >
> > > > mike
> > > >
> > > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
> > > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > > something weird happened on one of the ceph clusters that i administer
> > > > tonight which resulted in virtual machines using rbd volumes seeing
> > > > corruption in multiple forms.
> > > >
> > > > when everything was fine earlier in the day, the cluster was a number
> > > > of storage nodes spread across 3 different roots in the crush map. the
> > > > first bunch of storage nodes have both hard drives and ssds in them,
> > > > with the hard drives in one root and the ssds in another. there is a
> > > > pool for each, and the pool for the ssds is a cache tier for the hard
> > > > drives. the last set of storage nodes were in a separate root with
> > > > their own pool that is being used for burn-in testing.
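> > > > the tier relationship itself is the standard setup; the pool names
> > > > below are placeholders, not our real ones:
> > > >
> > > >     # the ssd pool fronts the hard drive pool as a cache tier
> > > >     ceph osd tier add rbd-data ssd-cache
> > > >     ceph osd tier cache-mode ssd-cache forward   # we normally run forward
> > > >     ceph osd tier set-overlay rbd-data ssd-cache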
> > > >
> > > > these nodes had run for a while with test traffic and we decided to
> > > > move them to the main root and pools. the main cluster is running
> > > > 0.94.5 and the new nodes got 0.94.6 due to them getting configured
> > > > after that was released. i removed the test pool and did a ceph osd
> > > > crush move to move the first node into the main cluster, the hard
> > > > drives into the root for that tier of storage and the ssds into the
> > > > root and pool for the cache tier. each set was done about 45 minutes
> > > > apart and they ran for a couple hours while performing backfill without
> > > > any issue other than high load on the cluster.
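> > > > the moves were plain crush moves; the host and root names here are
> > > > placeholders for our actual buckets:
> > > >
> > > >     # move the new node's hdd host bucket under the hard drive root
> > > >     ceph osd crush move node13-hdd root=spinning
> > > >     # and its ssd host bucket under the ssd root backing the cache tier
> > > >     ceph osd crush move node13-ssd root=ssd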
> > > >
> > > > we normally run the ssd tier in the forward cache-mode due to the ssds
> > > > we have not being able to keep up with the io of writeback. this
> > > > results in io on the hard drives slowly going up and performance of the
> > > > cluster starting to suffer. about once a week, i change the cache-mode
> > > > between writeback and forward for short periods of time to promote
> > > > actively used data to the cache tier. this moves io load from the hard
> > > > drive tier to the ssd tier and has been done multiple times without
> > > > issue. i normally don't do this while there are backfills or recoveries
> > > > happening on the cluster but decided to go ahead while backfill was
> > > > happening due to the high load.
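> > > > (for reference, the current cache-mode and the cache pool usage can be
> > > > checked while the flip is active with something like the following; the
> > > > grep is just a convenience:)
> > > >
> > > >     # the tier pool's line in the osd map shows its cache_mode
> > > >     ceph osd dump | grep cache_mode
> > > >     # per-pool usage, handy for watching promotions fill the tier
> > > >     ceph df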
> > > >
> > > > i tried this procedure to change the ssd cache-tier between writeback
> > > > and forward cache-mode and things seemed okay from the ceph cluster.
> > > > about 10 minutes after the first attempt at changing the mode, vms
> > > > using the ceph cluster for their storage started seeing corruption in
> > > > multiple forms. the mode was flipped back and forth multiple times in
> > > > that time frame and it's unknown if the corruption was noticed with the
> > > > first change or subsequent changes. the vms were having issues with
> > > > filesystems getting errors and being remounted RO and mysql databases
> > > > seeing corruption (both myisam and innodb). some of this was
> > > > recoverable but on some filesystems there was corruption that led to
> > > > things like lots of data ending up in lost+found and some of the
> > > > databases were un-recoverable (backups are helping there).
> > > >
> > > > i'm not sure what would have happened to cause this corruption. the
> > > > libvirt logs for the qemu processes for the vms did not provide any
> > > > output of problems from the ceph client code. it doesn't look like any
> > > > of the qemu processes had crashed. also, it has now been several hours
> > > > since this happened with no additional corruption noticed by the vms.
> > > > it doesn't appear that we had any corruption happen before i attempted
> > > > the flipping of the ssd tier cache-mode.
> > > >
> > > > the only thing i can think of that is different between this time
> > > > doing this procedure vs previous attempts is that there was the one
> > > > storage node running 0.94.6 while the remainder were running 0.94.5.
> > > > is it possible that something changed between these two releases that
> > > > would have caused problems with data consistency related to the cache
> > > > tier? or otherwise? any other thoughts or suggestions?
> > > >
> > > > thanks in advance for any help you can provide.
> > > >
> > > > mike
> > > >
> > >
> > >
> > >
> > >
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com