On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> I'm having trouble finding documentation about using ceph_test_rados. Can I
> run this on the existing cluster and will that provide useful info? It seems
> running it in the build will not have the caching set up (vstart.sh).
>
> I have accepted a job with another company and only have until Wednesday to
> help with getting information about this bug. My new job will not be using
> Ceph, so I won't be able to provide any additional info after Tuesday. I want
> to leave the company on a good trajectory for upgrading, so any input you
> can provide will be helpful.

I'm sorry to hear it!  You'll be missed.  :)

> I've found:
>
> ./ceph_test_rados --op read 100 --op write 100 --op delete 50
> --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000
> --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600
> --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op
> rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0
>
> Is that enough if I change --pool to the cached pool and do the toggling
> while ceph_test_rados is running? I think this will run for 10 minutes.

Precisely.  You can probably drop copy_from and snap ops from the list
since your workload wasn't exercising those.

Thanks!
sage

> Thanks,
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> > We are trying to figure out how to use rados bench to reproduce. Ceph
> > itself doesn't seem to think there is any corruption, but when you do a
> > verify inside the RBD, there is. Can rados bench verify the objects after
> > they are written? It also seems to be primarily the filesystem metadata
> > that is corrupted. If we fsck the volume, there is missing data (put into
> > lost+found), but if it is there it is primarily OK. There only seems to be
> > a few cases where a file's contents are corrupted. I would suspect on an
> > object boundary. We would have to look at blockinfo to map that out and see
> > if that is what is happening.
>
> 'rados bench' doesn't do validation. ceph_test_rados does, though--if you
> can reproduce with that workload then it should be pretty easy to track
> down.
>
> Thanks!
> sage
>
> > We stopped all the IO and did put the tier in writeback mode with recency
> > 1, set the recency to 2 and started the test and there was corruption, so
> > it doesn't seem to be limited to changing the mode. I don't know how that
> > patch could cause the issue either. Unless there is a bug that reads from
> > the back tier, but writes to cache tier, then the object gets promoted
> > wiping that last write, but then it seems like it should not be as much
> > corruption since the metadata should be in the cache pretty quick. We
> > usually evicted the cache before each try so we should not be evicting on
> > writeback.
> >
> > Sent from a mobile device, please excuse any typos.
> > On Mar 17, 2016 6:26 AM, "Sage Weil" <sweil@xxxxxxxxxx> wrote:
> >
> > > On Thu, 17 Mar 2016, Nick Fisk wrote:
> > > > There has got to be something else going on here. All that PR does is
> > > > to potentially delay the promotion to hit_set_period*recency instead of
> > > > just doing it on the 2nd read regardless, it's got to be uncovering
> > > > another bug.
> > > >
> > > > Do you see the same problem if the cache is in writeback mode before
> > > > you start the unpacking? I.e., is it the switching mid operation which
> > > > causes the problem? If it only happens mid operation, does it still
> > > > occur if you pause IO when you make the switch?
> > > >
> > > > Do you also see this if you perform on a RBD mount, to rule out any
> > > > librbd/qemu weirdness?
> > > >
> > > > Do you know if it’s the actual data that is getting corrupted or if
> > > > it's the FS metadata? I'm only wondering as unpacking should really
> > > > only be writing to each object a couple of times, whereas FS metadata
> > > > could potentially be being updated+read back lots of times for the same
> > > > group of objects and ordering is very important.
> > > >
> > > > Thinking through it logically the only difference is that with
> > > > recency=1 the object will be copied up to the cache tier, where
> > > > recency=6 it will be proxy read for a long time. If I had to guess I
> > > > would say the issue would lie somewhere in the proxy read +
> > > > writeback<->forward logic.
> > >
> > > That seems reasonable. Was switching from writeback -> forward always
> > > part of the sequence that resulted in corruption? Note that there is a
> > > known ordering issue when switching to forward mode. I wouldn't really
> > > expect it to bite real users but it's possible..
> > >
> > > http://tracker.ceph.com/issues/12814
> > >
> > > I've opened a ticket to track this:
> > >
> > > http://tracker.ceph.com/issues/15171
> > >
> > > What would be *really* great is if you could reproduce this with a
> > > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados
> > > running, and then find the sequence of operations that are sufficient to
> > > trigger a failure.
> > >
> > > sage
> > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > > Behalf Of Mike Lovell
> > > > > Sent: 16 March 2016 23:23
> > > > > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>; sweil@xxxxxxxxxx
> > > > > Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; William Perkins
> > > > > <william.perkins@xxxxxxxxxxxxx>
> > > > > Subject: Re: [ceph-users] data corruption with hammer
> > > > >
> > > > > just got done with a test against a build of 0.94.6 minus the two
> > > > > commits that were backported in PR 7207. everything worked as it
> > > > > should with the cache-mode set to writeback and the
> > > > > min_read_recency_for_promote set to 2. assuming it works properly on
> > > > > master, there must be a commit that we're missing on the backport to
> > > > > support this properly.
> > > > >
> > > > > sage,
> > > > > i'm adding you to the recipients on this so hopefully you see it.
> > > > > the tl;dr version is that the backport of the cache recency fix to
> > > > > hammer doesn't work right and potentially corrupts data when
> > > > > the min_read_recency_for_promote is set to greater than 1.
> > > > >
> > > > > mike
> > > > >
> > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > > > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > > > robert and i have done some further investigation the past couple
> > > > > days on this. we have a test environment with a hard drive tier and
> > > > > an ssd tier as a cache. several vms were created with volumes from
> > > > > the ceph cluster. i did a test in each guest where i un-tarred the
> > > > > linux kernel source multiple times and then did a md5sum check
> > > > > against all of the files in the resulting source tree. i started off
> > > > > with the monitors and osds running 0.94.5 and never saw any problems.
> > > > >
> > > > > a single node was then upgraded to 0.94.6 which has osds in both the
> > > > > ssd and hard drive tier. i then proceeded to run the same test and,
> > > > > while the untar and md5sum operations were running, i changed the ssd
> > > > > tier cache-mode from forward to writeback. almost immediately the vms
> > > > > started reporting io errors and odd data corruption. the remainder of
> > > > > the cluster was updated to 0.94.6, including the monitors, and the
> > > > > same thing happened.
> > > > >
> > > > > things were cleaned up and reset and then a test was run
> > > > > where min_read_recency_for_promote for the ssd cache pool was set to
> > > > > 1. we previously had it set to 6. there was never an error with the
> > > > > recency setting set to 1. i then tested with it set to 2 and it
> > > > > immediately caused failures.
> > > > > we are currently thinking that it is related to the backport of the
> > > > > fix for the recency promotion and are in progress of making a .6
> > > > > build without that backport to see if we can cause corruption. is
> > > > > anyone using a version from after the original recency fix (PR 6702)
> > > > > with a cache tier in writeback mode? anyone have a similar problem?
> > > > >
> > > > > mike
> > > > >
> > > > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
> > > > > <mike.lovell@xxxxxxxxxxxxx> wrote:
> > > > > something weird happened on one of the ceph clusters that i
> > > > > administer tonight which resulted in virtual machines using rbd
> > > > > volumes seeing corruption in multiple forms.
> > > > >
> > > > > when everything was fine earlier in the day, the cluster was a number
> > > > > of storage nodes spread across 3 different roots in the crush map.
> > > > > the first bunch of storage nodes have both hard drives and ssds in
> > > > > them with the hard drives in one root and the ssds in another. there
> > > > > is a pool for each and the pool for the ssds is a cache tier for the
> > > > > hard drives. the last set of storage nodes were in a separate root
> > > > > with their own pool that is being used for burn in testing.
> > > > >
> > > > > these nodes had run for a while with test traffic and we decided to
> > > > > move them to the main root and pools. the main cluster is running
> > > > > 0.94.5 and the new nodes got 0.94.6 due to them getting configured
> > > > > after that was released. i removed the test pool and did a ceph osd
> > > > > crush move to move the first node into the main cluster, the hard
> > > > > drives into the root for that tier of storage and the ssds into the
> > > > > root and pool for the cache tier.
> > > > > each set was done about 45 minutes apart and they ran for a couple
> > > > > hours while performing backfill without any issue other than high
> > > > > load on the cluster.
> > > > >
> > > > > we normally run the ssd tier in the forward cache-mode due to the
> > > > > ssds we have not being able to keep up with the io of writeback. this
> > > > > results in io on the hard drives slowly going up and performance of
> > > > > the cluster starting to suffer. about once a week, i change the
> > > > > cache-mode between writeback and forward for short periods of time to
> > > > > promote actively used data to the cache tier. this moves io load from
> > > > > the hard drive tier to the ssd tier and has been done multiple times
> > > > > without issue. i normally don't do this while there are backfills or
> > > > > recoveries happening on the cluster but decided to go ahead while
> > > > > backfill was happening due to the high load.
> > > > >
> > > > > i tried this procedure to change the ssd cache-tier between writeback
> > > > > and forward cache-mode and things seemed okay from the ceph cluster.
> > > > > about 10 minutes after the first attempt at changing the mode, vms
> > > > > using the ceph cluster for their storage started seeing corruption in
> > > > > multiple forms. the mode was flipped back and forth multiple times in
> > > > > that time frame and it's unknown if the corruption was noticed with
> > > > > the first change or subsequent changes. the vms were having issues of
> > > > > filesystems having errors and getting remounted RO and mysql
> > > > > databases seeing corruption (both myisam and innodb).
> > > > > some of this was recoverable but on some filesystems there was
> > > > > corruption that led to things like lots of data ending up in the
> > > > > lost+found and some of the databases were un-recoverable (backups are
> > > > > helping there).
> > > > >
> > > > > i'm not sure what would have happened to cause this corruption. the
> > > > > libvirt logs for the qemu processes for the vms did not provide any
> > > > > output of problems from the ceph client code. it doesn't look like
> > > > > any of the qemu processes had crashed. also, it has now been several
> > > > > hours since this happened with no additional corruption noticed by
> > > > > the vms. it doesn't appear that we had any corruption happen before i
> > > > > attempted the flipping of the ssd tier cache-mode.
> > > > >
> > > > > the only thing i can think of that is different between this time
> > > > > doing this procedure vs previous attempts was that there was the one
> > > > > storage node running 0.94.6 where the remainder were running 0.94.5.
> > > > > is it possible that something changed between these two releases that
> > > > > would have caused problems with data consistency related to the cache
> > > > > tier? or otherwise? any other thoughts or suggestions?
> > > > >
> > > > > thanks in advance for any help you can provide.
> > > > >
> > > > > mike
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com