Re: data corruption with hammer

robert and i have done some further investigation the past couple days on this. we have a test environment with a hard drive tier and an ssd tier as a cache. several vms were created with volumes from the ceph cluster. i did a test in each guest where i un-tarred the linux kernel source multiple times and then did a md5sum check against all of the files in the resulting source tree. i started off with the monitors and osds running 0.94.5 and never saw any problems.
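for reference, the guest-side check looks roughly like this. this is a self-contained sketch: it builds a tiny tarball in place of the kernel source, and all paths are made up; the real test un-tarred the kernel source onto a filesystem backed by an rbd volume.

```shell
# self-contained sketch of the guest-side test: un-tar a source tree several
# times, checksum one copy, and verify the other copies against it. in the
# real test the tree was the linux kernel source on an rbd-backed filesystem.
set -e
work=$(mktemp -d)
mkdir -p "$work/src"
printf 'hello\n' > "$work/src/a.txt"
printf 'world\n' > "$work/src/b.txt"
tar -cf "$work/src.tar" -C "$work" src
for i in 1 2 3; do
    mkdir -p "$work/run$i"
    tar -xf "$work/src.tar" -C "$work/run$i"
done
# checksum the first extracted copy, then verify the rest against it;
# md5sum -c exits non-zero if any file differs
(cd "$work/run1" && find . -type f -exec md5sum {} + | sort) > "$work/sums.md5"
for i in 2 3; do
    (cd "$work/run$i" && md5sum -c --quiet "$work/sums.md5")
done
echo "all copies match"
```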

a single node, which has osds in both the ssd and hard drive tiers, was then upgraded to 0.94.6. i then ran the same test and, while the untar and md5sum operations were running, changed the ssd tier cache-mode from forward to writeback. almost immediately the vms started reporting io errors and odd data corruption. the remainder of the cluster, including the monitors, was then updated to 0.94.6, and the same thing happened.

things were cleaned up and reset and then a test was run with min_read_recency_for_promote for the ssd cache pool set to 1. we previously had it set to 6. there was never an error with the recency setting at 1. i then tested with it set to 2 and it immediately caused failures. we are currently thinking that it is related to the backport of the fix for the recency promotion and are in the process of making a 0.94.6 build without that backport to see if we can still cause corruption. is anyone using a version from after the original recency fix (PR 6702) with a cache tier in writeback mode? has anyone seen a similar problem?
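for reference, the recency experiment was driven with pool settings along these lines (the pool name is made up; only the recency value changed between runs):

```
# hypothetical pool name; only min_read_recency_for_promote varied
ceph osd pool set ssd-cache min_read_recency_for_promote 1   # no errors seen
ceph osd pool set ssd-cache min_read_recency_for_promote 2   # failures reproduced
```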

mike

On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
something weird happened on one of the ceph clusters that i administer tonight which resulted in virtual machines using rbd volumes seeing corruption in multiple forms.

when everything was fine earlier in the day, the cluster was a number of storage nodes spread across 3 different roots in the crush map. the first bunch of storage nodes have both hard drives and ssds in them, with the hard drives in one root and the ssds in another. there is a pool for each, and the pool for the ssds is a cache tier for the hard drives. the last set of storage nodes was in a separate root with its own pool that is being used for burn-in testing.

these nodes had run for a while with test traffic and we decided to move them to the main root and pools. the main cluster is running 0.94.5 and the new nodes got 0.94.6 because they were configured after that release. i removed the test pool and did a ceph osd crush move to move the first node into the main cluster: the hard drives into the root for that tier of storage and the ssds into the root and pool for the cache tier. each set was moved about 45 minutes apart, and they ran for a couple hours while performing backfill without any issue other than high load on the cluster.
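the moves were along these lines (the pool, host, and root names here are made up):

```
# hypothetical names; remove the test pool, then move each tier's bucket
# into the matching root, one move per tier, about 45 minutes apart
ceph osd pool delete test test --yes-i-really-really-mean-it
ceph osd crush move node1-hdd root=spinning
ceph osd crush move node1-ssd root=ssd-cache
```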

we normally run the ssd tier in the forward cache-mode because the ssds we have can't keep up with the io of writeback. this results in io on the hard drives slowly going up and performance of the cluster starting to suffer. about once a week, i change the cache-mode between writeback and forward for short periods of time to promote actively used data to the cache tier. this moves io load from the hard drive tier to the ssd tier and has been done multiple times without issue. i normally don't do this while there are backfills or recoveries happening on the cluster, but this time i decided to go ahead while backfill was happening due to the high load.
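concretely, the weekly flip is just the tier cache-mode command run twice (the pool name is made up):

```
# hypothetical pool name; flip to writeback so actively used objects get
# promoted to the ssd tier, then flip back to forward afterwards
ceph osd tier cache-mode ssd-cache writeback
# ...wait while hot data is promoted...
ceph osd tier cache-mode ssd-cache forward
```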

i tried this procedure to change the ssd cache-tier between writeback and forward cache-mode, and things seemed okay from the ceph cluster's side. about 10 minutes after the first attempt at changing the mode, vms using the ceph cluster for their storage started seeing corruption in multiple forms. the mode was flipped back and forth multiple times in that time frame, and it's unknown whether the corruption started with the first change or a subsequent one. the vms were having issues with filesystems reporting errors and getting remounted read-only, and with mysql databases seeing corruption (both myisam and innodb). some of this was recoverable, but on some filesystems the corruption led to things like lots of data ending up in lost+found, and some of the databases were unrecoverable (backups are helping there).

i'm not sure what would have happened to cause this corruption. the libvirt logs for the qemu processes for the vms did not provide any output of problems from the ceph client code. it doesn't look like any of the qemu processes had crashed. also, it has now been several hours since this happened with no additional corruption noticed by the vms. it doesn't appear that we had any corruption happen before i attempted the flipping of the ssd tier cache-mode.

the only thing i can think of that is different between this attempt at the procedure and previous ones is that there was the one storage node running 0.94.6 while the remainder were running 0.94.5. is it possible that something changed between these two releases that would have caused problems with data consistency related to the cache tier? or otherwise? any other thoughts or suggestions?

thanks in advance for any help you can provide.

mike

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
