Re: data corruption with hammer

This tracker ticket happened to go by my eyes today:
http://tracker.ceph.com/issues/12814 . There isn't a lot of detail
there but the headline matches.
-Greg

On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Christian Balzer
>> Sent: 16 March 2016 07:08
>> To: Robert LeBlanc <robert@xxxxxxxxxxxxx>
>> Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; ceph-users <ceph-
>> users@xxxxxxxxxxxxxx>; William Perkins <william.perkins@xxxxxxxxxxxxx>
>> Subject: Re:  data corruption with hammer
>>
>>
>> Hello Robert,
>>
>> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote:
>>
>> >
>> > There are no monitors on the new node.
>> >
>> So one less possible source of confusion.
>>
>> > It doesn't look like there has been any new corruption since we
>> > stopped changing the cache modes. Upon closer inspection, some files
>> > have been changed such that binary files are now ASCII files and vice
>> > versa. These are readable ASCII files and are things like PHP or
>> > script files, or C files where ASCII files should be.
>> >
>> What would be most interesting is whether the objects containing those
>> corrupted files reside on the new OSDs (primary PG), on the old ones, or
>> on both.
>>
>> Also, what cache mode was the cluster in before the first switch
>> (writeback, I presume, from the timeline) and which one is it in now?
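>>
>> Something along these lines should answer that; the pool and image names
>> below are just placeholders for whatever the affected VMs actually use:
>>
>>   # find the object name prefix of the RBD image backing a corrupted VM
>>   rbd -p rbd info vm-disk-01 | grep block_name_prefix
>>   # map one of its objects to a PG and the OSDs currently serving it
>>   ceph osd map rbd rbd_data.<prefix>.0000000000000000
>>   # and note the current cache mode of the tier while at it
>>   ceph osd dump | grep cache_mode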
>>
>> > I've seen this type of corruption before when a SAN node misbehaved
>> > and both controllers were writing concurrently to the backend disks.
>> > The volume was only mounted by one host, but the writes were split
>> > between the controllers when it should have been active/passive.
>> >
>> > We have killed off the OSDs on the new node as a precaution and will
>> > try to replicate this in our lab.
>> >
>> > My suspicion is that it has to do with the cache promotion code update,
>> > but I'm not sure how it would have caused this.
>> >
>> While blissfully unaware of the code, I have a hard time imagining how it
>> would cause that as well.
>> Potentially a regression in the code that only triggers in one cache mode
>> and when wanting to promote something?
>>
>> Or is it actually the switching action itself, not correctly promoting
>> things as it happens, and thus referencing a stale object?
>
> I can't think of any other way the recency change would break things. Can
> the OP confirm what recency setting is being used?
>
> When you switch to writeback, if an object hasn't reached the required
> recency yet, all reads will be proxied, whereas the previous behaviour
> would have pretty much promoted all the time regardless. So unless
> something is happening where writes are getting sent to one tier in
> forward mode and then read from a different tier in WB mode, I'm out of
> ideas. I'm pretty sure the code proxies the read first and then checks
> for promotion, so I'm not even convinced that there should be any
> difference anyway.
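>
> For reference, the relevant settings can be read straight off the cache
> pool (the pool name here is just an example); it's also worth checking
> that min_read_recency_for_promote isn't larger than hit_set_count, since
> an object can never appear in more hit sets than are being kept:
>
>   ceph osd pool get ssd-cache min_read_recency_for_promote
>   ceph osd pool get ssd-cache hit_set_count
>   ceph osd pool get ssd-cache hit_set_period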
>
> I note the documentation states that in forward mode, modified objects get
> written to the backing tier; I'm not sure that sounds correct to me. But
> if that is what is happening, it could also be related to the problem.
>
> I think this might be easyish to reproduce using the get/put commands with a
> couple of objects on a test pool if anybody out there is running 94.6 on the
> whole cluster.
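>
> Roughly the sort of test I mean, assuming a test base pool with a cache
> tier already layered on top (all pool and object names are placeholders):
>
>   # v1 goes in while the tier is in writeback, so it lands in the cache
>   dd if=/dev/urandom of=obj.v1 bs=4M count=1
>   rados -p testbase put testobj obj.v1
>   # switch the tier to forward and overwrite the object with new data
>   ceph osd tier cache-mode testcache forward
>   dd if=/dev/urandom of=obj.v2 bs=4M count=1
>   rados -p testbase put testobj obj.v2
>   # switch back to writeback and read the object back
>   ceph osd tier cache-mode testcache writeback
>   rados -p testbase get testobj obj.out
>   # obj.out should match obj.v2; matching obj.v1 would mean a stale copy
>   # was served from the cache tier
>   md5sum obj.v1 obj.v2 obj.out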
>
>>
>> Christian
>>
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> > >
>> > > Hello,
>> > >
>> > > On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>> > >
>> > >> something weird happened on one of the ceph clusters that i
>> > >> administer tonight which resulted in virtual machines using rbd
>> > >> volumes seeing corruption in multiple forms.
>> > >>
>> > >> when everything was fine earlier in the day, the cluster was a
>> > >> number of storage nodes spread across 3 different roots in the
>> > >> crush map.
>> > >> the first bunch of storage nodes have both hard drives and ssds in
>> > >> them with the hard drives in one root and the ssds in another.
>> > >> there is a pool for each and the pool for the ssds is a cache tier
>> > >> for the hard drives. the last set of storage nodes were in a
>> > >> separate root with their own pool that is being used for burn-in
>> > >> testing.
>> > >>
>> > >> these nodes had run for a while with test traffic and we decided to
>> > >> move them to the main root and pools. the main cluster is running
>> > >> 0.94.5 and the new nodes got 0.94.6 due to them getting configured
>> > >> after that was released. i removed the test pool and did a ceph osd
>> > >> crush move to move the first node into the main cluster, the hard
>> > >> drives into the root for that tier of storage and the ssds into the
>> > >> root and pool for the cache tier. each set was done about 45
>> > >> minutes apart and they ran for a couple hours while performing
>> > >> backfill without any issue other than high load on the cluster.
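>> > >>
>> > >> (for reference, the moves were done with commands along these lines,
>> > >> with the host bucket and root names here being made-up examples:
>> > >>   ceph osd crush move newnode-hdd root=spinning
>> > >>   ceph osd crush move newnode-ssd root=ssd-cache
>> > >> )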
>> > >>
>> > > Since I gleaned what your setup looks like from Robert's posts and
>> > > yours, I won't say the obvious thing, as you aren't using EC pools.
>> > >
>> > >> we normally run the ssd tier in the forward cache-mode due to the
>> > >> ssds we have not being able to keep up with the io of writeback.
>> > >> this results in io on the hard drives slowly going up and
>> > >> performance of the cluster starting to suffer. about once a week, i
>> > >> change the cache-mode between writeback and forward for short
>> > >> periods of time to promote actively used data to the cache tier.
>> > >> this moves io load from the hard drive tier to the ssd tier and has
>> > >> been done multiple times without issue. i normally don't do this
>> > >> while there are backfills or recoveries happening on the cluster
>> > >> but decided to go ahead while backfill was happening due to the
>> > >> high load.
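>> > >>
>> > >> (the weekly flip is essentially just the following pair of commands,
>> > >> with our cache pool name swapped for an example one, left in
>> > >> writeback for a while before being put back:
>> > >>   ceph osd tier cache-mode ssd-cache writeback
>> > >>   ceph osd tier cache-mode ssd-cache forward
>> > >> )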
>> > >>
>> > > As you might recall, I managed to have "rados bench" break (I/O
>> > > error) when doing these switches with Firefly on my crappy test
>> > > cluster, but not with Hammer.
>> > > However I haven't done any such switches on my production cluster
>> > > with a cache tier, both because the cache pool hasn't even reached
>> > > 50% capacity after 2 weeks of pounding and because I'm sure that
>> > > everything will hold up when it comes to the first flushing.
>> > >
>> > > Maybe the extreme load (as opposed to normal VM ops) of your cluster
>> > > during the backfilling triggered the same or a similar bug.
>> > >
>> > >> i tried this procedure to change the ssd cache-tier between
>> > >> writeback and forward cache-mode and things seemed okay from the
>> > >> ceph cluster. about 10 minutes after the first attempt at changing
>> > >> the mode, vms using the ceph cluster for their storage started
>> > >> seeing corruption in multiple forms. the mode was flipped back and
>> > >> forth multiple times in that time frame and it's unknown if the
>> > >> corruption was noticed with the first change or subsequent changes.
>> > >> the vms were having issues of filesystems having errors and getting
>> > >> remounted RO and mysql databases seeing corruption (both myisam and
>> > >> innodb). some of this was recoverable but on some filesystems there
>> > >> was corruption that led to things like lots of data ending up in
>> > >> lost+found and some of the databases were un-recoverable (backups
>> > >> are helping there).
>> > >>
>> > >> i'm not sure what would have happened to cause this corruption. the
>> > >> libvirt logs for the qemu processes for the vms did not provide any
>> > >> output of problems from the ceph client code. it doesn't look like
>> > >> any of the qemu processes had crashed. also, it has now been
>> > >> several hours since this happened with no additional corruption
>> > >> noticed by the vms.
>> > >> it doesn't appear that we had any corruption happen before i
>> > >> attempted the flipping of the ssd tier cache-mode.
>> > >>
>> > >> the only thing i can think of that is different between this time
>> > >> doing this procedure vs previous attempts was that there was the
>> > >> one storage node running 0.94.6 where the remainder were running
>> > >> 0.94.5. is it possible that something changed between these two
>> > >> releases that would have caused problems with data consistency
>> > >> related to the cache tier? or otherwise? any other thoughts or
>> > >> suggestions?
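>> > >>
>> > >> (if it helps, confirming exactly which osds are on which release
>> > >> should just be a matter of:
>> > >>   ceph tell osd.* version
>> > >> )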
>> > >>
>> > > What comes to mind in terms of these 2 versions is that .6 has
>> > > working read recency, supposedly.
>> > > Which (as well as Infernalis) exposed the bug(s) when running with
>> > > EC backing pools.
>> > >
>> > > Some cache pool members acting upon the recency and others not might
>> > > confuse things, but you'd think that this is a per-OSD (PG) decision
>> > > and that objects which aren't promoted would still be handled
>> > > accordingly.
>> > >
>> > > Those new nodes had no monitors on them, right?
>> > >
>> > > Christian
>> > >> thanks in advance for any help you can provide.
>> > >>
>> > >> mike
>> > >
>> > >
>> > > --
>> > > Christian Balzer        Network/Systems Engineer
>> > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>> >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx         Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


