Re: data corruption with hammer

Nick Fisk <nick@xxxxxxxxxx> · Wed, 16 Mar 2016 09:02:27 -0000

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Balzer
> Sent: 16 March 2016 07:08
> To: Robert LeBlanc <robert@xxxxxxxxxxxxx>
> Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; ceph-users <ceph-
> users@xxxxxxxxxxxxxx>; William Perkins <william.perkins@xxxxxxxxxxxxx>
> Subject: Re:  data corruption with hammer
> 
> 
> Hello Robert,
> 
> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote:
> 
> > There are no monitors on the new node.
> >
> So one less possible source of confusion.
> 
> > It doesn't look like there has been any new corruption since we
> > stopped changing the cache modes. Upon closer inspection, some files
> > have been changed such that binary files are now ASCII files and vice
> > versa. These are readable ASCII files and are things like PHP or
> > script files. Or C files where ASCII files should be.
> >
> What would be most interesting is if the objects containing those
> corrupted files did reside on the new OSDs (primary PG) or the old ones,
> or both.
> 
> Also, what cache mode was the cluster in before the first switch
> (writeback I presume from the timeline) and which one is it in now?
> 
> > I've seen this type of corruption before when a SAN node misbehaved
> > and both controllers were writing concurrently to the backend disks.
> > The volume was only mounted by one host, but the writes were split
> > between the controllers when it should have been active/passive.
> >
> > We have killed off the OSDs on the new node as a precaution and will
> > try to replicate this in our lab.
> >
> > My suspicion is that it has to do with the cache promotion code update,
> > but I'm not sure how it would have caused this.
> >
> While blissfully unaware of the code, I have a hard time imagining how it
> would cause that as well.
> Potentially a regression in the code that only triggers in one cache mode
> and when wanting to promote something?
> 
> Or if it is actually the switching action, not correctly promoting things
> as it happens?
> And thus referencing a stale object?

I can't think of any other way the recency change could break things. Can
the OP confirm what recency setting is being used?

When you switch to writeback, if you haven't reached the required recency
yet, all reads will be proxied; the previous behaviour would have promoted
pretty much all the time regardless. So unless something is happening where
writes are getting sent to one tier in forward mode and then read from a
different tier in writeback mode, I'm out of ideas. I'm pretty sure the code
does a proxy read and then checks for promotion, so I'm not even convinced
that there should be any difference anyway.
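
To make that reasoning concrete, here is a toy model of "proxy read, then
check for promotion". It is only a sketch and not the Ceph code: the class
ToyCacheTier, its method names and the hit-set handling are all made up for
illustration, and `recency` merely stands in for the pool's recency setting.
It assumes recency 0 means "always promote", which is roughly how the old
behaviour looked from the outside.

```python
from collections import deque


class ToyCacheTier:
    """Toy model of read handling in a writeback cache tier (NOT Ceph code)."""

    def __init__(self, backing, recency, hit_set_count=4):
        self.backing = backing      # dict: object name -> data in the backing pool
        self.cache = {}             # objects that have been promoted so far
        self.recency = recency      # hypothetical stand-in for the recency setting
        # Hit sets, newest first; each records the objects read in one period.
        self.hit_sets = deque([set()], maxlen=hit_set_count)

    def rotate_hit_set(self):
        """Start a new hit-set period; the oldest one falls off the end."""
        self.hit_sets.appendleft(set())

    def _recent_enough(self, name):
        # The object must appear in each of the `recency` newest hit sets.
        recent = list(self.hit_sets)[: self.recency]
        return len(recent) == self.recency and all(name in hs for hs in recent)

    def read(self, name):
        if name in self.cache:
            self.hit_sets[0].add(name)
            return self.cache[name], "cache-hit"
        data = self.backing[name]            # proxy the read to the backing pool
        promote = self._recent_enough(name)  # then decide whether to promote
        self.hit_sets[0].add(name)
        if promote:
            self.cache[name] = data
            return data, "proxied+promoted"
        return data, "proxied"


if __name__ == "__main__":
    old = ToyCacheTier({"obj": b"payload"}, recency=0)
    print(old.read("obj"))

    new = ToyCacheTier({"obj": b"payload"}, recency=2)
    print(new.read("obj"))
    new.rotate_hit_set()
    print(new.read("obj"))
    print(new.read("obj"))
    print(new.read("obj"))
```

In this model the data handed back to the client is the same whether the
read was proxied or promoted, which is why I would not expect recency alone
to change what a client sees.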

I note the documentation states that in forward mode, modified objects get
written to the backing tier. I'm not sure that sounds correct to me, but if
that is what is happening, it could also be related to the problem.

I think this might be fairly easy to reproduce using the get/put commands
with a couple of objects on a test pool, if anybody out there is running
0.94.6 on the whole cluster.
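
A rough sketch of what such a test could look like, assuming "get/put" means
`rados put` and `rados get`, and that the cache pool is already set up as an
overlay on the base pool so client I/O goes to the base pool name. The pool
names, object names and work directory are hypothetical, and the script only
prints the commands rather than running them, so nothing here touches a
cluster. The idea is to write in forward mode, flip to writeback, read back
and compare.

```python
import hashlib
import shlex


def build_plan(base_pool, cache_pool, names, workdir):
    """Return the command sequence (as argv lists) for one forward -> writeback pass."""
    steps = [
        # Record the recency setting in use on the cache pool.
        ["ceph", "osd", "pool", "get", cache_pool, "min_read_recency_for_promote"],
        ["ceph", "osd", "tier", "cache-mode", cache_pool, "forward"],
    ]
    # Write the test objects while the tier is in forward mode.
    steps += [["rados", "-p", base_pool, "put", n, f"{workdir}/{n}.in"] for n in names]
    # Flip the mode, then read the same objects back.
    steps.append(["ceph", "osd", "tier", "cache-mode", cache_pool, "writeback"])
    steps += [["rados", "-p", base_pool, "get", n, f"{workdir}/{n}.out"] for n in names]
    return steps


def same_content(written: bytes, read_back: bytes) -> bool:
    """Compare what was written with what was read back, by SHA-256 digest."""
    return hashlib.sha256(written).digest() == hashlib.sha256(read_back).digest()


if __name__ == "__main__":
    # Hypothetical pool and object names; adjust before using on a test cluster.
    for step in build_plan("test-base", "test-cache", ["obj1", "obj2"], "/tmp/tier-test"):
        print(shlex.join(step))
```

Repeating the pass in the other direction (write in writeback, read in
forward) and while flipping the mode several times would cover the other
cases mentioned above.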

> 
> Christian
> 
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
> > >
> > >> something weird happened on one of the ceph clusters that i
> > >> administer tonight which resulted in virtual machines using rbd
> > >> volumes seeing corruption in multiple forms.
> > >>
> > >> when everything was fine earlier in the day, the cluster was a
> > >> number of storage nodes spread across 3 different roots in the crush
> > >> map. the first bunch of storage nodes have both hard drives and ssds
> > >> in them with the hard drives in one root and the ssds in another.
> > >> there is a pool for each and the pool for the ssds is a cache tier
> > >> for the hard drives. the last set of storage nodes were in a
> > >> separate root with their own pool that is being used for burn in
> > >> testing.
> > >>
> > >> these nodes had run for a while with test traffic and we decided to
> > >> move them to the main root and pools. the main cluster is running
> > >> 0.94.5 and the new nodes got 0.94.6 due to them getting configured
> > >> after that was released. i removed the test pool and did a ceph osd
> > >> crush move to move the first node into the main cluster, the hard
> > >> drives into the root for that tier of storage and the ssds into the
> > >> root and pool for the cache tier. each set was done about 45
> > >> minutes apart and they ran for a couple hours while performing
> > >> backfill without any issue other than high load on the cluster.
> > >>
> > > Since I glanced what your setup looks like from Robert's posts and
> > > yours I won't say the obvious thing, as you aren't using EC pools.
> > >
> > >> we normally run the ssd tier in the forward cache-mode due to the
> > >> ssds we have not being able to keep up with the io of writeback.
> > >> this results in io on the hard drives slowly going up and
> > >> performance of the cluster starting to suffer. about once a week, i
> > >> change the cache-mode between writeback and forward for short
> > >> periods of time to promote actively used data to the cache tier.
> > >> this moves io load from the hard drive tier to the ssd tier and has
> > >> been done multiple times without issue. i normally don't do this
> > >> while there are backfills or recoveries happening on the cluster
> > >> but decided to go ahead while backfill was happening due to the high
> > >> load.
> > >>
> > > As you might recall, I managed to have "rados bench" break (I/O
> > > error) when doing these switches with Firefly on my crappy test
> > > cluster, but not with Hammer.
> > > However I haven't done any such switches on my production cluster
> > > with a cache tier, both because the cache pool hasn't even reached
> > > 50% capacity after 2 weeks of pounding and because I'm sure that
> > > everything will hold up when it comes to the first flushing.
> > >
> > > Maybe the extreme load (as opposed to normal VM ops) of your cluster
> > > during the backfilling triggered the same or a similar bug.
> > >
> > >> i tried this procedure to change the ssd cache-tier between
> > >> writeback and forward cache-mode and things seemed okay from the
> > >> ceph cluster.
> > >> about 10 minutes after the first attempt at changing the mode, vms
> > >> using the ceph cluster for their storage started seeing corruption
> > >> in multiple forms. the mode was flipped back and forth multiple
> > >> times in that time frame and it's unknown if the corruption was
> > >> noticed with the first change or subsequent changes. the vms were
> > >> having issues of filesystems having errors and getting remounted RO
> > >> and mysql databases seeing corruption (both myisam and innodb).
> > >> some of this was recoverable but on some filesystems there was
> > >> corruption that led to things like lots of data ending up in the
> > >> lost+found and some of the databases were un-recoverable (backups
> > >> are helping there).
> > >>
> > >> i'm not sure what would have happened to cause this corruption. the
> > >> libvirt logs for the qemu processes for the vms did not provide any
> > >> output of problems from the ceph client code. it doesn't look like
> > >> any of the qemu processes had crashed. also, it has now been
> > >> several hours since this happened with no additional corruption
> > >> noticed by the vms.
> > >> it doesn't appear that we had any corruption happen before i
> > >> attempted the flipping of the ssd tier cache-mode.
> > >>
> > >> the only thing i can think of that is different between this time
> > >> doing this procedure vs previous attempts was that there was the
> > >> one storage node running 0.94.6 where the remainder were running
> > >> 0.94.5.
> > >> is it possible that something changed between these two releases
> > >> that would have caused problems with data consistency related to
> > >> the cache tier? or otherwise? any other thoughts or suggestions?
> > >>
> > > What comes to mind in terms of these 2 versions is that .6 has
> > > working read recency, supposedly.
> > > Which (as well as Infernalis) exposed the bug(s) when running with
> > > EC backing pools.
> > >
> > > Some cache pool members acting upon the recency and others not might
> > > confuse things, but you'd think that this is a per OSD (PG) thing
> > > and objects not promoted being acted upon accordingly.
> > >
> > > Those new nodes had no monitors on them, right?
> > >
> > > Christian
> > >> thanks in advance for any help you can provide.
> > >>
> > >> mike
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/