This tracker ticket happened to go by my eyes today: http://tracker.ceph.com/issues/12814 . There isn't a lot of detail there but the headline matches.
-Greg

On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Christian Balzer
>> Sent: 16 March 2016 07:08
>> To: Robert LeBlanc <robert@xxxxxxxxxxxxx>
>> Cc: Robert LeBlanc <robert.leblanc@xxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>; William Perkins <william.perkins@xxxxxxxxxxxxx>
>> Subject: Re: data corruption with hammer
>>
>> Hello Robert,
>>
>> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote:
>>
>> > There are no monitors on the new node.
>> >
>> So one less possible source of confusion.
>>
>> > It doesn't look like there has been any new corruption since we stopped changing the cache modes. Upon closer inspection, some files have been changed such that binary files are now ASCII files and vice versa. These are readable ASCII files and are things like PHP or script files, or C files where ASCII files should be.
>> >
>> What would be most interesting is whether the objects containing those corrupted files resided on the new OSDs (primary PG), on the old ones, or on both.
>>
>> Also, what cache mode was the cluster in before the first switch (writeback, I presume, from the timeline) and which one is it in now?
>>
>> > I've seen this type of corruption before when a SAN node misbehaved and both controllers were writing concurrently to the backend disks. The volume was only mounted by one host, but the writes were split between the controllers when it should have been active/passive.
>> >
>> > We have killed off the OSDs on the new node as a precaution and will try to replicate this in our lab.
>> >
>> > My suspicion is that it has to do with the cache promotion code update, but I'm not sure how it would have caused this.
>> >
>> While blissfully unaware of the code, I have a hard time imagining how it would cause that as well. Potentially a regression in the code that only triggers in one cache mode, when wanting to promote something?
>>
>> Or is it the switching action itself, not correctly promoting things as it happens, and thus referencing a stale object?
>
> I can't think of any other way the recency change would break things. Can the OP confirm what recency setting is being used?
>
> When you switch to writeback, if you haven't reached the required recency yet, all reads will be proxied; the previous behaviour would have pretty much promoted all the time regardless. So unless something is happening where writes are getting sent to one tier in forward mode and then read from a different tier in writeback mode, I'm out of ideas. I'm pretty sure the code does a proxy read and then checks for promotion, so I'm not even convinced that there should be any difference anyway.
>
> I note the documentation states that in forward mode, modified objects get written to the backing tier; I'm not sure that sounds correct to me. But if that is what is happening, it could also be related to the problem.
>
> I think this might be fairly easy to reproduce using the get/put commands with a couple of objects on a test pool, if anybody out there is running 0.94.6 on the whole cluster.
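>
> Something along these lines ought to exercise the same code path (untested sketch; "rbd" and "cache" below are placeholders for a base pool with a cache tier attached, and newer releases may want --yes-i-really-mean-it on the mode switch):
>
>   # what recency is the cache pool actually using?
>   ceph osd pool get cache min_read_recency_for_promote
>
>   # write a known object while the tier is in writeback
>   dd if=/dev/urandom of=/tmp/obj.v1 bs=4M count=1
>   rados -p rbd put testobj /tmp/obj.v1
>
>   # switch to forward, overwrite the object, then switch back to writeback
>   ceph osd tier cache-mode cache forward
>   dd if=/dev/urandom of=/tmp/obj.v2 bs=4M count=1
>   rados -p rbd put testobj /tmp/obj.v2
>   ceph osd tier cache-mode cache writeback
>
>   # read it back through the tier and compare checksums
>   rados -p rbd get testobj /tmp/obj.read
>   md5sum /tmp/obj.v2 /tmp/obj.read    # a mismatch would reproduce the report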
>
>>
>> Christian
>>
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> > >
>> > > Hello,
>> > >
>> > > On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>> > >
>> > >> something weird happened on one of the ceph clusters that i administer tonight which resulted in virtual machines using rbd volumes seeing corruption in multiple forms.
>> > >>
>> > >> when everything was fine earlier in the day, the cluster was a number of storage nodes spread across 3 different roots in the crush map. the first bunch of storage nodes have both hard drives and ssds in them, with the hard drives in one root and the ssds in another. there is a pool for each, and the pool for the ssds is a cache tier for the hard drives. the last set of storage nodes were in a separate root with their own pool that is being used for burn-in testing.
>> > >>
>> > >> these nodes had run for a while with test traffic and we decided to move them to the main root and pools. the main cluster is running 0.94.5 and the new nodes got 0.94.6 due to them getting configured after that was released. i removed the test pool and did a ceph osd crush move to move the first node into the main cluster, the hard drives into the root for that tier of storage and the ssds into the root and pool for the cache tier. each set was done about 45 minutes apart and they ran for a couple of hours while performing backfill without any issue other than high load on the cluster.
>> > >>
>> > > Since I gleaned what your setup looks like from Robert's posts and yours, I won't say the obvious thing, as you aren't using EC pools.
>> > >
>> > >> we normally run the ssd tier in the forward cache-mode due to the ssds we have not being able to keep up with the io of writeback. this results in io on the hard drives slowly going up and performance of the cluster starting to suffer. about once a week, i change the cache-mode between writeback and forward for short periods of time to promote actively used data to the cache tier. this moves io load from the hard drive tier to the ssd tier and has been done multiple times without issue.
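>> > >>
>> > >> for reference, the switch itself is just the tier cache-mode command run in both directions ("cache-pool" below stands in for the actual name of our cache pool):
>> > >>
>> > >>   ceph osd tier cache-mode cache-pool writeback
>> > >>   # ...leave it for a while so actively used objects get promoted...
>> > >>   ceph osd tier cache-mode cache-pool forward
>> > >>   ceph osd dump | grep cache-pool   # confirm the cache_mode took effect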
>> > >> i normally don't do this while there are backfills or recoveries happening on the cluster, but decided to go ahead while backfill was happening due to the high load.
>> > >>
>> > > As you might recall, I managed to have "rados bench" break (I/O error) when doing these switches with Firefly on my crappy test cluster, but not with Hammer.
>> > > However, I haven't done any such switches on my production cluster with a cache tier, both because the cache pool hasn't even reached 50% capacity after 2 weeks of pounding and because I'm sure that everything will hold up when it comes to the first flushing.
>> > >
>> > > Maybe the extreme load (as opposed to normal VM ops) of your cluster during the backfilling triggered the same or a similar bug.
>> > >
>> > >> i tried this procedure to change the ssd cache-tier between writeback and forward cache-mode and things seemed okay from the ceph cluster. about 10 minutes after the first attempt at changing the mode, vms using the ceph cluster for their storage started seeing corruption in multiple forms. the mode was flipped back and forth multiple times in that time frame and it's unknown whether the corruption was noticed with the first change or subsequent changes. the vms were having issues with filesystems getting errors and being remounted RO, and mysql databases seeing corruption (both myisam and innodb). some of this was recoverable, but on some filesystems there was corruption that led to things like lots of data ending up in lost+found, and some of the databases were unrecoverable (backups are helping there).
>> > >>
>> > >> i'm not sure what would have happened to cause this corruption. the libvirt logs for the qemu processes for the vms did not provide any output of problems from the ceph client code. it doesn't look like any of the qemu processes had crashed. also, it has now been several hours since this happened with no additional corruption noticed by the vms. it doesn't appear that we had any corruption happen before i attempted the flipping of the ssd tier cache-mode.
>> > >>
>> > >> the only thing i can think of that is different between this time doing the procedure and previous attempts is that there was one storage node running 0.94.6 where the remainder were running 0.94.5. is it possible that something changed between these two releases that would have caused problems with data consistency related to the cache tier? or otherwise? any other thoughts or suggestions?
>> > >>
>> > > What comes to mind in terms of these 2 versions is that .6 has working read recency, supposedly. Which (as well as Infernalis) exposed the bug(s) when running with EC backing pools.
>> > >
>> > > Some cache pool members acting upon the recency and others not might confuse things, but you'd think that this is a per-OSD (PG) thing, with objects that aren't promoted being acted upon accordingly.
>> > >
>> > > Those new nodes had no monitors on them, right?
>> > >
>> > > Christian
>> > >> thanks in advance for any help you can provide.
>> > >>
>> > >> mike
>> > >
>> > > --
>> > > Christian Balzer           Network/Systems Engineer
>> > > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>> >
>>
>> --
>> Christian Balzer           Network/Systems Engineer
>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com