Re: Sudden omap growth on some OSDs

Hi Greg,

I have re-introduced the OSD that was taken out (the one that used to be a primary). I have kept debug 20 logs from both the re-introduced primary and the outgoing primary. I have used ceph-post-file to upload these, tag: 5b305f94-83e2-469c-a301-7299d2279d94
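For completeness, the upload itself was nothing more exotic than pointing ceph-post-file at the two OSD logs, roughly like this (log paths and OSD IDs below are placeholders rather than the real ones):

  ceph-post-file /var/log/ceph/ceph-osd.NN.log /var/log/ceph/ceph-osd.MM.log

That's the command that printed the tag above.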

Hope this helps, let me know if you'd like me to do another test.

Thanks,

George
________________________________
From: Gregory Farnum [gfarnum@xxxxxxxxxx]
Sent: 13 December 2017 00:04
To: Vasilakakos, George (STFC,RAL,SC)
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Sudden omap growth on some OSDs



On Tue, Dec 12, 2017 at 3:36 PM <george.vasilakakos@xxxxxxxxxx> wrote:
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Date: Tuesday, 12 December 2017 at 19:24
To: "Vasilakakos, George (STFC,RAL,SC)" <george.vasilakakos@xxxxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  Sudden omap growth on some OSDs

On Tue, Dec 12, 2017 at 3:16 AM <george.vasilakakos@xxxxxxxxxx> wrote:

On 11 Dec 2017, at 18:24, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

Hmm, this does all sound odd. Have you tried just restarting the primary OSD yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more familiar with the recovery processes we're seeing here.
-Greg


Hi Greg,

I’ve tried this; no effect. Also, on Friday we tried removing an OSD (not the primary), and the OSD that was chosen to replace it has had its LevelDB grow to 7 GiB by now. Yesterday it was 5.3 GiB.
We’re not seeing any errors logged by the OSDs with the default logging level either.

Do you have any comments on the fact that the primary sees the PG’s state as being different to what the peers think?

Yes. It's super weird. :p

Now, with a new primary, I’m seeing the last peer in the set report the PG as ‘active+clean’, as does the primary, while all the others are saying it’s ‘active+clean+degraded’ (according to PG query output).

Has the last OSD in the list shrunk down its LevelDB instance?

No, the last peer has the largest one currently part of the PG at 14GiB.

If so (or even if not), I'd try restarting all the OSDs in the PG and see if that changes things.

Will try that and report back.
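For reference, the plan is roughly the following for each OSD in the PG's acting set, one at a time, letting the PG settle back to active+clean before moving on (PG ID, OSD IDs and paths here are placeholders):

  ceph pg 1.2ab query                              # note the per-peer states before/after
  systemctl restart ceph-osd@NN                    # or the equivalent for the host's init system
  du -sh /var/lib/ceph/osd/ceph-NN/current/omap    # LevelDB size on that replica (filestore layout)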

If it doesn't...well, it's about to be Christmas and Luminous saw quite a bit of change in this space, so it's unlikely to get a lot of attention. :/

Yeah, this being Kraken I doubt it will get looked into deeply.

But the next step would be to gather high-level debug logs from the OSDs in question, especially as a peering action takes place.

I’ll be re-introducing the old primary this week so maybe I’ll bump the logging levels (to what?) on these OSDs and see what they come up with.

debug osd = 20
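Either set it in ceph.conf on those hosts, e.g.

  [osd]
      debug osd = 20/20

or inject it at runtime on just the OSDs in that PG (osd.NN as a placeholder):

  ceph tell osd.NN injectargs '--debug-osd 20/20'

and drop it back down once you've captured the peering.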


Oh!
I didn't notice you previously mentioned "custom gateways using the libradosstriper". Are those backing onto this pool? What operations are they doing?
Something like repeated overwrites of EC data could definitely have symptoms similar to this (apart from the odd peering bit.)
-Greg

Think of these as using the cluster as an object store. Most of the time we’re writing something in, reading it out anywhere from zero to thousands of times (each time running stat as well), and eventually maybe deleting it. Once written, there’s no reason for it to be overwritten. They’re backing onto the EC pools (one per “tenant”), but the particular pool that this PG is part of has barely seen any use. The most used one is storing petabytes, and this one was barely reaching 100 TiB when this came up.
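If it helps to see it in API terms, the gateways do essentially nothing beyond this pattern. The snippet below is only a stripped-down sketch of the libradosstriper calls involved, with made-up pool/object names and no error handling; it is not our actual gateway code:

  /* Stripped-down sketch of the gateways' access pattern, not the real code.
   * Build against librados and libradosstriper, e.g.:
   *   cc sketch.c -lrados -lradosstriper
   * Pool and object names are made up; error handling is omitted for brevity. */
  #include <string.h>
  #include <time.h>
  #include <rados/librados.h>
  #include <radosstriper/libradosstriper.h>

  int main(void)
  {
      rados_t cluster;
      rados_ioctx_t ioctx;
      rados_striper_t striper;
      char buf[128] = "file data, striped across the EC pool";
      uint64_t size;
      time_t mtime;

      rados_create(&cluster, NULL);                          /* default client.admin */
      rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
      rados_connect(cluster);
      rados_ioctx_create(cluster, "tenant_ec_pool", &ioctx); /* the per-tenant EC pool */
      rados_striper_create(ioctx, &striper);

      /* Written in once... */
      rados_striper_write(striper, "some_file", buf, strlen(buf), 0);

      /* ...then stat'ed and read out anywhere from zero to thousands of times... */
      rados_striper_stat(striper, "some_file", &size, &mtime);
      rados_striper_read(striper, "some_file", buf, sizeof(buf), 0);

      /* ...and eventually, maybe, deleted. No overwrites in between. */
      rados_striper_remove(striper, "some_file");

      rados_striper_destroy(striper);
      rados_ioctx_destroy(ioctx);
      rados_shutdown(cluster);
      return 0;
  }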

Yeah, it would be about overwrites specifically, not just using the data. Congratulations, you've exceeded the range of even my WAGs. :/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



