From: Gregory Farnum <gfarnum@xxxxxxxxxx> Date: Tuesday, 12 December 2017 at 19:24 To: "Vasilakakos, George (STFC,RAL,SC)" <george.vasilakakos@xxxxxxxxxx> Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx> Subject: Re: Sudden omap growth on some OSDs On Tue, Dec 12, 2017 at 3:16 AM <george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>> wrote: On 11 Dec 2017, at 18:24, Gregory Farnum <gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx><mailto:gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx>>> wrote: Hmm, this does all sound odd. Have you tried just restarting the primary OSD yet? That frequently resolves transient oddities like this. If not, I'll go poke at the kraken source and one of the developers more familiar with the recovery processes we're seeing here. -Greg Hi Greg, I’ve tried this, no effect. Also, on Friday, we tried removing an OSD (not the primary), the OSD that was chosen to replace it had it’s LevelDB grow to 7GiB by now. Yesterday it was 5.3. We’re not seeing any errors logged by the OSDs with the default logging level either. Do you have any comments on the fact that the primary sees the PG’s state as being different to what the peers think? Yes. It's super weird. :p Now, with a new primary I’m seeing the last peer in the set reporting it’s ‘active+clean’, as is the primary, all other are saying it’s ‘active+clean+degraded’ (according to PG query output). Has the last OSD in the list shrunk down its LevelDB instance? No, the last peer has the largest one currently part of the PG at 14GiB. If so (or even if not), I'd try restarting all the OSDs in the PG and see if that changes things. Will try that and report back. If it doesn't...well, it's about to be Christmas and Luminous saw quite a bit of change in this space, so it's unlikely to get a lot of attention. :/ Yeah, this being Kraken I doubt it will get looked into deeply. But the next step would be to gather high-level debug logs from the OSDs in question, especially as a peering action takes place. I’ll be re-introducing the old primary this week so maybe I’ll bump the logging levels (to what?) on these OSDs and see what they come up with. Oh! I didn't notice you previously mentioned "custom gateways using the libradosstriper". Are those backing onto this pool? What operations are they doing? Something like repeated overwrites of EC data could definitely have symptoms similar to this (apart from the odd peering bit.) -Greg Think of these as using the cluster as an object store. Most of the time we’re writing something in, reading it out anywhere from zero to thousands of times (each time running stat as well) and eventually may be deleting it. Once written, there’s no reason to be overwritten. They’re backing onto the EC pools (one per “tenant”) but the particular pool that this PG is a part of has barely seen any use. The most used one is storing petabytes and this one was barely reaching 100TiB when this came up. This problem is quite weird I think. I copied a LevelDB and dumped a key list; the largest in GiB had 66% the number of keys that the average LevelDB has. The main difference with the ones that have been around for a while is that they have a lot more files that were last touched on the days when the problem started but most other LevelDBs have compacted those away and only have about 7 days old files (as opposed to 3 week old ones that the big ones keep around). The big ones do seem to do compactions, they just don’t seem to get rid of that stuff. Thanks, George On Fri, Dec 8, 2017 at 7:30 AM <george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>>> wrote: ________________________________ From: Gregory Farnum [gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx><mailto:gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx>>] Sent: 07 December 2017 21:57 To: Vasilakakos, George (STFC,RAL,SC) Cc: drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx>>; ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx><mailto:ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>> Subject: Re: Sudden omap growth on some OSDs On Thu, Dec 7, 2017 at 4:41 AM <george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>>>> wrote: ________________________________ From: Gregory Farnum [gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx><mailto:gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx>><mailto:gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx><mailto:gfarnum@xxxxxxxxxx<mailto:gfarnum@xxxxxxxxxx>>>] Sent: 06 December 2017 22:50 To: David Turner Cc: Vasilakakos, George (STFC,RAL,SC); ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx><mailto:ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>><mailto:ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx><mailto:ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>>> Subject: Re: Sudden omap growth on some OSDs On Wed, Dec 6, 2017 at 2:35 PM David Turner <drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx>><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx>>><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx>><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx><mailto:drakonstein@xxxxxxxxx<mailto:drakonstein@xxxxxxxxx>>>>> wrote: I have no proof or anything other than a hunch, but OSDs don't trim omaps unless all PGs are healthy. If this PG is actually not healthy, but the cluster doesn't realize it while these 11 involved OSDs do realize that the PG is unhealthy... You would see this exact problem. The OSDs think a PG is unhealthy so they aren't trimming their omaps while the cluster doesn't seem to be aware of it and everything else is trimming their omaps properly. I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are stored in leveldb, but they have different trimming rules. I don't know what to do about it, but I hope it helps get you (or someone else on the ML) towards a resolution. On Wed, Dec 6, 2017 at 1:59 PM <george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>>><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx><mailto:george.vasilakakos@xxxxxxxxxx<mailto:george.vasilakakos@xxxxxxxxxx>>>>> wrote: Hi ceph-users, We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour. A couple weeks ago, the LevelDBs on some our OSDs started growing large (now at around 20G size). The one thing they have in common is the 11 disks with inflating LevelDBs are all in the set for one PG in one of our pools (EC 8+3). This pool started to see use around the time the LevelDBs started inflating. Compactions are running and they do go down in size a bit but the overall trend is one of rapid growth. The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G. This PG has nothing to separate it from the others in its pool, within 5% of average number of objects per PG, no hot-spotting in terms of load, no weird states reported by ceph status. The one odd thing about it is the pg query output mentions it is active+clean, but it has a recovery state, which it enters every morning between 9 and 10am, where it mentions a "might_have_unfound" situation and having probed all other set members. A deep scrub of the PG didn't turn up anything. You need to be more specific here. What do you mean it "enters into" the recovery state every morning? Here's what PG query showed me yesterday: "recovery_state": [ { "name": "Started\/Primary\/Active", "enter_time": "2017-12-05 09:48:57.730385", "might_have_unfound": [ { "osd": "79(1)", "status": "already probed" }, { "osd": "337(9)", "status": "already probed" },... it goes on to list all peers of this OSD in that PG. IIRC that's just a normal thing when there's any kind of recovery happening — it builds up a set during peering of OSDs that might have data, in case it discovers stuff missing. OK. But this is the only PG mentioning "might_have_unfound" across the two most used pools in our cluster and it's the only one that has all of its omap dirs at sizes more than 15 times the average for the cluster. How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools? What are you using the cluster for? 2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two more 1024 PG EC pools (also same profile). Then a set of RGW auxiliary pools with 3-way replication. I'm not 100% sure but I think all of our OSDs should have a few PGs from one of the EC pools. Our rules don't make a distinction so it's probabilistic. We're using the cluster as an object store, minor RGW use and custom gateways using libradosstriper. It's also worth pointing out that an OSD in that PG was taken out of the cluster earlier today and pg query shows the following weirdness: The primary thinks it's active+clean but, in the peer_info section all peers report it is "active+undersized+degraded+remapped+backfilling". It has shown this discrepancy before between the primary thinking it's a+c and the rest of the set seeing it a+c+degraded. Again, exactly what output makes you say the primary thinks it's active+clean but the others have more complex recovery states? I've pastebin'ed a query output: https://pastebin.com/PbGSaZyF. The PG state is reported "active+clean" in the top section but the peer_info sections have it "active+undersized+degraded". I may be reading the query output wrong but this seems odd. In the query output we're showing the following for recovery state: "recovery_state": [ { "name": "Started\/Primary\/Active", "enter_time": "2017-12-07 08:41:57.850220", "might_have_unfound": [], "recovery_progress": { "backfill_targets": [], "waiting_on_backfill": [], "last_backfill_started": "MIN", "backfill_info": { "begin": "MIN", "end": "MIN", "objects": [] }, "peer_backfill_info": [], "backfills_in_flight": [], "recovering": [], "pg_backend": { "recovery_ops": [], "read_ops": [] } The cluster is now starting to manifest slow requests on the OSDs with the large LevelDBs, although not in the particular PG. Well, there have been a few causes of large LevelDBs, but given that you have degraded PGs and a bunch of EC pools, my guess is that the PG logs are getting extended thanks to the PG states. EC PG logs can be much larger than replicated ones, since EC pools need to be able to reverse the IO in those cases. So you need to get your PGs clean first and then see if the LevelDB shrinks down or not. To be expected I guess. However, the average LevelDB in this cluster is around 200MB, all OSDs in this PG are in the multi-gigabyte range with a couple pushing the 20GB mark. The numbers on this PG do not otherwise stand out, so I'm wondering what makes this PG so special. No PGs on these OSDs are reporting any other state than active+clean. Thanks, George _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com