Hi Greg,

> This does sound weird, but I also notice that in your earlier email you
> seemed to have only ~5k PGs across ~1400 OSDs, which is a pretty
> low number. You may just have a truly horrible PG balance; can you share
> more details (eg ceph osd df)?

Our distribution is pretty bad: the most-filled disk is approaching the
nearfull ratio and already holds more than twice as much data as the
cluster-wide average fill. My view is that we need to at least double the
PG count across the cluster. Here's some data: https://pastebin.com/qX0LXxid

However, I think this particular issue is down to compaction problems.
The oldest SST files in the largest LevelDBs date back to Feb 21 (the
oldest files in normal-sized LevelDBs are no more than a week old):

# du -sh /var/lib/ceph/osd/ceph-348/current/omap/
66G     /var/lib/ceph/osd/ceph-348/current/omap/

# ll -t /var/lib/ceph/osd/ceph-348/current/omap/ | tail
-rw-r--r--. 1 ceph ceph 2109703 Feb 21 01:07 013472.sst
-rw-r--r--. 1 ceph ceph 2104172 Feb 21 01:07 013470.sst
-rw-r--r--. 1 ceph ceph 2102942 Feb 21 01:07 013468.sst
-rw-r--r--. 1 ceph ceph 2102906 Feb 21 01:04 013446.sst
-rw-r--r--. 1 ceph ceph 2102977 Feb 21 01:04 013444.sst
-rw-r--r--. 1 ceph ceph 2102667 Feb 21 01:04 013442.sst
-rw-r--r--. 1 ceph ceph 2102903 Feb 21 01:04 013440.sst
-rw-r--r--. 1 ceph ceph     172 Jan  6 15:45 LOG
-rw-r--r--. 1 ceph ceph      57 Jan  6 15:45 LOG.old
-rw-r--r--. 1 ceph ceph       0 Jan  6 15:45 LOCK

The corresponding daemon has been running for a while:

# systemctl status ceph-osd@348
● ceph-osd@348.service - Ceph object storage daemon osd.348
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-03-13 14:23:27 GMT; 2 months 11 days ago

This is confirmed to be the case for the three largest LevelDBs. Given the
inflation observed while the OSD was reporting compaction operations, I
thought this might be a compaction issue.

I performed the following test: I chose osd.101, which had an
average-sized LevelDB, extracted that LevelDB and poked around a bit.

- the size of osd.101's omap directory was 25M
- it contained 99627 keys
- when compacted it went down to 15M

By comparison, osd.980's omap directory:

- was 67G in size
- contained 101773 keys
- when compacted it went down to 44M

Both omaps had similar key and value sizes.

We do not have any OSD LevelDB compaction options set in ceph.conf, so the
OSDs compact whenever they see fit, and this seems to mostly work. What's
troubling is that many of these LevelDBs go into a compaction frenzy,
where the OSD spends upwards of an hour compacting a LevelDB; during that
time the LevelDB actually explodes in size, and it then remains at that
size for at least a couple of weeks.

This seems a bit similar to http://tracker.ceph.com/issues/13990, which
Dan van der Ster pointed out, although it's not quite the same behaviour.

Is there a way we can trigger the OSD to do compaction, and/or do it
manually, and see what happens? How risky is this (this is our production
service, after all)?

Thanks,
George
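
P.S. In case it's quicker than wading through the pastebin, here is an
untested sketch of how the per-OSD spread could be summarised straight
from `ceph osd df --format json`. The JSON field names it relies on
("nodes", "utilization", "pgs") are assumptions to check against the
actual output of your release.

#!/usr/bin/env python3
# Untested sketch: summarise the per-OSD utilisation and PG spread
# reported by `ceph osd df`. Field names are assumptions -- verify
# against the JSON your release actually emits.
import json
import subprocess

out = subprocess.check_output(['ceph', 'osd', 'df', '--format', 'json'])
nodes = json.loads(out.decode())['nodes']

utils = sorted(n['utilization'] for n in nodes)   # percent used per OSD
pgs = sorted(n['pgs'] for n in nodes)             # PG count per OSD

print('OSDs: %d' % len(nodes))
print('utilization min/median/max: %.1f%% / %.1f%% / %.1f%%'
      % (utils[0], utils[len(utils) // 2], utils[-1]))
print('PGs per OSD min/median/max: %d / %d / %d'
      % (pgs[0], pgs[len(pgs) // 2], pgs[-1]))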
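
Along the same lines, an untested sketch of a per-host sweep that reports
each FileStore OSD's omap size together with the age of its oldest .sst
file, which is one way the osd.348-style outliers above could be picked
out automatically. It only assumes the directory layout shown earlier
(/var/lib/ceph/osd/ceph-*/current/omap) and needs to run as a user that
can read those paths.

#!/usr/bin/env python3
# Untested sketch: for every FileStore OSD on this host, report the omap
# directory size and the age of its oldest .sst file, to spot LevelDBs
# that have not been fully compacted in a long time.
import glob
import os
import time

for omap in sorted(glob.glob('/var/lib/ceph/osd/ceph-*/current/omap')):
    total_bytes = 0
    oldest = None
    for root, _, files in os.walk(omap):
        for name in files:
            path = os.path.join(root, name)
            total_bytes += os.path.getsize(path)
            if name.endswith('.sst'):
                mtime = os.path.getmtime(path)
                if oldest is None or mtime < oldest:
                    oldest = mtime
    age_days = (time.time() - oldest) / 86400.0 if oldest else 0.0
    print('%-50s %7.1f GB   oldest sst %6.1f days'
          % (omap, total_bytes / 1e9, age_days))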
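
Finally, a rough, untested sketch of the osd.101 / osd.980 "extract the
LevelDB and poke around" comparison. It assumes the third-party plyvel
LevelDB bindings are installed, and it must only ever be pointed at a
copy of an omap directory taken while the OSD is stopped, never at the
live store underneath a running OSD.

#!/usr/bin/env python3
# Untested sketch: count keys and key/value bytes in a *copy* of an
# omap LevelDB, then compact it so the on-disk size of the copy can be
# compared before and after (e.g. with du). Assumes the third-party
# plyvel bindings; never run this against a live omap directory.
import sys

import plyvel

path = sys.argv[1]          # e.g. a copy such as /root/omap-copy-980
db = plyvel.DB(path)

keys = 0
key_bytes = 0
value_bytes = 0
for k, v in db.iterator():
    keys += 1
    key_bytes += len(k)
    value_bytes += len(v)

print('keys: %d  key bytes: %d  value bytes: %d'
      % (keys, key_bytes, value_bytes))

# Compact the whole key range of the copy, then re-check its size on disk.
db.compact_range()
db.close()

(ceph-kvstore-tool's compact command should achieve the same offline
compaction on such a copy, if it's available in your release.)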