Hi Greg,

> This does sound weird, but I also notice that in your earlier email you
> seemed to have only ~5k PGs across ~1400 OSDs, which is a pretty
> low number. You may just have a truly horrible PG balance; can you share
> more details (eg ceph osd df)?

Our distribution is pretty bad: the most-filled disk is approaching the
nearfull ratio and already holds more than twice as much data as the
cluster-wide average fill. My view is that we need to at least double the
PG count across the cluster. Here's some data: https://pastebin.com/qX0LXxid

However, I think this particular issue is down to compaction problems.
The oldest SST files in the largest LevelDBs date back to Feb 21 (the
oldest files in normal-sized LevelDBs are no more than a week old):

# du -sh /var/lib/ceph/osd/ceph-348/current/omap/
66G     /var/lib/ceph/osd/ceph-348/current/omap/

# ll -t /var/lib/ceph/osd/ceph-348/current/omap/ | tail
-rw-r--r--. 1 ceph ceph 2109703 Feb 21 01:07 013472.sst
-rw-r--r--. 1 ceph ceph 2104172 Feb 21 01:07 013470.sst
-rw-r--r--. 1 ceph ceph 2102942 Feb 21 01:07 013468.sst
-rw-r--r--. 1 ceph ceph 2102906 Feb 21 01:04 013446.sst
-rw-r--r--. 1 ceph ceph 2102977 Feb 21 01:04 013444.sst
-rw-r--r--. 1 ceph ceph 2102667 Feb 21 01:04 013442.sst
-rw-r--r--. 1 ceph ceph 2102903 Feb 21 01:04 013440.sst
-rw-r--r--. 1 ceph ceph     172 Jan  6 15:45 LOG
-rw-r--r--. 1 ceph ceph      57 Jan  6 15:45 LOG.old
-rw-r--r--. 1 ceph ceph       0 Jan  6 15:45 LOCK

The corresponding daemon has been running for a while:

# systemctl status ceph-osd@348
● ceph-osd@348.service - Ceph object storage daemon osd.348
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-03-13 14:23:27 GMT; 2 months 11 days ago

This is confirmed to be the case for the three largest LevelDBs. Given the
inflation observed while the OSD was reporting compaction operations, I
thought this might be a compaction issue.

I performed the following test: I chose osd.101, which had an
average-sized LevelDB, extracted that LevelDB and poked around a bit.

- the size of osd.101's omap directory was 25M
- it contained 99627 keys
- when compacted it went down to 15M

By comparison, osd.980's omap directory:

- was 67G in size
- contained 101773 keys
- when compacted it went down to 44M

Both omaps had similar key and value sizes.

We do not have any OSD LevelDB compaction options set in ceph.conf, so the
OSDs compact whenever they see fit, and this seems to mostly work. What's
troubling is that many of these LevelDBs go into a compaction frenzy,
where the OSD spends upwards of an hour compacting a LevelDB; during that
time the LevelDB actually explodes in size, and it then remains at that
size for at least a couple of weeks.

This seems a bit similar to http://tracker.ceph.com/issues/13990, which
Dan van der Ster pointed out, although it's not quite the same behaviour.

Is there a way we can trigger the OSD to do compaction, and/or do it
manually, and see what happens? How risky is this (this is our production
service, after all)?

Thanks,
George
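
P.S. In case it's quicker than wading through the pastebin, here is an
untested sketch of how the per-OSD spread could be summarised straight
from `ceph osd df --format json`. The JSON field names it relies on
("nodes", "utilization", "pgs") are assumptions to check against the
actual output of your release.

#!/usr/bin/env python3
# Untested sketch: summarise the per-OSD utilisation and PG spread
# reported by `ceph osd df`. Field names are assumptions -- verify
# against the JSON your release actually emits.
import json
import subprocess

out = subprocess.check_output(['ceph', 'osd', 'df', '--format', 'json'])
nodes = json.loads(out.decode())['nodes']

utils = sorted(n['utilization'] for n in nodes)   # percent used per OSD
pgs = sorted(n['pgs'] for n in nodes)             # PG count per OSD

print('OSDs: %d' % len(nodes))
print('utilization min/median/max: %.1f%% / %.1f%% / %.1f%%'
      % (utils[0], utils[len(utils) // 2], utils[-1]))
print('PGs per OSD min/median/max: %d / %d / %d'
      % (pgs[0], pgs[len(pgs) // 2], pgs[-1]))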
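
Along the same lines, an untested sketch of a per-host sweep that reports
each FileStore OSD's omap size together with the age of its oldest .sst
file, which is one way the osd.348-style outliers above could be picked
out automatically. It only assumes the directory layout shown earlier
(/var/lib/ceph/osd/ceph-*/current/omap) and needs to run as a user that
can read those paths.

#!/usr/bin/env python3
# Untested sketch: for every FileStore OSD on this host, report the omap
# directory size and the age of its oldest .sst file, to spot LevelDBs
# that have not been fully compacted in a long time.
import glob
import os
import time

for omap in sorted(glob.glob('/var/lib/ceph/osd/ceph-*/current/omap')):
    total_bytes = 0
    oldest = None
    for root, _, files in os.walk(omap):
        for name in files:
            path = os.path.join(root, name)
            total_bytes += os.path.getsize(path)
            if name.endswith('.sst'):
                mtime = os.path.getmtime(path)
                if oldest is None or mtime < oldest:
                    oldest = mtime
    age_days = (time.time() - oldest) / 86400.0 if oldest else 0.0
    print('%-50s %7.1f GB   oldest sst %6.1f days'
          % (omap, total_bytes / 1e9, age_days))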
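
Finally, a rough, untested sketch of the osd.101 / osd.980 "extract the
LevelDB and poke around" comparison. It assumes the third-party plyvel
LevelDB bindings are installed, and it must only ever be pointed at a
copy of an omap directory taken while the OSD is stopped, never at the
live store underneath a running OSD.

#!/usr/bin/env python3
# Untested sketch: count keys and key/value bytes in a *copy* of an
# omap LevelDB, then compact it so the on-disk size of the copy can be
# compared before and after (e.g. with du). Assumes the third-party
# plyvel bindings; never run this against a live omap directory.
import sys

import plyvel

path = sys.argv[1]          # e.g. a copy such as /root/omap-copy-980
db = plyvel.DB(path)

keys = 0
key_bytes = 0
value_bytes = 0
for k, v in db.iterator():
    keys += 1
    key_bytes += len(k)
    value_bytes += len(v)

print('keys: %d  key bytes: %d  value bytes: %d'
      % (keys, key_bytes, value_bytes))

# Compact the whole key range of the copy, then re-check its size on disk.
db.compact_range()
db.close()

(ceph-kvstore-tool's compact command should achieve the same offline
compaction on such a copy, if it's available in your release.)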