Re: OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC
<tom.byrne@xxxxxxxxxx> wrote:
>
> Hi all,
>
>
>
> Some BlueStore OSDs in our Luminous test cluster have started becoming unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing of hardware destined for our production cluster, so they have held a number of pools with many, many objects in the past. All of those pools have since been deleted.
>
>
>
> When booting, the OSDs spend a few minutes *per PG* in the clear_temp_objects function, even for brand-new, empty PGs. The OSD hammers the disk throughout clear_temp_objects, with a constant ~30MB/s of reads and all available IOPS consumed. The OSD will eventually finish booting and come up fine, but will then start hammering the disk again and fall over at some later point, causing the cluster to gradually fall apart. I'm guessing something is 'not optimal' in RocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot quickly and stay up, but creating a pool will cause OSDs that get even a single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs with the WAL and RocksDB on the data disk, and I would guess they are ~1 year old. Upgrading to 12.2.12 did not change this behaviour. A BlueFS export of a problematic OSD's block device reveals a 1.5GB RocksDB (L0 - 63.80 KB, L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB), which seems excessive for an empty OSD, but this is also the first time I've looked into this, so it may be normal?
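
(For anyone wanting to reproduce that check: with the OSD stopped, a
BlueFS export can be generated with ceph-bluestore-tool. A rough
sketch, where the OSD id and output path are placeholders:

    # stop the OSD so the device is quiescent before exporting
    systemctl stop ceph-osd@0
    # copy the BlueFS contents (the embedded RocksDB) out to a directory
    ceph-bluestore-tool bluefs-export \
        --path /var/lib/ceph/osd/ceph-0 \
        --out-dir /tmp/osd.0-bluefs

The per-level sizes quoted above can then be read out of the RocksDB
LOG file in the exported db/ directory.)
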
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is acceptable for this cluster, but I'm a little concerned that a similar thing could happen on a production cluster. Ideally, I would like to understand what has happened before recreating the problematic OSDs.
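
(On Luminous, that destroy/recreate cycle is roughly the following;
the OSD id and device name are placeholders, and the exact
ceph-volume invocation will depend on how the OSDs were deployed:

    # stop the OSD and remove it from the cluster entirely
    systemctl stop ceph-osd@0
    ceph osd purge 0 --yes-i-really-mean-it
    # wipe the old device and redeploy a fresh BlueStore OSD on it
    ceph-volume lvm zap /dev/sdX --destroy
    ceph-volume lvm create --data /dev/sdX
)
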
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to dig further into this?

Have you tried a manual compaction? The only other time I've seen
this reported was for FileStore-on-ZFS, and there it was just very
slow at metadata scanning for some reason ("Hammer to Jewel Upgrade -
Extreme OSD Boot Time"). There has been at least one PR about object
listings being slow in BlueStore when there are a lot of deleted
objects, which would match up with your many deleted pools/objects.
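
A minimal sketch of an offline compaction, assuming your
ceph-kvstore-tool build has the bluestore-kv backend (the OSD id is a
placeholder, and the OSD must be stopped first):

    systemctl stop ceph-osd@0
    # compact the OSD's embedded RocksDB in place
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0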

If you have any debug logs, the BlueStore devs might be interested in
them, to check whether the most recent patches would fix it.
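
If you need to generate them, something like this in ceph.conf on the
affected host before starting the OSD should capture the boot (the
levels here are a suggestion, not a requirement):

    [osd]
        debug osd = 20
        debug bluestore = 20
        debug rocksdb = 10
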
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


