Re: OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

I hadn't tried a manual compaction, but it did the trick. The DB shrank to 50MB and the OSD booted instantly. Thanks!
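
For anyone else who hits this, what I ran was roughly the following (osd.12 is just an example; adjust the ID and data path for your layout):

  # stop the OSD, compact its rocksdb offline, then bring it back up
  systemctl stop ceph-osd@12
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
  systemctl start ceph-osd@12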

I'm still confused as to why the OSDs weren't compacting themselves, especially as the operation only took a few seconds, but for now I'm happy that this is easy to rectify if we run into it again.

I've uploaded the log of a slow boot with debug_bluestore turned up [1], and I can provide other logs/files if anyone thinks they could be useful.
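
For context, the logging was turned up with something like the below in ceph.conf on the OSD host (so it was in effect for the whole boot), then the OSD was restarted:

  [osd]
      debug bluestore = 20/20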

Cheers,
Tom
 
[1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446

-----Original Message-----
From: Gregory Farnum <gfarnum@xxxxxxxxxx> 
Sent: 24 June 2019 17:30
To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC <tom.byrne@xxxxxxxxxx> wrote:
>
> Hi all,
>
> Some bluestore OSDs in our Luminous test cluster have started becoming unresponsive and booting very slowly.
>
> These OSDs have been used for stress-testing hardware destined for our production cluster, so have had a number of pools on them with many, many objects in the past. All these pools have since been deleted.
>
> When booting, the OSDs spend a few minutes *per PG* in the clear_temp_objects function, even for brand new, empty PGs. The OSD hammers the disk throughout clear_temp_objects, with a constant ~30MB/s of reads and all available IOPS consumed. It will finish booting and come up fine, but will then start hammering the disk again and fall over at some point later, causing the cluster to gradually fall apart. I'm guessing something is 'not optimal' in the rocksDB.
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot quickly and stay up, but creating a pool will cause OSDs that get even a single PG to start exhibiting this behaviour again.
>
> These are HDD OSDs, with the WAL and rocksDB on the same disk. I would guess they are ~1yr old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB), which seems excessive for an empty OSD, but this is also the first time I've looked into this, so it may be normal?
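>
> For reference, the export was along these lines (OSD ID and paths are just examples), with the per-level sizes taken from the rocksdb LOG in the exported db directory:
>
>   # dump the OSD's BlueFS contents (rocksdb lives under db/) and check its size
>   ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-12 --out-dir /tmp/osd-12-bluefs
>   du -sh /tmp/osd-12-bluefs/db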
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is acceptable for this cluster, but I'm a little concerned that a similar thing could happen on a production cluster. Ideally, I would like to try to understand what has happened before recreating the problematic OSDs.
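>
> For completeness, the recreate step has been roughly the standard replacement flow (osd.12 and /dev/sdX are examples):
>
>   # remove the OSD but keep its ID, wipe the disk, then redeploy onto it
>   systemctl stop ceph-osd@12
>   ceph osd destroy 12 --yes-i-really-mean-it
>   ceph-volume lvm zap /dev/sdX --destroy
>   ceph-volume lvm create --osd-id 12 --data /dev/sdX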
>
> Has anyone got any thoughts on what might have happened, or tips on how to dig further into this?

Have you tried a manual compaction? The only other time I've seen this reported was for FileStore-on-ZFS, and it was just very slow at metadata scanning for some reason ("Hammer to Jewel Upgrade - Extreme OSD Boot Time"). There has been at least one PR about object listings being slow in BlueStore when there are a lot of deleted objects, which would match up with your many deleted pools/objects.

If you have any debug logs, the BlueStore devs might be interested in them to check whether the most recent patches will fix it.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


