Re: bluestore_cache_size_ssd and bluestore_cache_size_hdd default values

Hi,

On my SSD cluster (1.6TB Intel S3610), I'm seeing about 1GB of RSS per OSD with filestore vs 7.5-8.5GB with bluestore (default ceph.conf, no tuning).

Currently I restart my OSDs every two weeks to avoid running out of memory.

Is this normal? I'm way above 3GB of memory per OSD.
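
As a workaround I'm considering capping the cache explicitly instead of restarting. A rough sketch of what I have in mind for ceph.conf (option name from the subject of this thread; the 1GB value is just an example, not a recommendation, and as far as I know it only takes effect after an OSD restart):

[osd]
# cap the bluestore cache for ssd-backed osds (example value: 1 GiB)
bluestore_cache_size_ssd = 1073741824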


filestore jewel
---------------
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph      48957 13.9  1.6 2984276 1097996 ?     Ssl   2017 65408:50 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph


bluestore luminous
------------------

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ceph     1718009  2.5 11.7 8542012 7725992 ?     Ssl   2017 2463:28 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

# ceph daemon osd.5 dump_mempools 
{
    "bloom_filter": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_alloc": {
        "items": 98449088,
        "bytes": 98449088
    },
    "bluestore_cache_data": {
        "items": 759,
        "bytes": 17276928
    },
    "bluestore_cache_onode": {
        "items": 884140,
        "bytes": 594142080
    },
    "bluestore_cache_other": {
        "items": 116375567,
        "bytes": 2072801299
    },
    "bluestore_fsck": {
        "items": 0,
        "bytes": 0
    },
    "bluestore_txc": {
        "items": 6,
        "bytes": 4320
    },
    "bluestore_writing_deferred": {
        "items": 99,
        "bytes": 1190045
    },
    "bluestore_writing": {
        "items": 11,
        "bytes": 4510159
    },
    "bluefs": {
        "items": 1202,
        "bytes": 64136
    },
    "buffer_anon": {
        "items": 76863,
        "bytes": 21327234
    },
    "buffer_meta": {
        "items": 910,
        "bytes": 80080
    },
    "osd": {
        "items": 328,
        "bytes": 3956992
    },
    "osd_mapbl": {
        "items": 0,
        "bytes": 0
    },
    "osd_pglog": {
        "items": 1118050,
        "bytes": 286277600
    },
    "osdmap": {
        "items": 6073,
        "bytes": 551872
    },
    "osdmap_mapping": {
        "items": 0,
        "bytes": 0
    },
    "pgmap": {
        "items": 0,
        "bytes": 0
    },
    "mds_co": {
        "items": 0,
        "bytes": 0
    },
    "unittest_1": {
        "items": 0,
        "bytes": 0
    },
    "unittest_2": {
        "items": 0,
        "bytes": 0
    },
    "total": {
        "items": 216913096,
        "bytes": 3100631833
    }
}
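
For reference, here's roughly how I compare the mempool total against the process RSS (a quick sketch, assuming the JSON layout above and that python is available on the node):

# ceph daemon osd.5 dump_mempools \
    | python -c 'import json,sys; d=json.load(sys.stdin); print("%.2f GB tracked by mempools" % (d["total"]["bytes"]/1e9))'
# ps -C ceph-osd -o pid=,rss=

(ps reports RSS in KiB.) So the mempools here account for ~3.1GB while the RSS is ~7.9GB; whatever makes up the difference is not tracked by the mempool accounting.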


----- Original Message -----
From: "Mark Nelson" <mark.a.nelson@xxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxxx>, "Wido den Hollander" <wido@xxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, March 16, 2018 17:41:03
Subject: Re: bluestore_cache_size_ssd and bluestore_cache_size_hdd default values

On 03/16/2018 11:08 AM, Sage Weil wrote: 
> On Fri, 16 Mar 2018, Wido den Hollander wrote: 
>> Hi, 
>> 
>> The config values bluestore_cache_size_ssd and bluestore_cache_size_hdd 
>> determine how much memory an OSD running with BlueStore will use for caching. 
>> 
>> By default the values are: 
>> 
>> bluestore_cache_size_ssd = 3GB 
>> bluestore_cache_size_hdd = 1GB 
>> 
>> I've seen some cases recently where users migrated from FileStore to 
>> BlueStore and had the OOM-killer come along during backfill/recovery 
>> situations. These are the situations where OSDs require more memory. 
>> 
>> It's not uncommon to find servers with: 
>> 
>> - 8 SSDs and 32GB RAM 
>> - 16 SSDs and 64GB RAM 
>> 
>> With FileStore it was sufficient since the page cache did all the work, 
>> but with BlueStore each OSD has its own cache, which isn't shared. 
>> 
>> In addition there is the regular memory consumption and the overhead of 
>> the cache. 
>> 
>> I also don't understand the reasoning behind the values. As HDDs are slower 
>> they usually require more cache than SSDs, so I'd expect the values to be 
>> flipped. 
>> 
>> My recommendation would be to lower the value to 1GB to prevent users 
>> from having a bad experience when going from FileStore to BlueStore. 
>> 
>> I have created a pull request for this: 
>> https://github.com/ceph/ceph/pull/20940 
>> 
>> Opinions, experiences, feedback? 
> 
> The thinking was that bluestore requires some deliberate consideration 
> and tuning of the cache size, so we may as well pick defaults that make 
> sense. Since the admin is doing the filestore -> bluestore conversion, 
> that is the point where they consider the memory requirement and adjust 
> the config as necessary. 
> 
> As for why the defaults are different, the SSDs need a larger cache to 
> capture the SSD performance, and the nodes that have them are likely to be 
> "higher end" and have more memory. The idea is the minimize the number 
> of people that will need to adjust their config. 
> 
> Perhaps the missing piece here is that the filestore->bluestore conversion 
> doc should have a section about memory requirements and tuning 
> bluestore_cache_size accordingly? If we just reduce the default to 
> satisfy the lowest common denominator we'll kill performance for the 
> majority that has more memory. 

On a side note, we are not currently enforcing a hard cap on rocksdb 
block cache usage. During certain test scenarios, I've observed the 
block cache exceeding the soft cap during compaction. I suspect this is 
primarily an issue when dealing with very fast storage and very low 
memory, but it may contribute to scenarios where folks are going OOM on 
low memory configurations. 

Mark 

> 
> sage 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


