Re: Nautilus: (Minority of) OSDs with huge buffer_anon usage - triggering OOMkiller in worst cases.


 



Hi Sam and All,


Adam did some digging and we've got a preliminary theory.  Last summer we changed the way the bluestore cache does trimming.  Previously we used the mempool thread in bluestore to periodically trim the bluestore caches every 50ms or so.  At the time we would also figure out how many onodes should be kept around based on the overall size of the onodes, extents, blobs, etc.  This was slow and prone to temporary memory spikes, but it was reliable in that you had a regular trim with a synchronized space calculation, so long as it could obtain the cache shard locks.
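
Roughly, the old flow looked something like the sketch below (simplified C++ for illustration only, not the actual BlueStore code; all of the names and numbers here are made up):

    // Old behaviour (sketch): a background thread wakes every ~50ms, derives a
    // target onode count from the measured byte footprint, and trims each cache
    // shard under its lock.  Size calculation and trimming are coupled, so
    // metadata growth between wakeups is bounded by the wakeup interval.
    #include <chrono>
    #include <cstddef>
    #include <list>
    #include <mutex>
    #include <thread>

    struct CacheShard {
        std::mutex lock;
        std::list<int> onodes;        // stand-in for cached onodes (LRU order)
        std::size_t meta_bytes = 0;   // onodes + extents + blobs, etc.
    };

    void mempool_trim_loop(CacheShard& shard, std::size_t target_bytes, bool& stop) {
        while (!stop) {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            std::lock_guard<std::mutex> g(shard.lock);
            if (shard.onodes.empty()) continue;
            // Derive how many onodes fit in the byte budget, then evict the rest.
            std::size_t per_onode = shard.meta_bytes / shard.onodes.size();
            std::size_t max_onodes = per_onode ? target_bytes / per_onode : shard.onodes.size();
            while (shard.onodes.size() > max_onodes) {
                shard.onodes.pop_back();          // evict coldest entry
                shard.meta_bytes -= per_onode;    // approximate accounting
            }
        }
    }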


We changed it so that we trim-on-write in this PR: https://github.com/ceph/ceph/pull/28597.  This has a couple of advantages.  We already hold the cache shard lock for the insertion anyway, so doing a trim at that time is relatively cheap.  It also lets us make sure that we trim off however many bytes we need to store the new onode (so in typical cases we have fewer temporary spikes).  Finally, we were able to split the buffer and onode caches into their own entities so that we can trim those caches individually and avoid some lock contention.
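
A much-simplified sketch of the trim-on-write idea (illustrative only; the real change is more involved and the names below are invented):

    // New behaviour (sketch): the insert path already holds the shard lock, so
    // it trims right there, and the onode and buffer caches are separate
    // entities with their own limits so trimming one doesn't contend on the
    // other's lock.
    #include <cstddef>
    #include <list>
    #include <mutex>

    template <typename T>
    struct LruCache {
        std::mutex lock;
        std::list<T> entries;        // front = hottest, back = coldest
        std::size_t num_max = 0;     // note: a *count* limit for onodes (see issue 1)

        void insert(T v) {
            std::lock_guard<std::mutex> g(lock);
            entries.push_front(std::move(v));
            // Trim-on-write: evict coldest entries while we are over the limit,
            // using the lock we already hold for the insertion.
            while (entries.size() > num_max)
                entries.pop_back();
        }
    };

    // Split caches: each can be trimmed independently.
    struct Shard {
        LruCache<int> onode_cache;   // stand-in for onodes
        LruCache<int> buffer_cache;  // stand-in for data buffers
    };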


Adam uncovered two issues:

1) Since we de-coupled the act of trimming from the onode size calculation, it's now possible to hit corner cases where you create a bunch of empty objects that have tiny onodes, then never touch the onode cache again since those onodes are already there (i.e. no more new insertions).  In the meantime, the metadata size of those objects increases as data is written (more extents, blobs, etc.), but since we aren't actually inserting any new onodes, we don't end up trimming anything.  Ultimately all of this stems from the fact that the onode cache doesn't really know how much memory is used when any given onode is cached.  It just knows how many onodes are in it and the maximum number of onodes it should keep in the cache.  The reason we didn't see this before is that usually the average size of onodes shouldn't change dramatically, and any new onode insertion into the cache will trigger a trim.  But...
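
To make 1) concrete, here's a toy standalone illustration (the numbers and names are made up, nothing here is real Ceph code): once the onodes are cached, growing their metadata never triggers a trim because nothing new is inserted and the limit is a count, not bytes.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Onode { std::size_t meta_bytes; };

    int main() {
        const std::size_t max_onodes = 10000;   // count-based limit
        std::vector<Onode> cache;

        // Phase 1: create many empty objects -> tiny onodes.  Trim-on-write
        // keeps us at the count limit and everything looks fine.
        for (std::size_t i = 0; i < max_onodes; ++i)
            cache.push_back(Onode{300});         // rough bytes for an empty onode

        // Phase 2: write data to the same objects.  Extents/blobs grow, but
        // there are no new insertions, so nothing ever triggers a trim.
        for (auto& o : cache)
            o.meta_bytes += 100000;              // hypothetical growth per object

        std::size_t total = 0;
        for (auto& o : cache) total += o.meta_bytes;
        std::printf("onodes=%zu total_meta=%zu bytes (never trimmed)\n",
                    cache.size(), total);
    }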


2) Adam ran some tests with compression and saw a case where a 32MB object appeared to be using 1.1MB of memory for metadata (normally we would expect on the order of bytes or kilobytes).  If that's true, it may, in conjunction with 1), be the reason people are seeing such massive growth here.  Typically we use far less memory for object metadata, so this is likely a bug.


A very quick and easy way to fix 1) would be to just insert a trim every time we recalculate the number of onodes to cache.  I have mixed feelings about that as it sort of puts us back in the position of doing a periodic background trim.  Maybe a slightly better option would be to do the calculation at a high frequency and then issue a background trim only if we exceed a threshold.  A more complete solution would be to trigger the metadata resize calculation every time an object is modified and then trigger the onode trim once we've exceeded a threshold.  That would be a better fit for the new model, where we are trying to avoid letting things grow unbounded between mempool thread cycles, though it would probably be a bit more work.
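
As a rough sketch of what that last option might look like (this is not a patch; the names and the threshold are invented), every modification would update the cache's byte accounting and only kick off a trim once a byte threshold is exceeded:

    #include <cstddef>
    #include <list>
    #include <mutex>

    struct OnodeCache {
        std::mutex lock;
        std::list<std::size_t> onode_sizes;     // per-onode metadata bytes, LRU order
        std::size_t total_bytes = 0;
        std::size_t trim_threshold = 64 << 20;  // example threshold, not a real default

        // Called from the write path whenever an object's metadata size changes,
        // even if no new onode is inserted.
        void note_resize(std::list<std::size_t>::iterator it, std::size_t new_bytes) {
            std::lock_guard<std::mutex> g(lock);
            total_bytes = total_bytes - *it + new_bytes;
            *it = new_bytes;
            if (total_bytes > trim_threshold)
                trim_locked();
        }

        // Evict coldest entries until we're back under the byte threshold.
        void trim_locked() {
            while (total_bytes > trim_threshold && !onode_sizes.empty()) {
                total_bytes -= onode_sizes.back();
                onode_sizes.pop_back();
            }
        }
    };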


Mark


On 5/21/20 8:18 AM, Mark Nelson wrote:
Hi Sam,


I saw your comment in the other thread but wanted to reply here since you provided the mempool and perf counters.  It looks like the priority cache is (like in Harald's case) shrinking all of the caches to their smallest values trying to compensate for all of the stuff collecting in buffer_anon.  Notice how there are only ~8000 items in the onode cache and 127 items in the data cache. This is just another indication that something isn't being cleaned up properly in buffer_anon.


I don't see a new tracker ticket from Harald; would you mind creating one for this and including the relevant information from your cluster?  That would be most helpful: https://tracker.ceph.com/
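
For reference, both dumps can usually be regenerated on the OSD host via the admin socket, e.g. "ceph daemon osd.<id> dump_mempools" and "ceph daemon osd.<id> perf dump", so they should be easy to attach to the ticket.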


On a side note, we haven't seen this in our test framework so there must be some specific combination of workload and settings causing it.


Thanks,

Mark


On 5/21/20 5:28 AM, aoanla@xxxxxxxxx wrote:
Hi,

Following on from various woes, we see an odd and unhelpful behaviour with some OSDs on our cluster currently. A minority of OSDs seem to have runaway memory usage, rising to 10s of GB, whilst other OSDs on the same host behave sensibly. This started when we moved from Mimic -> Nautilus, as far as we can tell.

In the best case, this causes some nodes to start swapping [and reduces their performance]. In the worst case, it triggers the OOMkiller.

I have dumped the mempool for these OSDs, which shows that almost all of the memory is in the buffer_anon pool. The perf dump shows that the OSD is targeting the 4GB limit that's set for it, but for some reason is unable to reach it due to stuff in the priority cache (which seems to be mostly what is filling buffer_anon).

Can anyone advise on what we should do next?

(mempool dump and excerpt of perf dump at end of email).

Thanks for any help,

Sam Skipsey

MEMPOOL DUMP
{
     "mempool": {
         "by_pool": {
             "bloom_filter": {
                 "items": 0,
                 "bytes": 0
             },
             "bluestore_alloc": {
                 "items": 5629372,
                 "bytes": 45034976
             },
             "bluestore_cache_data": {
                 "items": 127,
                 "bytes": 65675264
             },
             "bluestore_cache_onode": {
                 "items": 8275,
                 "bytes": 4634000
             },
             "bluestore_cache_other": {
                 "items": 2967913,
                 "bytes": 62469216
             },
             "bluestore_fsck": {
                 "items": 0,
                 "bytes": 0
             },
             "bluestore_txc": {
                 "items": 145,
                 "bytes": 100920
             },
             "bluestore_writing_deferred": {
                 "items": 335,
                 "bytes": 13160884
             },
             "bluestore_writing": {
                 "items": 1406,
                 "bytes": 5379120
             },
             "bluefs": {
                 "items": 1105,
                 "bytes": 24376
             },
             "buffer_anon": {
                 "items": 13705143,
                 "bytes": 40719040439
             },
             "buffer_meta": {
                 "items": 6820143,
                 "bytes": 600172584
             },
             "osd": {
                 "items": 96,
                 "bytes": 1138176
             },
             "osd_mapbl": {
                 "items": 59,
                 "bytes": 7022524
             },
             "osd_pglog": {
                 "items": 491049,
                 "bytes": 156701043
             },
             "osdmap": {
                 "items": 107885,
                 "bytes": 1723616
             },
             "osdmap_mapping": {
                 "items": 0,
                 "bytes": 0
             },
             "pgmap": {
                 "items": 0,
                 "bytes": 0
             },
             "mds_co": {
                 "items": 0,
                 "bytes": 0
             },
             "unittest_1": {
                 "items": 0,
                 "bytes": 0
             },
             "unittest_2": {
                 "items": 0,
                 "bytes": 0
             }
         },
         "total": {
             "items": 29733053,
             "bytes": 41682277138
         }
     }
}

PERF DUMP excerpt:

"prioritycache": {
         "target_bytes": 4294967296,
         "mapped_bytes": 38466584576,
         "unmapped_bytes": 425984,
         "heap_bytes": 38467010560,
         "cache_bytes": 134217728
     },
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
