Re: OSDs get killed by OOM when other host goes down

Yeah, if it's not memory reported by the mempools, that means it's something we aren't tracking. Perhaps temporary allocations in some dark corner of the code, or possibly rocksdb (though 38GB of RAM is obviously excessive). Heap stats are a good idea. If neither the heap stats nor the mempool stats are helpful (and if debug bluestore = 5 and debug prioritycache = 5 don't indicate any obvious problems with the autotuning code), it may require valgrind or some other method to figure out where the memory is going. If the memory is growing rapidly, wallclock profiling may help if you can catch where the allocations are being made.
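
For reference, a sketch of how the data points listed above could be gathered (osd.12 is a placeholder id, and these are standard ceph config / admin-socket calls rather than anything specific to this cluster; the "ceph daemon" calls have to run on the host, or inside the container, that owns the OSD):

    # Raise the logging mentioned above, then watch the OSD log for autotuner output
    ceph config set osd.12 debug_bluestore 5/5
    ceph config set osd.12 debug_prioritycache 5/5

    # Snapshot tcmalloc's view of the heap next to the mempool accounting
    ceph daemon osd.12 heap stats
    ceph daemon osd.12 dump_mempools

    # Drop the logging back to defaults once enough has been captured
    ceph config rm osd.12 debug_bluestore
    ceph config rm osd.12 debug_prioritycache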


Mark


On 11/16/21 2:42 PM, Josh Baergen wrote:
Hi Marius,

> However, docker stats reports 38GB for that container.
> There is a huge gap between what RAM is being used by the container and
> what ceph daemon osd.xxx dump_mempools reports.
Take a look at "ceph daemon osd.XX heap stats" and see what it says.
You might try "ceph daemon osd.XX heap release"; I didn't think that
was supposed to be necessary with BlueStore, though. This is reaching
the end of the sort of problems I know how to track down, so maybe
others have some ideas.
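
A minimal sketch of those two calls (osd.XX is the placeholder from above; as with dump_mempools, they go through the admin socket on the OSD's host or inside its container):

    ceph daemon osd.XX heap stats     # tcmalloc's totals, to compare against dump_mempools
    ceph daemon osd.XX heap release   # ask tcmalloc to return freed pages to the OS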

> How can I check if trim happens?
I'm not sure how to dig into this, but if your "up" count = your "in"
count in "ceph -s", it should be trimming.
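
(A hedged one-liner for checking that, assuming a standard deployment; it shows the same numbers "ceph -s" reports, just without the rest of the status output:

    ceph osd stat    # prints the osd up/in counts; equal counts are the condition described above
)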

Josh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

