Re: OSDs taking too much memory, for pglog

Wido den Hollander <wido@xxxxxxxx> · Mon, 18 May 2020 08:44:06 +0200

On 5/17/20 4:49 PM, Harald Staub wrote:
> tl;dr: this cluster is up again, thank you all (Mark, Wout, Paul
> Emmerich off-list)!
> 

Awesome!

> First we tried to lower max- and min_pg_log_entries on a single running
> OSD, without and with restarting it. There was no effect. Maybe because
> of the unclean state of the cluster.
> 
> Then we tried ceph-objectstore-tool trim-pg-log on an offline OSD. This
> has to be called per PG that is stored on the OSD. At first it seemed to
> be much too slow, took around 20 minutes. But the following PGs were
> much faster (like 1 minute). The trim part of the command was always
> fast, but the compaction part took a long time the first time.
> 
> CEPH_ARGS="--osd-min-pg-log-entries=1500 --osd-max-pg-log-entries=1500" \
>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD \
>   --pgid $pg --op trim-pg-log
> 
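
Since trim-pg-log has to be called once per PG, the procedure could be wrapped in a small loop. What follows is only a sketch, not what was actually run: the function names, the DRY_RUN switch and the default OSD id are made up for illustration, and it assumes the OSD is stopped. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: trim the PG log of every PG stored on one *stopped* OSD.
OSD=${OSD:-0}              # hypothetical OSD id
ENTRIES=${ENTRIES:-1500}   # value used in the quoted command
DATA_PATH="/var/lib/ceph/osd/ceph-$OSD"

trim_one() {
    # $1 = PG id. Prints the command when DRY_RUN=1 (default), runs it otherwise.
    args="--osd-min-pg-log-entries=$ENTRIES --osd-max-pg-log-entries=$ENTRIES"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "CEPH_ARGS=\"$args\" ceph-objectstore-tool --data-path $DATA_PATH --pgid $1 --op trim-pg-log"
    else
        CEPH_ARGS="$args" ceph-objectstore-tool --data-path "$DATA_PATH" \
            --pgid "$1" --op trim-pg-log </dev/null
    fi
}

trim_all() {
    # Reads one PG id per line from stdin and handles each in turn.
    while read -r pg; do
        [ -n "$pg" ] || continue
        trim_one "$pg"
    done
}
```

The PG ids can be fed in from the tool's own listing, for example `ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD --op list-pgs | trim_all`, and `DRY_RUN=0` switches from printing to executing.
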
> Thanks to these pglog trimmings, memory consumption was reduced and we
> could bring up all OSDs. Then recovery was quite fast, no big
> backfilling. We checked a bit later in the evening, there was plenty of
> free RAM.
> 
> Next morning, free memory was very tight again, although it looked
> different. dump_mempools showed buffer_anon as the biggest pool (should
> this not be tuned down by "osd memory target"?), but also osd_pglog
> (although all PGs were active+clean?).
> 
> Soon the OOM killer struck again, and we treated this OSD with
> trim-pg-log to bring it back.
> 
> Then we decided to try again to reduce the pg_log parameters, cluster
> wide (from default 3000 to 2000). This time it worked, memory was
> released :-)
> 
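
A rough back-of-the-envelope estimate of what the 3000 to 2000 change can free per OSD, using the dump_mempools figures from the original post further down the thread. It treats each osd_pglog mempool item as roughly one log entry and assumes every PG sits at the cap, both of which are simplifying assumptions, so the result is an order of magnitude, not a measurement.

```shell
#!/bin/sh
# Figures quoted in the thread
PGLOG_BYTES=8445595868   # "bytes" of osd_pglog in dump_mempools
PGLOG_ITEMS=461859       # "items" of osd_pglog in dump_mempools
PGS_PER_OSD=123          # maximum PGs per OSD reported in the thread

# Average size of one osd_pglog item (integer division)
bytes_per_item() { echo $(( PGLOG_BYTES / PGLOG_ITEMS )); }

# Rough saving per OSD when the per-PG cap drops from $1 to $2 entries
saving_bytes() { echo $(( ($1 - $2) * PGS_PER_OSD * $(bytes_per_item) )); }

echo "bytes per pglog item: $(bytes_per_item)"
echo "estimated saving 3000 -> 2000: $(saving_bytes 3000 2000) bytes"
```

That works out to about 18 KB per item and roughly 2.2 GB per OSD for the worst case of 123 full PGs.
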
> Then we added some RAM to get more to the safe side.
> 
> Some more background. As already mentioned, the number of PGs per OSD is
> ok, but there are a lot of small objects (nearly 1 billion), mostly S3,
> in an EC pool 8+3. So the number of objects lying on the OSDs
> (chunks? shards?) is about 10 billion in total. Per OSD (510 of type
> hdd) this is probably quite a lot. Maybe also a reason for high pglog
> demand. And it is not equally distributed, HDDs are 4 TB and 8 TB.
> 

Small files/objects are always a problem. They were when I was still
fiddling with NFS servers that stored PHP websites, and they still are
in modern systems.

Each object becomes an entry in BlueStore's (Rocks)DB and that can cause
all kinds of slowdowns and other issues.

I would always advise setting quotas on systems to prevent unbounded
growth of small objects in Ceph. CephFS and RGW both have such mechanisms.
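
As a minimal sketch of what such quotas might look like: the user id, mount path and limits below are hypothetical placeholders, and the helper functions only print the commands so they can be reviewed before being run against a real cluster.

```shell
#!/bin/sh
# Sketch: object-count quotas for RGW and CephFS (values are examples only).

rgw_quota_cmds() {
    # $1 = RGW user id, $2 = maximum number of objects for that user
    printf 'radosgw-admin quota set --quota-scope=user --uid=%s --max-objects=%s\n' "$1" "$2"
    printf 'radosgw-admin quota enable --quota-scope=user --uid=%s\n' "$1"
}

cephfs_quota_cmd() {
    # $1 = directory inside a mounted CephFS, $2 = maximum number of files
    printf 'setfattr -n ceph.quota.max_files -v %s %s\n' "$2" "$1"
}

rgw_quota_cmds exampleuser 10000000
cephfs_quota_cmd /mnt/cephfs/projects 1000000
```

For RGW the quota has to be both set and enabled; for CephFS the limit is an extended attribute on the directory.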

> Another point: the DB devices lie on SSDs. But they are too small
> nowadays, the sizing was done years ago, for Filestore.
> 

~30GB per OSD is sufficient at the moment with RocksDB's settings. The
next step is 300GB, see:
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#levels-target-size
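
Where the 30GB and 300GB steps come from can be sketched with a bit of arithmetic. This assumes a level base size of 256 MiB and a level multiplier of 10, which I believe are the settings in effect here but should be checked against the running configuration; WAL and L0 are left out.

```shell
#!/bin/sh
# Sketch: cumulative RocksDB level target sizes under assumed settings.
BASE_MIB=256   # assumed max_bytes_for_level_base (256 MiB)
MULT=10        # assumed max_bytes_for_level_multiplier

level_sizes() {
    # $1 = number of levels; prints "level size_MiB cumulative_MiB" per line
    size=$BASE_MIB
    total=0
    n=1
    while [ "$n" -le "$1" ]; do
        total=$(( total + size ))
        echo "L$n $size $total"
        size=$(( size * MULT ))
        n=$(( n + 1 ))
    done
}

level_sizes 4
```

Under these assumptions L1 to L3 add up to 28416 MiB (about 28 GiB, hence "~30GB") and adding L4 gives 284416 MiB (about 278 GiB, hence "300GB"). A DB device sized between those steps gains little, because a level is only placed on the fast device if it fits there completely.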

> Last but not least, the trigger was probably a broken HDD on the Sunday
> before. Rebalancing then takes several days and was still ongoing when
> the problems started.
> 
> Cheers
>  Harry
> 
> On 14.05.20 11:08, Wout van Heeswijk wrote:
>> Hi Harald,
>>
>> Your cluster has a lot of objects per osd/pg and the pg logs will grow
>> fast and large because of this. The pg_logs will keep growing as long
>> as your cluster's PGs are not active+clean. This means you are now in
>> a loop where you cannot get stable running OSDs because the pg_logs
>> take too much memory, and therefore the OSDs cannot purge the pg_logs...
>>
>> I suggest you lower the values for both the osd_min_pg_log_entries and
>> the osd_max_pg_log_entries. Lowering these values will cause Ceph to
>> go into backfilling much earlier, but the memory usage of the OSDs
>> will go down significantly enabling them to run stable. The default is
>> 3000 for both of these values.
>>
>> You can lower them to 500 by executing:
>>
>> ceph config set osd osd_min_pg_log_entries 500
>> ceph config set osd osd_max_pg_log_entries 500
>>
>> When you lower these values, you will get more backfilling instead of
>> recoveries but I think it will help you get through this situation.
>>
>> kind regards,
>>
>> Wout
>> 42on
>>
>> On 13-05-2020 07:27, Harald Staub wrote:
>>> Hi Mark
>>>
>>> Thank you for your feedback!
>>>
>>> The maximum number of PGs per OSD is only 123. But we have PGs with a
>>> lot of objects. For RGW, there is an EC pool 8+3 with 1024 PGs with
>>> 900M objects, maybe this is the problematic part. The OSDs are 510
>>> hdd, 32 ssd.
>>>
>>> Not sure, do you suggest to use something like
>>> ceph-objectstore-tool --op trim-pg-log ?
>>>
>>> When done correctly, would the risk be a lot of backfilling? Or also
>>> data loss?
>>>
>>> Also, getting the cluster up is one thing; keeping it running seems to
>>> be a real challenge right now (OOM killer) ...
>>>
>>> Cheers
>>>  Harry
>>>
>>> On 13.05.20 07:10, Mark Nelson wrote:
>>>> Hi Harald,
>>>>
>>>>
>>>> Changing the bluestore cache settings will have no effect at all on
>>>> pglog memory consumption.  You can try either reducing the number of
>>>> PGs (you might want to check and see how many PGs you have and
>>>> specifically how many PGs on that OSD), or decrease the number of
>>>> pglog entries per PG.  Keep in mind that fewer PG log entries may
>>>> impact recovery.  FWIW, 8.5GB of memory usage for pglog implies that
>>>> you have a lot of PGs per OSD, so that's probably the first place to
>>>> look.
>>>>
>>>>
>>>> Good luck!
>>>>
>>>> Mark
>>>>
>>>>
>>>> On 5/12/20 5:10 PM, Harald Staub wrote:
>>>>> Several OSDs of one of our clusters are down currently because RAM
>>>>> usage has increased during the last days. Now it is more than we
>>>>> can handle on some systems. Frequently OSDs get killed by the OOM
>>>>> killer. Looking at "ceph daemon osd.$OSD_ID dump_mempools", it
>>>>> shows that nearly all (about 8.5 GB) is taken by osd_pglog, e.g.
>>>>>
>>>>>             "osd_pglog": {
>>>>>                 "items": 461859,
>>>>>                 "bytes": 8445595868
>>>>>             },
>>>>>
>>>>> We tried to reduce it, with "osd memory target" and even with
>>>>> "bluestore cache autotune = false" (together with "bluestore cache
>>>>> size hdd"), but there was no effect at all.
>>>>>
>>>>> I remember the pglog_hardlimit parameter, but that is already set
>>>>> by default with Nautilus I read. I.e. this is on Nautilus, 14.2.8.
>>>>>
>>>>> Is there a way to limit this pglog memory?
>>>>>
>>>>> Cheers
>>>>>  Harry
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>