Quick update: restarting OSDs is not enough for us to compact the DB. So we:

  stop the OSD
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
  start the OSD

(a rough scripted version of this cycle is at the very bottom of this mail).

It seems to fix the spillover. Until it grows again.

Cheers,
Kalle

----- Original Message -----
> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 1 December, 2020 15:09:37
> Subject: Re: osd_pglog memory hoarding - another case

> Hi All,
> back to this. Dan, it seems we're following exactly in your footsteps.
>
> We recovered from our large pg_logs and got the cluster running. A week after our cluster was OK, we started seeing big memory increases again. I don't know if we had buffer_anon issues before or if our big pg_logs were masking it. But we started seeing bluefs spillover and buffer_anon growth.
>
> This led to a whole other series of problems with OOM killing, which probably resulted in mon node DB growth which filled the disk, which resulted in all mons going down, and a bigger mess of bringing everything back up.
>
> However, we're back. But I think we can confirm the buffer_anon growth and bluefs spillover.
>
> We now have a job that constantly writes 10k objects into a bucket and deletes them.
>
> This may curb the memory growth, but I don't think it stops the problem. We're just testing restarting OSDs, and while it takes a while, it seems it may help. Of course this is not the greatest fix in production.
>
> Has anybody gleaned any new information on this issue? Things to tweak? Fixes on the horizon? Other mitigations?
>
> Cheers,
> Kalle
>
>
> ----- Original Message -----
>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Thursday, 19 November, 2020 13:56:37
>> Subject: Re: osd_pglog memory hoarding - another case
>
>> Hello,
>> I thought I'd post an update.
>>
>> Setting the pg_log size to 500 and running the offline trim operation sequentially on all OSDs seems to help. With our current setup it takes about 12-48 h per node, depending on the PGs per OSD. The PG counts per OSD we have are ~180-750, with a majority around 200, and some nodes consistently have 500 per OSD. The limiting factor for the recovery time seems to be our NVMe, which we use for RocksDB for the OSDs.
>>
>> We haven't fully recovered yet; we're working on it. Almost all our PGs are back up, we still have ~40/18000 PGs down, but I think we'll get there. Currently ~40 of 1200 OSDs are down.
>>
>> The previous mention of 32 kB per pg_log entry seems to be in the correct order of magnitude for us too. If we count 32 kB * 200 PGs * 3000 log entries, we're close to the 20 GB per OSD process.
>>
>> For the nodes that have been trimmed, we're hovering around 100 GB/node of memory use, or ~4 GB per OSD, and so far it seems stable, but we don't have longer-term data on that, and we don't know exactly how it behaves when load is applied. However, if we're currently at the pg_log limit of 500, adding load should hopefully not increase pg_log memory consumption.
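>>
>> (In case it helps when comparing numbers: the per-OSD pg_log memory should be readable from the OSD admin socket, assuming you have access to it on the OSD host, with something like
>>
>>   ceph daemon osd.<id> dump_mempools | jq '.mempool.by_pool.osd_pglog'
>>
>> which reports the items and bytes of the osd_pglog mempool.)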
>>
>> Cheers,
>> Kalle
>>
>> ----- Original Message -----
>>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> Subject: Re: osd_pglog memory hoarding - another case
>>
>>> Hi,
>>>
>>>> I don't think the default osd_min_pg_log_entries has changed recently.
>>>> In https://tracker.ceph.com/issues/47775 I proposed that we limit the pg log length by memory -- if it is indeed possible for log entries to get into several MB, then this would be necessary IMHO.
>>>
>>> I've had a surprising crash course on pg_log in the last 36 hours. But for the size of each entry, you're right. I counted pg_log * OSDs, and did not factor in pg_log * OSDs * PGs on the OSD. Still, the total memory an OSD uses for pg_log was ~22 GB per OSD process.
>>>
>>>> But you said you were trimming PG logs with the offline tool? How long were those logs that needed to be trimmed?
>>>
>>> The logs we are trimming were ~3000 entries; we trimmed them to the new size of 500. After restarting the OSDs, the pg_log memory usage dropped from ~22 GB to what we guess is 2-3 GB, but with the cluster in this state, it's hard to be specific.
>>>
>>> Cheers,
>>> Kalle
>>>
>>>> -- dan
>>>>
>>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>>>
>>>>> Another idea, which I don't know if it has any merit.
>>>>>
>>>>> If 8 MB is a realistic log size (or has this grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)?
>>>>>
>>>>> If the minimum were set to 1000, at 8 MB per log, we would have issues with memory.
>>>>>
>>>>> Cheers,
>>>>> Kalle
>>>>>
>>>>> ----- Original Message -----
>>>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>>>> > Subject: Re: osd_pglog memory hoarding - another case
>>>>>
>>>>> > Hi Dan & co.,
>>>>> > Thanks for the support (moral and technical).
>>>>> >
>>>>> > That sounds like a good guess, but it seems there is nothing alarming here. In all our pools some PG logs are a bit over 3100, but not at any exceptional values.
>>>>> >
>>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>>>> >   "pgid": "37.2b9",
>>>>> >   "ondisk_log_size": 3103,
>>>>> >   "pgid": "33.e",
>>>>> >   "ondisk_log_size": 3229,
>>>>> >   "pgid": "7.2",
>>>>> >   "ondisk_log_size": 3111,
>>>>> >   "pgid": "26.4",
>>>>> >   "ondisk_log_size": 3185,
>>>>> >   "pgid": "33.4",
>>>>> >   "ondisk_log_size": 3311,
>>>>> >   "pgid": "33.8",
>>>>> >   "ondisk_log_size": 3278,
>>>>> >
>>>>> > I also have no idea what the average size of a pg_log entry should be; in our case it seems to be around 8 MB (22 GB / 3000 entries).
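>>>>> >
>>>>> > (If it's useful to anyone: the same check without the egrep step, printing just pgid/log-length pairs, should work with something along these lines, assuming the same pg dump JSON as above.)
>>>>> >
>>>>> >   ceph pg dump -f json | jq -r '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100) | "\(.pgid) \(.ondisk_log_size)"'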
>>>>> >
>>>>> > Cheers,
>>>>> > Kalle
>>>>> >
>>>>> > ----- Original Message -----
>>>>> >> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>>>> >> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>
>>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>>>> >
>>>>> >> Hi Kalle,
>>>>> >>
>>>>> >> Do you have active PGs now with huge pglogs? You can do something like this to find them:
>>>>> >>
>>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3000)'
>>>>> >>
>>>>> >> If you find some, could you increase to debug_osd = 10 and then share the OSD log? I am interested in the debug lines from calc_trim_to_aggressively (or calc_trim_to if you didn't enable pglog_hardlimit), but the whole log might show other issues.
>>>>> >>
>>>>> >> Cheers, dan
>>>>> >>
>>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>> >>>
>>>>> >>> Hi Kalle,
>>>>> >>>
>>>>> >>> Strangely and luckily, in our case the memory explosion didn't reoccur after that incident. So I can mostly only offer moral support.
>>>>> >>>
>>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I think this is suspicious:
>>>>> >>>
>>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>>>> >>>
>>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>>>> >>>
>>>>> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if there could be an unforeseen condition where `last_update_ondisk` isn't being updated correctly, and therefore the OSD stops trimming the pg_log altogether.
>>>>> >>>
>>>>> >>> Xie or Samuel: does that sound possible?
>>>>> >>>
>>>>> >>> Cheers, Dan
>>>>> >>>
>>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>>> >>> >
>>>>> >>> > Hello all,
>>>>> >>> > wrt:
>>>>> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>>>> >>> >
>>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>>>>> >>> >
>>>>> >>> > We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
>>>>> >>> >
>>>>> >>> > The cluster has been running fine, and (as relevant to this post) the memory usage has been stable at 100 GB/node. We've had the default pg_log of 3000. The user traffic doesn't seem to have been exceptional lately.
>>>>> >>> >
>>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory usage on the OSD nodes started to grow. On each node it grew steadily by about 30 GB/day, until the servers started OOM killing OSD processes.
>>>>> >>> >
>>>>> >>> > After a lot of debugging we found that the pg_logs were huge. Each OSD process's pg_log had grown to ~22 GB, which we naturally didn't have memory for, and then the cluster was in an unstable situation. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
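>>>>> >>> >
>>>>> >>> > (Back-of-envelope, if I'm counting right: ~20k PGs * 11 shards for the 8+3 EC pool, spread over ~1400 OSDs, is on the order of 150-200 pg_logs per OSD, each up to 3000 entries, so roughly half a million log entries per OSD process. ~22 GB over that would be a few tens of kB per entry.)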
>>>>> >>> >
>>>>> >>> > We've reduced the pg_log to 500, and started offline trimming it where we can, and also just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering, and we still have a lot of OSDs down and out.
>>>>> >>> >
>>>>> >>> > We're unsure if version 14.2.13 triggered this, or if the OSD restarts triggered this (or something unrelated we don't see).
>>>>> >>> >
>>>>> >>> > This mail is mostly to ask if there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure if 14.2.13 triggered this, this is also to put a data point out there for other debuggers.
>>>>> >>> >
>>>>> >>> > Cheers,
>>>>> >>> > Kalle Happonen
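
P.S. In case it saves someone some typing, here is a rough sketch of the compaction cycle mentioned at the top of this mail, as a loop over the OSD IDs on one host. It assumes non-containerized OSDs managed by systemd (ceph-osd@<id> units) and the default /var/lib/ceph/osd/ceph-<id> paths, so adjust for your own setup, and make sure each OSD is actually safe to take down before stopping it.

  # avoid rebalancing while OSDs are briefly down
  ceph osd set noout

  for osd in 1 2 3; do    # the OSD IDs on this host
      systemctl stop ceph-osd@$osd
      ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
      systemctl start ceph-osd@$osd
      # wait until the OSD answers again before moving to the next one
      while ! ceph tell osd.$osd version >/dev/null 2>&1; do sleep 10; done
  done

  ceph osd unset noout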