For anybody facing similar issues, we wrote a blog post about everything we faced and how we worked through it:

https://cloud.blog.csc.fi/2020/12/allas-november-2020-incident-details.html

Cheers,
Kalle

----- Original Message -----
> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
> Sent: Monday, 14 December, 2020 10:25:32
> Subject: Re: osd_pglog memory hoarding - another case

> Hi all,
> Ok, so I have some updates on this.
>
> We noticed that we had a bucket with a huge amount of pending RGW garbage collection. It was growing faster than we could clean it up.
>
> We suspect this was because users tried to do "s3cmd sync" operations on SWIFT-uploaded large files. This could logically cause issues, as S3 and SWIFT calculate md5sums differently on large objects.
>
> The following command shows the pending GC, and also which buckets are affected:
>
> radosgw-admin gc list | grep oid > garbagecollectionlist.txt
>
> Our total RGW GC backlog was up to ~40 M.
>
> We stopped the main s3cmd sync workflow which was feeding the GC growth. Then we started running more aggressive radosgw garbage collection.
>
> This really helped with the memory use. It dropped a lot, and now that the GC backlog has been cleaned up, the memory has *knock on wood* stayed at a more stable, lower level.
>
> So we hope we found the (or a) trigger for the problem.
>
> Hopefully this reveals another thread to pull for others debugging the same issue (and for us when we hit it again).
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Tuesday, 1 December, 2020 16:53:50
>> Subject: Re: Re: osd_pglog memory hoarding - another case
>
>> Hi Kalle,
>>
>> Thanks for the update. Unfortunately I haven't made any progress on understanding the root cause of this issue.
>> (We are still tracking our mempools closely in Grafana, and in our case they are no longer exploding like in the incident.)
>>
>> Cheers, Dan
>>
>> On Tue, Dec 1, 2020 at 3:49 PM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>
>>> Quick update: restarting OSDs is not enough for us to compact the DB. So we:
>>>
>>> stop the OSD
>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
>>> start the OSD
>>>
>>> It seems to fix the spillover. Until it grows again.
>>>
>>> Cheers,
>>> Kalle
>>>
>>> ----- Original Message -----
>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> > Sent: Tuesday, 1 December, 2020 15:09:37
>>> > Subject: Re: osd_pglog memory hoarding - another case
>>>
>>> > Hi All,
>>> > Back to this. Dan, it seems we're following exactly in your footsteps.
>>> >
>>> > We recovered from our large pg_logs and got the cluster running. A week after the cluster was OK, we started seeing big memory increases again. I don't know if we had buffer_anon issues before, or if our big pg_logs were masking them. But we started seeing bluefs spillover and buffer_anon growth.
>>> >
>>> > This led to a whole other series of problems with OOM killing, which probably caused mon node DB growth that filled the disk, which in turn took all the mons down, and a bigger mess of bringing everything back up.
>>> >
>>> > However, we're back. But I think we can confirm the buffer_anon growth and the bluefs spillover.
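>>> >
>>> > (For reference, a minimal sketch of the per-OSD checks that show both of these -- assuming osd.0 is local on the host you run it on, jq is available, and the jq paths match the Nautilus output layout:)
>>> >
>>> > ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.buffer_anon'   # buffer_anon items/bytes
>>> > ceph daemon osd.0 perf dump bluefs | jq '.bluefs.slow_used_bytes'     # > 0 means RocksDB has spilled onto the slow device
>>> > ceph health detail | grep -i spillover                                # cluster-wide BLUEFS_SPILLOVER warnings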
>>> >
>>> > We now have a job that constantly writes 10k objects into a bucket and deletes them.
>>> >
>>> > This may curb the memory growth, but I don't think it stops the problem. We're also just testing restarting OSDs, and while it takes a while, it seems it may help. Of course this is not the greatest fix in production.
>>> >
>>> > Has anybody gleaned any new information on this issue? Things to tweak? Fixes on the horizon? Other mitigations?
>>> >
>>> > Cheers,
>>> > Kalle
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >> Sent: Thursday, 19 November, 2020 13:56:37
>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>> >
>>> >> Hello,
>>> >> I thought I'd post an update.
>>> >>
>>> >> Setting the pg_log size to 500 and running the offline trim operation sequentially on all OSDs seems to help. With our current setup it takes about 12-48 h per node, depending on the PGs per OSD. The PG counts per OSD are ~180-750, with a majority around 200, and some nodes consistently have 500 per OSD. The limiting factor for the recovery time seems to be our NVMe devices, which we use for RocksDB for the OSDs.
>>> >>
>>> >> We haven't fully recovered yet; we're working on it. Almost all our PGs are back up -- we still have ~40/18000 PGs down, but I think we'll get there. Currently ~40/1200 OSDs are down.
>>> >>
>>> >> The previous mention of 32 kB per pg_log entry seems to be in the correct magnitude for us too. If we count 32 kB * 200 PGs * 3000 log entries, we're close to the 20 GB per OSD process.
>>> >>
>>> >> For the nodes that have been trimmed, we're hovering around 100 GB/node of memory use, or ~4 GB per OSD. So far this seems stable, but we don't have longer-term data on that, and we don't know exactly how it behaves when load is applied. However, if we're currently at the pg_log limit of 500, adding load should hopefully not increase pg_log memory consumption.
>>> >>
>>> >> Cheers,
>>> >> Kalle
>>> >>
>>> >> ----- Original Message -----
>>> >>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> >>> Subject: Re: osd_pglog memory hoarding - another case
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>>> I don't think the default osd_min_pg_log_entries has changed recently.
>>> >>>> In https://tracker.ceph.com/issues/47775 I proposed that we limit the pg log length by memory -- if it is indeed possible for log entries to get into several MB, then this would be necessary IMHO.
>>> >>>
>>> >>> I've had a surprising crash course on pg_log in the last 36 hours. But on the size of each entry, you're right. I counted pg_log size * OSDs, and did not factor in pg_log size * OSDs * PGs per OSD. Still, the total memory an OSD uses for pg_log was ~22 GB per OSD process.
>>> >>>
>>> >>>
>>> >>>> But you said you were trimming PG logs with the offline tool? How long were those logs that needed to be trimmed?
>>> >>>
>>> >>> The logs we are trimming were ~3000 entries; we trimmed them to the new size of 500.
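>>> >>>
>>> >>> (For reference, per OSD the offline trim is roughly the loop below -- a sketch, assuming ceph-objectstore-tool picks the target length up from osd_max_pg_log_entries, which is passed here as a command-line override; it could equally be set in ceph.conf before running the tool. $OSD is a placeholder for the OSD id.)
>>> >>>
>>> >>> OSD=123   # placeholder OSD id
>>> >>> systemctl stop ceph-osd@${OSD}
>>> >>> for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD} --op list-pgs); do
>>> >>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD} --pgid ${pg} --op trim-pg-log --osd_max_pg_log_entries=500 --osd_min_pg_log_entries=500
>>> >>> done
>>> >>> systemctl start ceph-osd@${OSD}
>>> >>>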
>>> >>> After restarting the OSDs, this dropped the pg_log memory usage from ~22 GB to what we guess is 2-3 GB, but with the cluster in this state it's hard to be specific.
>>> >>>
>>> >>> Cheers,
>>> >>> Kalle
>>> >>>
>>> >>>
>>> >>>> -- dan
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>> >>>>>
>>> >>>>> Another idea, which I don't know whether it has any merit.
>>> >>>>>
>>> >>>>> If 8 MB is a realistic size per log entry (or has this grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)?
>>> >>>>>
>>> >>>>> If the minimum were set to 1000, at 8 MB per log entry we would have memory issues.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Kalle
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> ----- Original Message -----
>>> >>>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>> >>>>> > Subject: Re: osd_pglog memory hoarding - another case
>>> >>>>>
>>> >>>>> > Hi Dan & co.,
>>> >>>>> > Thanks for the support (moral and technical).
>>> >>>>> >
>>> >>>>> > That sounds like a good guess, but it seems there is nothing alarming here. In all our pools, some PGs are a bit over 3100 log entries, but not at any exceptional values.
>>> >>>>> >
>>> >>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>> >>>>> > "pgid": "37.2b9",
>>> >>>>> > "ondisk_log_size": 3103,
>>> >>>>> > "pgid": "33.e",
>>> >>>>> > "ondisk_log_size": 3229,
>>> >>>>> > "pgid": "7.2",
>>> >>>>> > "ondisk_log_size": 3111,
>>> >>>>> > "pgid": "26.4",
>>> >>>>> > "ondisk_log_size": 3185,
>>> >>>>> > "pgid": "33.4",
>>> >>>>> > "ondisk_log_size": 3311,
>>> >>>>> > "pgid": "33.8",
>>> >>>>> > "ondisk_log_size": 3278,
>>> >>>>> >
>>> >>>>> > I also have no idea what the average size of a pg log entry should be; in our case it seems to be around 8 MB (22 GB / 3000 entries).
>>> >>>>> >
>>> >>>>> > Cheers,
>>> >>>>> > Kalle
>>> >>>>> >
>>> >>>>> > ----- Original Message -----
>>> >>>>> >> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>>>> >> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>
>>> >>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>> >>>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>> >>>>> >
>>> >>>>> >> Hi Kalle,
>>> >>>>> >>
>>> >>>>> >> Do you have active PGs now with huge pglogs? You can do something like this to find them:
>>> >>>>> >>
>>> >>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3000)'
>>> >>>>> >>
>>> >>>>> >> If you find some, could you increase to debug_osd = 10 and then share the OSD log? I am interested in the debug lines from calc_trim_to_aggressively (or calc_trim_to if you didn't enable pglog_hardlimit), but the whole log might show other issues.
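>>> >>>>> >>
>>> >>>>> >> (Something like this bumps the log level on a single OSD and reverts it afterwards -- osd.123 is a placeholder id, and 1/5 is assumed to be the default debug_osd level to go back to:)
>>> >>>>> >>
>>> >>>>> >> ceph tell osd.123 injectargs '--debug_osd 10'
>>> >>>>> >> # ...wait while a long pg log is active, then collect /var/log/ceph/ceph-osd.123.log (default log path)...
>>> >>>>> >> ceph tell osd.123 injectargs '--debug_osd 1/5'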
>>> >>>>> >>
>>> >>>>> >> Cheers, dan
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> >>>>> >>>
>>> >>>>> >>> Hi Kalle,
>>> >>>>> >>>
>>> >>>>> >>> Strangely and luckily, in our case the memory explosion didn't recur after that incident. So I can mostly only offer moral support.
>>> >>>>> >>>
>>> >>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I think this is suspicious:
>>> >>>>> >>>
>>> >>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>> >>>>> >>>
>>> >>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>> >>>>> >>>
>>> >>>>> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if there could be an unforeseen condition where `last_update_ondisk` isn't being updated correctly, and therefore the OSD stops trimming the pg_log altogether.
>>> >>>>> >>>
>>> >>>>> >>> Xie or Samuel: does that sound possible?
>>> >>>>> >>>
>>> >>>>> >>> Cheers, Dan
>>> >>>>> >>>
>>> >>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>> >>>>> >>> >
>>> >>>>> >>> > Hello all,
>>> >>>>> >>> > wrt: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>> >>>>> >>> >
>>> >>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>>> >>>>> >>> >
>>> >>>>> >>> > We have a 56-node object storage (S3 + SWIFT) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
>>> >>>>> >>> >
>>> >>>>> >>> > The cluster has been running fine, and (as relevant to the post) the memory usage has been stable at 100 GB/node. We've had the default pg_log length of 3000. The user traffic doesn't seem to have been exceptional lately.
>>> >>>>> >>> >
>>> >>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory usage on the OSD nodes started to grow. On each node it grew steadily by about 30 GB/day, until the servers started OOM-killing OSD processes.
>>> >>>>> >>> >
>>> >>>>> >>> > After a lot of debugging we found that the pg_logs were huge. The pg_log of each OSD process had grown to ~22 GB, which we naturally didn't have memory for, and then the cluster was in an unstable situation. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
>>> >>>>> >>> >
>>> >>>>> >>> > We've reduced the pg_log to 500, started offline trimming it where we can, and also just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering, and still have a lot of OSDs down and out.
>>> >>>>> >>> >
>>> >>>>> >>> > We're unsure whether version 14.2.13 triggered this, or the OSD restarts (or something unrelated we don't see).
>>> >>>>> >>> >
>>> >>>>> >>> > This mail is mostly to ask whether there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure whether 14.2.13 triggered this, this is also to put a data point out there for other debuggers.
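>>> >>>>> >>> >
>>> >>>>> >>> > (For reference, dropping the limit to 500 cluster-wide is just something like the two commands below -- a sketch using the centralized config; the same options can go in ceph.conf instead. The offline trim is still needed for logs that are already longer than that.)
>>> >>>>> >>> >
>>> >>>>> >>> > ceph config set osd osd_max_pg_log_entries 500
>>> >>>>> >>> > ceph config set osd osd_min_pg_log_entries 500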
>>> >>>>> >>> >
>>> >>>>> >>> > Cheers,
>>> >>>>> >>> > Kalle Happonen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx