For anybody facing similar issues, we wrote a blog post about everything we faced and how we worked through it:

https://cloud.blog.csc.fi/2020/12/allas-november-2020-incident-details.html

Cheers,
Kalle

----- Original Message -----
> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
> Sent: Monday, 14 December, 2020 10:25:32
> Subject: Re: osd_pglog memory hoarding - another case

> Hi all,
> Ok, so I have some updates on this.
>
> We noticed that we had a bucket with a huge amount of pending RGW garbage collection. It was growing faster than we could clean it up.
>
> We suspect this was because users tried to do "s3cmd sync" operations on SWIFT-uploaded large files. This could logically cause issues, as S3 and SWIFT calculate md5sums differently on large objects.
>
> The following command shows the pending GC, and also which buckets are affected:
>
> radosgw-admin gc list | grep oid > garbagecollectionlist.txt
>
> Our total RGW GC backlog was up to ~40 M.
>
> We stopped the main s3cmd sync workflow which was feeding the GC growth. Then we started running more aggressive radosgw garbage collection.
>
> This really helped with the memory use. It dropped a lot, and now that the GC backlog has been cleaned up, the memory has *knock on wood* stayed at a more stable, lower level.
>
> So we hope we found the (or a) trigger for the problem.
>
> Hopefully this reveals another thread to pull for others debugging the same issue (and for us when we hit it again).
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Tuesday, 1 December, 2020 16:53:50
>> Subject: Re: Re: osd_pglog memory hoarding - another case
>
>> Hi Kalle,
>>
>> Thanks for the update. Unfortunately I haven't made any progress on understanding the root cause of this issue.
>> (We are still tracking our mempools closely in Grafana, and in our case they are no longer exploding like in the incident.)
>>
>> Cheers, Dan
>>
>> On Tue, Dec 1, 2020 at 3:49 PM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>
>>> Quick update: restarting OSDs is not enough for us to compact the DB. So we:
>>>
>>> stop the OSD
>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
>>> start the OSD
>>>
>>> It seems to fix the spillover. Until it grows again.
>>>
>>> Cheers,
>>> Kalle
>>>
>>> ----- Original Message -----
>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> > Sent: Tuesday, 1 December, 2020 15:09:37
>>> > Subject: Re: osd_pglog memory hoarding - another case
>>>
>>> > Hi All,
>>> > Back to this. Dan, it seems we're following exactly in your footsteps.
>>> >
>>> > We recovered from our large pg_logs and got the cluster running. A week after the cluster was OK, we started seeing big memory increases again. I don't know if we had buffer_anon issues before, or if our big pg_logs were masking them. But we started seeing bluefs spillover and buffer_anon growth.
>>> >
>>> > This led to a whole other series of problems with OOM killing, which probably caused mon node DB growth that filled the disk, which in turn took all the mons down, and a bigger mess of bringing everything back up.
>>> >
>>> > However, we're back. But I think we can confirm the buffer_anon growth and the bluefs spillover.
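>>> >
>>> > (For reference, a minimal sketch of the per-OSD checks that show both of these -- assuming osd.0 is local on the host you run it on, jq is available, and the jq paths match the Nautilus output layout:)
>>> >
>>> > ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.buffer_anon'   # buffer_anon items/bytes
>>> > ceph daemon osd.0 perf dump bluefs | jq '.bluefs.slow_used_bytes'     # > 0 means RocksDB has spilled onto the slow device
>>> > ceph health detail | grep -i spillover                                # cluster-wide BLUEFS_SPILLOVER warnings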
>>> >
>>> > We now have a job that constantly writes 10k objects into a bucket and deletes them.
>>> >
>>> > This may curb the memory growth, but I don't think it stops the problem. We're also just testing restarting OSDs, and while it takes a while, it seems it may help. Of course this is not the greatest fix in production.
>>> >
>>> > Has anybody gleaned any new information on this issue? Things to tweak? Fixes on the horizon? Other mitigations?
>>> >
>>> > Cheers,
>>> > Kalle
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >> Sent: Thursday, 19 November, 2020 13:56:37
>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>> >
>>> >> Hello,
>>> >> I thought I'd post an update.
>>> >>
>>> >> Setting the pg_log size to 500 and running the offline trim operation sequentially on all OSDs seems to help. With our current setup it takes about 12-48 h per node, depending on the PGs per OSD. The PG counts per OSD are ~180-750, with a majority around 200, and some nodes consistently have 500 per OSD. The limiting factor for the recovery time seems to be our NVMe devices, which we use for RocksDB for the OSDs.
>>> >>
>>> >> We haven't fully recovered yet; we're working on it. Almost all our PGs are back up -- we still have ~40/18000 PGs down, but I think we'll get there. Currently ~40/1200 OSDs are down.
>>> >>
>>> >> The previous mention of 32 kB per pg_log entry seems to be in the correct magnitude for us too. If we count 32 kB * 200 PGs * 3000 log entries, we're close to the 20 GB per OSD process.
>>> >>
>>> >> For the nodes that have been trimmed, we're hovering around 100 GB/node of memory use, or ~4 GB per OSD. So far this seems stable, but we don't have longer-term data on that, and we don't know exactly how it behaves when load is applied. However, if we're currently at the pg_log limit of 500, adding load should hopefully not increase pg_log memory consumption.
>>> >>
>>> >> Cheers,
>>> >> Kalle
>>> >>
>>> >> ----- Original Message -----
>>> >>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> >>> Subject: Re: osd_pglog memory hoarding - another case
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>>> I don't think the default osd_min_pg_log_entries has changed recently.
>>> >>>> In https://tracker.ceph.com/issues/47775 I proposed that we limit the pg log length by memory -- if it is indeed possible for log entries to get into several MB, then this would be necessary IMHO.
>>> >>>
>>> >>> I've had a surprising crash course on pg_log in the last 36 hours. But on the size of each entry, you're right. I counted pg_log size * OSDs, and did not factor in pg_log size * OSDs * PGs per OSD. Still, the total memory an OSD uses for pg_log was ~22 GB per OSD process.
>>> >>>
>>> >>>
>>> >>>> But you said you were trimming PG logs with the offline tool? How long were those logs that needed to be trimmed?
>>> >>>
>>> >>> The logs we are trimming were ~3000 entries; we trimmed them to the new size of 500.
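>>> >>>
>>> >>> (For reference, per OSD the offline trim is roughly the loop below -- a sketch, assuming ceph-objectstore-tool picks the target length up from osd_max_pg_log_entries, which is passed here as a command-line override; it could equally be set in ceph.conf before running the tool. $OSD is a placeholder for the OSD id.)
>>> >>>
>>> >>> OSD=123   # placeholder OSD id
>>> >>> systemctl stop ceph-osd@${OSD}
>>> >>> for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD} --op list-pgs); do
>>> >>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD} --pgid ${pg} --op trim-pg-log --osd_max_pg_log_entries=500 --osd_min_pg_log_entries=500
>>> >>> done
>>> >>> systemctl start ceph-osd@${OSD}
>>> >>>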
>>> >>> After restarting the OSDs, this dropped the pg_log memory usage from ~22 GB to what we guess is 2-3 GB, but with the cluster in this state it's hard to be specific.
>>> >>>
>>> >>> Cheers,
>>> >>> Kalle
>>> >>>
>>> >>>
>>> >>>> -- dan
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>> >>>>>
>>> >>>>> Another idea, which I don't know whether it has any merit.
>>> >>>>>
>>> >>>>> If 8 MB is a realistic size per log entry (or has this grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)?
>>> >>>>>
>>> >>>>> If the minimum were set to 1000, at 8 MB per log entry we would have memory issues.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Kalle
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> ----- Original Message -----
>>> >>>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> >>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>> >>>>> > Subject: Re: osd_pglog memory hoarding - another case
>>> >>>>>
>>> >>>>> > Hi Dan & co.,
>>> >>>>> > Thanks for the support (moral and technical).
>>> >>>>> >
>>> >>>>> > That sounds like a good guess, but it seems there is nothing alarming here. In all our pools, some PGs are a bit over 3100 log entries, but not at any exceptional values.
>>> >>>>> >
>>> >>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>> >>>>> > "pgid": "37.2b9",
>>> >>>>> > "ondisk_log_size": 3103,
>>> >>>>> > "pgid": "33.e",
>>> >>>>> > "ondisk_log_size": 3229,
>>> >>>>> > "pgid": "7.2",
>>> >>>>> > "ondisk_log_size": 3111,
>>> >>>>> > "pgid": "26.4",
>>> >>>>> > "ondisk_log_size": 3185,
>>> >>>>> > "pgid": "33.4",
>>> >>>>> > "ondisk_log_size": 3311,
>>> >>>>> > "pgid": "33.8",
>>> >>>>> > "ondisk_log_size": 3278,
>>> >>>>> >
>>> >>>>> > I also have no idea what the average size of a pg log entry should be; in our case it seems to be around 8 MB (22 GB / 3000 entries).
>>> >>>>> >
>>> >>>>> > Cheers,
>>> >>>>> > Kalle
>>> >>>>> >
>>> >>>>> > ----- Original Message -----
>>> >>>>> >> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> >>>>> >> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> >>>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>
>>> >>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>> >>>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>> >>>>> >
>>> >>>>> >> Hi Kalle,
>>> >>>>> >>
>>> >>>>> >> Do you have active PGs now with huge pglogs? You can do something like this to find them:
>>> >>>>> >>
>>> >>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3000)'
>>> >>>>> >>
>>> >>>>> >> If you find some, could you increase to debug_osd = 10 and then share the OSD log? I am interested in the debug lines from calc_trim_to_aggressively (or calc_trim_to if you didn't enable pglog_hardlimit), but the whole log might show other issues.
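>>> >>>>> >>
>>> >>>>> >> (Something like this bumps the log level on a single OSD and reverts it afterwards -- osd.123 is a placeholder id, and 1/5 is assumed to be the default debug_osd level to go back to:)
>>> >>>>> >>
>>> >>>>> >> ceph tell osd.123 injectargs '--debug_osd 10'
>>> >>>>> >> # ...wait while a long pg log is active, then collect /var/log/ceph/ceph-osd.123.log (default log path)...
>>> >>>>> >> ceph tell osd.123 injectargs '--debug_osd 1/5'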
>>> >>>>> >>
>>> >>>>> >> Cheers, dan
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> >>>>> >>>
>>> >>>>> >>> Hi Kalle,
>>> >>>>> >>>
>>> >>>>> >>> Strangely and luckily, in our case the memory explosion didn't recur after that incident. So I can mostly only offer moral support.
>>> >>>>> >>>
>>> >>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I think this is suspicious:
>>> >>>>> >>>
>>> >>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>> >>>>> >>>
>>> >>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>> >>>>> >>>
>>> >>>>> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if there could be an unforeseen condition where `last_update_ondisk` isn't being updated correctly, and therefore the OSD stops trimming the pg_log altogether.
>>> >>>>> >>>
>>> >>>>> >>> Xie or Samuel: does that sound possible?
>>> >>>>> >>>
>>> >>>>> >>> Cheers, Dan
>>> >>>>> >>>
>>> >>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>> >>>>> >>> >
>>> >>>>> >>> > Hello all,
>>> >>>>> >>> > wrt: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>> >>>>> >>> >
>>> >>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>>> >>>>> >>> >
>>> >>>>> >>> > We have a 56-node object storage (S3 + SWIFT) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
>>> >>>>> >>> >
>>> >>>>> >>> > The cluster has been running fine, and (as relevant to the post) the memory usage has been stable at 100 GB/node. We've had the default pg_log length of 3000. The user traffic doesn't seem to have been exceptional lately.
>>> >>>>> >>> >
>>> >>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory usage on the OSD nodes started to grow. On each node it grew steadily by about 30 GB/day, until the servers started OOM-killing OSD processes.
>>> >>>>> >>> >
>>> >>>>> >>> > After a lot of debugging we found that the pg_logs were huge. The pg_log of each OSD process had grown to ~22 GB, which we naturally didn't have memory for, and then the cluster was in an unstable situation. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
>>> >>>>> >>> >
>>> >>>>> >>> > We've reduced the pg_log to 500, started offline trimming it where we can, and also just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering, and still have a lot of OSDs down and out.
>>> >>>>> >>> >
>>> >>>>> >>> > We're unsure whether version 14.2.13 triggered this, or the OSD restarts (or something unrelated we don't see).
>>> >>>>> >>> >
>>> >>>>> >>> > This mail is mostly to ask whether there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure whether 14.2.13 triggered this, this is also to put a data point out there for other debuggers.
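>>> >>>>> >>> >
>>> >>>>> >>> > (For reference, dropping the limit to 500 cluster-wide is just something like the two commands below -- a sketch using the centralized config; the same options can go in ceph.conf instead. The offline trim is still needed for logs that are already longer than that.)
>>> >>>>> >>> >
>>> >>>>> >>> > ceph config set osd osd_max_pg_log_entries 500
>>> >>>>> >>> > ceph config set osd osd_min_pg_log_entries 500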
>>> >>>>> >>> >
>>> >>>>> >>> > Cheers,
>>> >>>>> >>> > Kalle Happonen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx