Quick update: restarting OSDs is not enough for us to compact the DB. So we:

  stop the OSD
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
  start the OSD

(a rough scripted version of this cycle is at the very bottom of this mail).

It seems to fix the spillover. Until it grows again.

Cheers,
Kalle

----- Original Message -----
> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 1 December, 2020 15:09:37
> Subject: Re: osd_pglog memory hoarding - another case

> Hi All,
> back to this. Dan, it seems we're following exactly in your footsteps.
>
> We recovered from our large pg_logs and got the cluster running. A week after our cluster was OK, we started seeing big memory increases again. I don't know if we had buffer_anon issues before or if our big pg_logs were masking it. But we started seeing bluefs spillover and buffer_anon growth.
>
> This led to a whole other series of problems with OOM killing, which probably resulted in mon node DB growth which filled the disk, which resulted in all mons going down, and a bigger mess of bringing everything back up.
>
> However, we're back. But I think we can confirm the buffer_anon growth and bluefs spillover.
>
> We now have a job that constantly writes 10k objects into a bucket and deletes them.
>
> This may curb the memory growth, but I don't think it stops the problem. We're just testing restarting OSDs, and while it takes a while, it seems it may help. Of course this is not the greatest fix in production.
>
> Has anybody gleaned any new information on this issue? Things to tweak? Fixes on the horizon? Other mitigations?
>
> Cheers,
> Kalle
>
>
> ----- Original Message -----
>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Thursday, 19 November, 2020 13:56:37
>> Subject: Re: osd_pglog memory hoarding - another case
>
>> Hello,
>> I thought I'd post an update.
>>
>> Setting the pg_log size to 500 and running the offline trim operation sequentially on all OSDs seems to help. With our current setup it takes about 12-48 h per node, depending on the PGs per OSD. The PG counts per OSD we have are ~180-750, with a majority around 200, and some nodes consistently have 500 per OSD. The limiting factor for the recovery time seems to be our NVMe, which we use for RocksDB for the OSDs.
>>
>> We haven't fully recovered yet; we're working on it. Almost all our PGs are back up, we still have ~40/18000 PGs down, but I think we'll get there. Currently ~40 of 1200 OSDs are down.
>>
>> The previous mention of 32 kB per pg_log entry seems to be in the correct order of magnitude for us too. If we count 32 kB * 200 PGs * 3000 log entries, we're close to the 20 GB per OSD process.
>>
>> For the nodes that have been trimmed, we're hovering around 100 GB/node of memory use, or ~4 GB per OSD, and so far it seems stable, but we don't have longer-term data on that, and we don't know exactly how it behaves when load is applied. However, if we're currently at the pg_log limit of 500, adding load should hopefully not increase pg_log memory consumption.
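>>
>> (In case it helps when comparing numbers: the per-OSD pg_log memory should be readable from the OSD admin socket, assuming you have access to it on the OSD host, with something like
>>
>>   ceph daemon osd.<id> dump_mempools | jq '.mempool.by_pool.osd_pglog'
>>
>> which reports the items and bytes of the osd_pglog mempool.)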
>>
>> Cheers,
>> Kalle
>>
>> ----- Original Message -----
>>> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> Subject: Re: osd_pglog memory hoarding - another case
>>
>>> Hi,
>>>
>>>> I don't think the default osd_min_pg_log_entries has changed recently.
>>>> In https://tracker.ceph.com/issues/47775 I proposed that we limit the pg log length by memory -- if it is indeed possible for log entries to get into several MB, then this would be necessary IMHO.
>>>
>>> I've had a surprising crash course on pg_log in the last 36 hours. But for the size of each entry, you're right. I counted pg_log * OSDs, and did not factor in pg_log * OSDs * PGs on the OSD. Still, the total memory an OSD uses for pg_log was ~22 GB per OSD process.
>>>
>>>> But you said you were trimming PG logs with the offline tool? How long were those logs that needed to be trimmed?
>>>
>>> The logs we are trimming were ~3000 entries; we trimmed them to the new size of 500. After restarting the OSDs, the pg_log memory usage dropped from ~22 GB to what we guess is 2-3 GB, but with the cluster in this state, it's hard to be specific.
>>>
>>> Cheers,
>>> Kalle
>>>
>>>> -- dan
>>>>
>>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>>>
>>>>> Another idea, which I don't know if it has any merit.
>>>>>
>>>>> If 8 MB is a realistic log size (or has this grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)?
>>>>>
>>>>> If the minimum were set to 1000, at 8 MB per log, we would have issues with memory.
>>>>>
>>>>> Cheers,
>>>>> Kalle
>>>>>
>>>>> ----- Original Message -----
>>>>> > From: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>>>> > To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>>>> > Subject: Re: osd_pglog memory hoarding - another case
>>>>>
>>>>> > Hi Dan & co.,
>>>>> > Thanks for the support (moral and technical).
>>>>> >
>>>>> > That sounds like a good guess, but it seems there is nothing alarming here. In all our pools some PG logs are a bit over 3100, but not at any exceptional values.
>>>>> >
>>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>>>> >   "pgid": "37.2b9",
>>>>> >   "ondisk_log_size": 3103,
>>>>> >   "pgid": "33.e",
>>>>> >   "ondisk_log_size": 3229,
>>>>> >   "pgid": "7.2",
>>>>> >   "ondisk_log_size": 3111,
>>>>> >   "pgid": "26.4",
>>>>> >   "ondisk_log_size": 3185,
>>>>> >   "pgid": "33.4",
>>>>> >   "ondisk_log_size": 3311,
>>>>> >   "pgid": "33.8",
>>>>> >   "ondisk_log_size": 3278,
>>>>> >
>>>>> > I also have no idea what the average size of a pg_log entry should be; in our case it seems to be around 8 MB (22 GB / 3000 entries).
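>>>>> >
>>>>> > (If it's useful to anyone: the same check without the egrep step, printing just pgid/log-length pairs, should work with something along these lines, assuming the same pg dump JSON as above.)
>>>>> >
>>>>> >   ceph pg dump -f json | jq -r '.pg_map.pg_stats[] | select(.ondisk_log_size > 3100) | "\(.pgid) \(.ondisk_log_size)"'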
>>>>> >
>>>>> > Cheers,
>>>>> > Kalle
>>>>> >
>>>>> > ----- Original Message -----
>>>>> >> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>>>>> >> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>>>>> >> Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>
>>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>>>> >> Subject: Re: osd_pglog memory hoarding - another case
>>>>> >
>>>>> >> Hi Kalle,
>>>>> >>
>>>>> >> Do you have active PGs now with huge pglogs? You can do something like this to find them:
>>>>> >>
>>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] | select(.ondisk_log_size > 3000)'
>>>>> >>
>>>>> >> If you find some, could you increase to debug_osd = 10 and then share the OSD log? I am interested in the debug lines from calc_trim_to_aggressively (or calc_trim_to if you didn't enable pglog_hardlimit), but the whole log might show other issues.
>>>>> >>
>>>>> >> Cheers, dan
>>>>> >>
>>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>> >>>
>>>>> >>> Hi Kalle,
>>>>> >>>
>>>>> >>> Strangely and luckily, in our case the memory explosion didn't reoccur after that incident. So I can mostly only offer moral support.
>>>>> >>>
>>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I think this is suspicious:
>>>>> >>>
>>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>>>> >>>
>>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>>>> >>>
>>>>> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if there could be an unforeseen condition where `last_update_ondisk` isn't being updated correctly, and therefore the OSD stops trimming the pg_log altogether.
>>>>> >>>
>>>>> >>> Xie or Samuel: does that sound possible?
>>>>> >>>
>>>>> >>> Cheers, Dan
>>>>> >>>
>>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>>>> >>> >
>>>>> >>> > Hello all,
>>>>> >>> > wrt:
>>>>> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>>>> >>> >
>>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>>>>> >>> >
>>>>> >>> > We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
>>>>> >>> >
>>>>> >>> > The cluster has been running fine, and (as relevant to this post) the memory usage has been stable at 100 GB/node. We've had the default pg_log of 3000. The user traffic doesn't seem to have been exceptional lately.
>>>>> >>> >
>>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory usage on the OSD nodes started to grow. On each node it grew steadily by about 30 GB/day, until the servers started OOM killing OSD processes.
>>>>> >>> >
>>>>> >>> > After a lot of debugging we found that the pg_logs were huge. Each OSD process's pg_log had grown to ~22 GB, which we naturally didn't have memory for, and then the cluster was in an unstable situation. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
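>>>>> >>> >
>>>>> >>> > (Back-of-envelope, if I'm counting right: ~20k PGs * 11 shards for the 8+3 EC pool, spread over ~1400 OSDs, is on the order of 150-200 pg_logs per OSD, each up to 3000 entries, so roughly half a million log entries per OSD process. ~22 GB over that would be a few tens of kB per entry.)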
>>>>> >>> >
>>>>> >>> > We've reduced the pg_log to 500, and started offline trimming it where we can, and also just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering, and we still have a lot of OSDs down and out.
>>>>> >>> >
>>>>> >>> > We're unsure if version 14.2.13 triggered this, or if the OSD restarts triggered this (or something unrelated we don't see).
>>>>> >>> >
>>>>> >>> > This mail is mostly to ask if there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure if 14.2.13 triggered this, this is also to put a data point out there for other debuggers.
>>>>> >>> >
>>>>> >>> > Cheers,
>>>>> >>> > Kalle Happonen
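
P.S. In case it saves someone some typing, here is a rough sketch of the compaction cycle mentioned at the top of this mail, as a loop over the OSD IDs on one host. It assumes non-containerized OSDs managed by systemd (ceph-osd@<id> units) and the default /var/lib/ceph/osd/ceph-<id> paths, so adjust for your own setup, and make sure each OSD is actually safe to take down before stopping it.

  # avoid rebalancing while OSDs are briefly down
  ceph osd set noout

  for osd in 1 2 3; do    # the OSD IDs on this host
      systemctl stop ceph-osd@$osd
      ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
      systemctl start ceph-osd@$osd
      # wait until the OSD answers again before moving to the next one
      while ! ceph tell osd.$osd version >/dev/null 2>&1; do sleep 10; done
  done

  ceph osd unset noout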