Re: osd_pglog memory hoarding - another case

Mark Nelson <mnelson@xxxxxxxxxx> · Tue, 17 Nov 2020 06:49:59 -0600

Hi Dan,

I 100% agree with your proposal.  One of the goals I had in mind with 
the prioritycache framework is that pglog could end up becoming another 
prioritycache target that is balanced against the other caches.  The 
idea would be that we try to keep some amount of pglog data in memory at 
high priority but ultimately the longer the log gets the less priority 
it gets relative to onode cache and other things (with some 
minimums/maximums in place as well).  Just yesterday Josh and I were 
talking about the possibility of keeping a longer running log on disk 
than what's represented in memory.  This could have 
implications for peering performance, but frankly I don't see how we 
keep using log based recovery in a world where we are putting OSDs on 
devices capable of hundreds of thousands of write IOPS.
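
To make the balancing idea above concrete, here is a toy sketch in Python.
It is not the Ceph prioritycache code: the Request type, the allocate
function, the tier sizes and the priorities are all hypothetical. It only
illustrates how a pglog split into a high-priority head and a low-priority
tail could compete with an onode cache for a fixed memory budget.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    cache: str         # hypothetical cache name, e.g. "pglog" or "onode"
    priority: int      # 0 is served first; larger numbers are served later
    bytes_wanted: int  # memory this cache would like at this priority


def allocate(budget: int, requests: list[Request]) -> dict[str, int]:
    """Grant memory in priority order until the budget runs out."""
    grants: dict[str, int] = {}
    remaining = budget
    for req in sorted(requests, key=lambda r: r.priority):
        granted = min(req.bytes_wanted, remaining)
        grants[req.cache] = grants.get(req.cache, 0) + granted
        remaining -= granted
    return grants


MiB = 1024 * 1024
requests = [
    Request("pglog", 0, 256 * MiB),   # recent log entries: high priority
    Request("onode", 1, 2048 * MiB),  # onode cache
    Request("pglog", 2, 4096 * MiB),  # long log tail: lowest priority
]
grants = allocate(3072 * MiB, requests)
print({name: size // MiB for name, size in grants.items()})
```

In this made-up example the log tail only receives what is left after the
onode cache has been served, so a long log cannot crowd out the other
caches.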

Mark

On 11/17/20 5:13 AM, Dan van der Ster wrote:
I don't think the default osd_min_pg_log_entries has changed recently.
In https://tracker.ceph.com/issues/47775 I proposed that we limit the
pg log length by memory -- if it is indeed possible for log entries to
get into several MB, then this would be necessary IMHO.
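
As a rough illustration of limiting the log length by memory, the sketch
below derives an entry cap from a per-PG byte budget. The function name,
the per-PG budget and the mean entry sizes are hypothetical inputs; the
clamp bounds merely play the role of a minimum and maximum entry count,
and this is not the logic from the tracker issue.

```python
def memory_capped_log_entries(
    pg_budget_bytes: int,
    mean_entry_bytes: int,
    min_entries: int,
    max_entries: int,
) -> int:
    """Return how many log entries fit in the budget, clamped to [min, max]."""
    if mean_entry_bytes <= 0:
        raise ValueError("mean_entry_bytes must be positive")
    fits = pg_budget_bytes // mean_entry_bytes
    return max(min_entries, min(max_entries, fits))


# Hypothetical numbers: a 16 MiB per-PG budget and two entry sizes.
MiB = 1024 * 1024
small = memory_capped_log_entries(16 * MiB, 1024, min_entries=500, max_entries=3000)
large = memory_capped_log_entries(16 * MiB, 8 * MiB, min_entries=500, max_entries=3000)
print(small, large)
```

Note that in this sketch the floor still wins when entries are several MB
each, so a memory-based limit would only help if the minimum can yield too.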

But you said you were trimming PG logs with the offline tool? How long
were those logs that needed to be trimmed?

-- dan

On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
Another idea, though I don't know if it has any merit.

If 8 MB is a realistic log size (or has this grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)?

If the minimum were set to 1000 entries, at 8 MB per entry we would have issues with memory.

Cheers,
Kalle

----- Original Message -----
From: "Kalle Happonen" <kalle.happonen@xxxxxx>
To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxx>
Sent: Tuesday, 17 November, 2020 12:45:25
Subject:  Re: osd_pglog memory hoarding - another case
Hi Dan & co.,
Thanks for the support (moral and technical).

That sounds like a good guess, but it seems like there is nothing alarming here.
In all our pools, some pgs are a bit over 3100, but not at any exceptional
values.

cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
  "pgid": "37.2b9",
  "ondisk_log_size": 3103,
  "pgid": "33.e",
  "ondisk_log_size": 3229,
  "pgid": "7.2",
  "ondisk_log_size": 3111,
  "pgid": "26.4",
  "ondisk_log_size": 3185,
  "pgid": "33.4",
  "ondisk_log_size": 3311,
  "pgid": "33.8",
  "ondisk_log_size": 3278,

I also have no idea what the average size of a pg log entry should be; in our
case it seems to be around 8 MB (22 GB / 3000 entries).
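
The back-of-the-envelope figure above can be reproduced as follows. Note
that 22 GB is the pg_log total for one OSD process, so dividing by 3000
treats it as a single log; if the OSD hosts many PGs with ~3000 entries
each, the mean entry is correspondingly smaller. The pgs_per_osd value of
150 below is a hypothetical placeholder, not a measured number.

```python
def mean_entry_bytes(pglog_bytes: float, entries_per_pg: int, pgs_per_osd: int = 1) -> float:
    """Mean pg log entry size for one OSD, given its total pg_log memory."""
    return pglog_bytes / (entries_per_pg * pgs_per_osd)


GB = 1000 ** 3
MB = 1000 ** 2

# Figure from the message: 22 GB / 3000 entries, treated as one log.
single_log = mean_entry_bytes(22 * GB, 3000)
print(f"{single_log / MB:.1f} MB per entry")

# Hypothetical: the same 22 GB spread over 150 PGs on the OSD.
spread = mean_entry_bytes(22 * GB, 3000, pgs_per_osd=150)
print(f"{spread / 1000:.1f} kB per entry")
```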

Cheers,
Kalle

----- Original Message -----
From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
To: "Kalle Happonen" <kalle.happonen@xxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>,
"Samuel Just" <sjust@xxxxxxxxxx>
Sent: Tuesday, 17 November, 2020 12:22:28
Subject: Re:  osd_pglog memory hoarding - another case
Hi Kalle,

Do you have active PGs now with huge pglogs?
You can do something like this to find them:

   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3000)'

If you find some, could you increase debug_osd to 10 and then share the osd log?
I am interested in the debug lines from calc_trim_to_aggressively (or
calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
might show other issues.
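
The jq filter above can also be written as a short Python script, which is
handy for sorting the results. This is only a sketch: it assumes the JSON
layout that the jq expression implies (pg_map.pg_stats[] with pgid and
ondisk_log_size), large_pg_logs is a hypothetical helper name, and the
inline sample is made up for demonstration.

```python
from __future__ import annotations

import json


def large_pg_logs(pg_dump: dict, threshold: int = 3000) -> list[tuple[str, int]]:
    """Return (pgid, ondisk_log_size) pairs above threshold, largest first."""
    stats = pg_dump["pg_map"]["pg_stats"]
    hits = [
        (pg["pgid"], pg["ondisk_log_size"])
        for pg in stats
        if pg["ondisk_log_size"] > threshold
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Made-up sample in the assumed layout; in practice, load the saved output
# of `ceph pg dump -f json` with json.load() instead.
sample = json.loads("""
{"pg_map": {"pg_stats": [
  {"pgid": "7.2", "ondisk_log_size": 3111},
  {"pgid": "33.4", "ondisk_log_size": 3311},
  {"pgid": "1.0", "ondisk_log_size": 2950}
]}}
""")
for pgid, size in large_pg_logs(sample, threshold=3100):
    print(pgid, size)
```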

Cheers, dan

On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Kalle,

Strangely and luckily, in our case the memory explosion didn't reoccur
after that incident. So I can mostly only offer moral support.

But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
think this is suspicious:

    b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk

    https://github.com/ceph/ceph/commit/b670715eb4

Given that it adds a case where the pg_log is not trimmed, I wonder if
there could be an unforeseen condition where `last_update_ondisk`
isn't being updated correctly, and therefore the osd stops trimming
the pg_log altogether.
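
To illustrate the hypothesis, here is a toy model in Python; it is not
Ceph code. Versions are plain integers, trim_point and simulate are
hypothetical names, and the only rule taken from the commit title is that
trimming never goes past last_update_ondisk. If that bound stops
advancing, the log tail stops moving and the log grows with every write.

```python
from __future__ import annotations


def trim_point(head: int, tail: int, target_len: int, last_update_ondisk: int) -> int:
    """Return the new log tail: trim toward target_len, but never past
    last_update_ondisk and never backwards."""
    wanted = head - target_len
    return max(tail, min(wanted, last_update_ondisk))


def simulate(writes: int, target_len: int, ondisk_stuck_at: int | None) -> int:
    """Apply `writes` updates and return the final log length."""
    head = tail = 0
    for _ in range(writes):
        head += 1
        # Healthy case: last_update_ondisk follows head. Stuck case: it freezes.
        ondisk = head if ondisk_stuck_at is None else min(head, ondisk_stuck_at)
        tail = trim_point(head, tail, target_len, ondisk)
    return head - tail


print(simulate(10_000, target_len=3000, ondisk_stuck_at=None))  # stays at 3000
print(simulate(10_000, target_len=3000, ondisk_stuck_at=100))   # grows to 9900
```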

Xie or Samuel: does that sound possible?

Cheers, Dan

On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
Hello all,
wrt:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/

Yesterday we hit a problem with osd_pglog memory, similar to the thread above.

We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disks per node.
We run 8+3 EC for the data pool (metadata is on replicated nvme pool).

The cluster has been running fine, and (as relevant to the post) the memory
usage has been stable at 100 GB / node. We've had the default pg_log of 3000.
The user traffic doesn't seem to have been exceptional lately.

Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory
usage on OSD nodes started to grow. On each node it grew steadily by about 30
GB/day, until the servers started OOM killing OSD processes.

After a lot of debugging we found that the pg_logs were huge. Each OSD process's
pg_log had grown to ~22GB, which we naturally didn't have memory for, and then
the cluster was in an unstable situation. This is significantly more than the
1,5 GB in the post above. We do have ~20k pgs, which may directly affect the
size.

We've reduced the pg_log to 500, and started offline trimming it where we can,
and also just waited. The pg_log size dropped to ~1,2 GB on at least some
nodes, but we're still recovering, and still have a lot of OSDs down and out.

We're unsure if version 14.2.13 triggered this, or if the osd restarts triggered
this (or something unrelated we don't see).

This mail is mostly to find out whether anyone has good guesses as to why the
pg_log size per OSD process exploded. Any technical (and moral) support is appreciated.
Also, currently we're not sure if 14.2.13 triggered this, so this is also to
put a data point out there for other debuggers.

Cheers,
Kalle Happonen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx