Another idea, though I don't know if it has any merit. If 8 MB is a realistic log size (or has it grown for some reason?), did the enforcement (or default) of the minimum value change lately (osd_min_pg_log_entries)? If the minimum were set to 1000, at 8 MB per log, we would have memory issues.

Cheers,
Kalle

----- Original Message -----
> From: "Kalle Happonen" <kalle.happonen@xxxxxx>
> To: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 17 November, 2020 12:45:25
> Subject: Re: osd_pglog memory hoarding - another case
>
> Hi Dan & co.,
> Thanks for the support (moral and technical).
>
> That sounds like a good guess, but it seems there is nothing alarming here.
> In all our pools, some pgs are a bit over 3100, but not at any exceptional
> values.
>
> cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
> select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>   "pgid": "37.2b9",
>   "ondisk_log_size": 3103,
>   "pgid": "33.e",
>   "ondisk_log_size": 3229,
>   "pgid": "7.2",
>   "ondisk_log_size": 3111,
>   "pgid": "26.4",
>   "ondisk_log_size": 3185,
>   "pgid": "33.4",
>   "ondisk_log_size": 3311,
>   "pgid": "33.8",
>   "ondisk_log_size": 3278,
>
> I also have no idea what the average size of a pg log entry should be; in our
> case it seems to be around 8 MB (22 GB / 3000 entries).
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> To: "Kalle Happonen" <kalle.happonen@xxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>, "xie xingguo" <xie.xingguo@xxxxxxxxxx>,
>> "Samuel Just" <sjust@xxxxxxxxxx>
>> Sent: Tuesday, 17 November, 2020 12:22:28
>> Subject: Re: osd_pglog memory hoarding - another case
>>
>> Hi Kalle,
>>
>> Do you have active PGs now with huge pglogs?
>> You can do something like this to find them:
>>
>> ceph pg dump -f json | jq '.pg_map.pg_stats[] |
>> select(.ondisk_log_size > 3000)'
>>
>> If you find some, could you increase to debug_osd = 10 and then share the osd log?
>> I am interested in the debug lines from calc_trim_to_aggressively (or
>> calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
>> might show other issues.
>>
>> Cheers, Dan
>>
>>
>> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>
>>> Hi Kalle,
>>>
>>> Strangely and luckily, in our case the memory explosion didn't recur
>>> after that incident, so I can mostly only offer moral support.
>>>
>>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
>>> think this is suspicious:
>>>
>>> b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>>
>>> https://github.com/ceph/ceph/commit/b670715eb4
>>>
>>> Given that it adds a case where the pg_log is not trimmed, I wonder if
>>> there could be an unforeseen condition where `last_update_ondisk`
>>> isn't being updated correctly, and therefore the osd stops trimming
>>> the pg_log altogether.
>>>
>>> Xie or Samuel: does that sound possible?
>>>
>>> Cheers, Dan
>>>
>>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen@xxxxxx> wrote:
>>> >
>>> > Hello all,
>>> > wrt:
>>> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>> >
>>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>>> >
>>> > We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node.
>>> > We run 8+3 EC for the data pool (metadata is on a replicated nvme pool).
>>> >
>>> > The cluster has been running fine, and (as relevant to the post) the memory
>>> > usage has been stable at 100 GB / node. We've had the default pg_log of 3000.
>>> > The user traffic doesn't seem to have been exceptional lately.
>>> >
>>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory
>>> > usage on OSD nodes started to grow. On each node it grew steadily by about 30
>>> > GB/day, until the servers started OOM killing OSD processes.
>>> >
>>> > After a lot of debugging we found that the pg_logs were huge. Each OSD process's
>>> > pg_log had grown to ~22 GB, which we naturally didn't have memory for, and then
>>> > the cluster was in an unstable situation. This is significantly more than the
>>> > 1.5 GB in the post above. We do have ~20k pgs, which may directly affect the
>>> > size.
>>> >
>>> > We've reduced the pg_log to 500, and started offline trimming it where we can,
>>> > and also just waited. The pg_log size dropped to ~1.2 GB on at least some
>>> > nodes, but we're still recovering, and have a lot of OSDs down and out still.
>>> >
>>> > We're unsure whether version 14.2.13 triggered this, or the osd restarts did
>>> > (or something unrelated we don't see).
>>> >
>>> > This mail is mostly to figure out if there are good guesses why the pg_log size
>>> > per OSD process exploded. Any technical (and moral) support is appreciated.
>>> > Also, since we're currently not sure whether 14.2.13 triggered this, this is
>>> > also to put a data point out there for other debuggers.
>>> >
>>> > Cheers,
>>> > Kalle Happonen
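
P.S. For anyone else chasing this, a couple of quick checks, as a sketch only:
the osd id, pgid and data path below are placeholders from our own setup, so
adjust them to your cluster, and the offline trim requires the OSD to be stopped.

  # per-OSD pglog memory from the mempool accounting (14.x admin socket output;
  # if the JSON layout differs on your version, just look for the osd_pglog section)
  ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.osd_pglog'

  # the pg log trim limits the OSD is actually running with
  ceph daemon osd.0 config get osd_min_pg_log_entries
  ceph daemon osd.0 config get osd_max_pg_log_entries

  # offline trim of a single pg's log, with the OSD stopped (pgid is an example)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --pgid 37.2b9 --op trim-pg-log

The expected footprint scales roughly as (entries per pg) x (pgs per OSD) x
(average entry size), so either a trim that stops happening or an unusually
large per-entry size can blow it up.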