>
1) Do the 77 share some trait the other 80 don't.
2024-11-21 13:48:34.336 UTC pid=[322035][3] db=[MyDB] usr=[user1] client=[host1] app=[app1]LOG: duration: 1.071 ms parse <unnamed>: SELECT <..>
> 4) Do you have external tools accessing these databases?
We have internal micro-services accessing the databases, as well as a monitoring tool (Netdata), and some of the Devs use pgAdmin. I discarded the scenario where someone would inadvertently do a "pg_stat_reset" via pgAdmin, just because a lot of databases have their stats reset within a short period of time.
> 5) Is the cluster directly open to the world?
No pattern found yet .. but still verifying a few things
> 2) Do the OS system logs reveal anything?
> 2) Do the OS system logs reveal anything?
Nothing found in syslog
> 3) What was happening in the databases just prior to the time the stats
reset?
Here's an example (log extracts) for a stats reset occurrence:
> 3) What was happening in the databases just prior to the time the stats
reset?
Here's an example (log extracts) for a stats reset occurrence:
select datname, stats_reset, now()-stats_reset as since_reset
from pg_stat_database
where ( now()-stats_reset ) < interval '1 day'
order by 3 limit 1;
datname | stats_reset | since_reset
----------------+-------------------------------+-----------------
MyDB | 2024-11-21 13:48:34.332785+00 | 00:00:22.266304
from pg_stat_database
where ( now()-stats_reset ) < interval '1 day'
order by 3 limit 1;
datname | stats_reset | since_reset
----------------+-------------------------------+-----------------
MyDB | 2024-11-21 13:48:34.332785+00 | 00:00:22.266304
<--LOGS-->
2024-11-21 13:48:34.324 UTC pid=[322035][2] db=[MyDB] usr=[user1] client=[host1] app=[[unknown]]LOG: connection authorized: user=user1 database=MyDB applicatio
n_name=app1 <..>
n_name=app1 <..>
<.. no calls at "2024-11-21 13:48:34.332" - WHY?? ..>
2024-11-21 13:48:34.336 UTC pid=[322035][3] db=[MyDB] usr=[user1] client=[host1] app=[app1]LOG: duration: 1.071 ms parse <unnamed>: SELECT <..>
<--LOGS-->
As you can see from above, the stats for MyDB were reset at ".332" . The only logs before/after for the db was the connection (at .324), and then the parse (at .336). NB: I also checked the logs at ".333" in case there would have been a rounding up, but nothing relevant was found. With that said, I only verified one occurence - tomorrow I'll check a few more just to validate.
> 4) Do you have external tools accessing these databases?
We have internal micro-services accessing the databases, as well as a monitoring tool (Netdata), and some of the Devs use pgAdmin. I discarded the scenario where someone would inadvertently do a "pg_stat_reset" via pgAdmin, just because a lot of databases have their stats reset within a short period of time.
On the other hand, Netdata does connect to most (if not all) databases frequently by its nature - so as a test, I stopped the Netdata service today to see if tomorrow we're still seeing the stats reset or not. I can report back tomorrow on this.
> 5) Is the cluster directly open to the world?
No. It's an on-premise installation. Only local applications can connect to it.
-Steeve
On Thu, Nov 21, 2024 at 4:32 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 11/21/24 13:31, Steeve Boulanger wrote:
> > All I can think to do is look at the logs around the stats_reset times
> > for the databases and see if there is anything relevant.
>
> That was already done, but nothing relevant was found unfortunately.
Unless it was not recognized as relevant. Since for the time being I am
eliminating magic as the cause, something concrete is causing this and
it should be leaving a trace. In your post you had this affecting 77 out
of 157 databases in the cluster.
1) Do the 77 share some trait the other 80 don't.
2) Do the OS system logs reveal anything?
3) What was happening in the databases just prior to the time the stats
reset?
4) Do you have external tools accessing these databases?
5) Is the cluster directly open to the world?
>
> -Steeve
>
> On Thu, Nov 21, 2024 at 3:12 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx
> <mailto:adrian.klaver@xxxxxxxxxxx>> wrote:
>
> On 11/21/24 12:57, Steeve Boulanger wrote:
> >
> > > Please reply to list also.
> >
> > My apologies - I thought I did a "Reply all", but apparently not.
> I'm a
> > little bit of a noob with email distrib lists.
> >
> > > 1) What is log_min_error_statement set to?
> >
> > name | setting | pending_restart
> > -------------------------+---------+-----------------
> > log_min_error_statement | error | f
> >
> > > 2) Did you reload the server when changing?:
> >
> > yes - pg_reload_conf()
>
> All I can think to do is look at the logs around the stats_reset times
> for the databases and see if there is anything relevant.
>
> >
> > -Steeve
>
>
> --
> Adrian Klaver
> adrian.klaver@xxxxxxxxxxx <mailto:adrian.klaver@xxxxxxxxxxx>
>
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx