I've done some more testing and the problem seems to be repmgr itself. A few details below...

----- Original Message -----
> From: Greg Williamson <gwilliamson39@xxxxxxxxx>
> To: Tom Lane <tgl@xxxxxxxxxxxxx>
> Cc: "pgsql-admin@xxxxxxxxxxxxxx" <pgsql-admin@xxxxxxxxxxxxxx>
> Sent: Thursday, September 27, 2012 7:23 PM
> Subject: Re: Database size stays constant but disk space keeps shrinking -- postgres 9.1
>
> Tom --
>
> ----- Original Message -----
>> From: Tom Lane <tgl@xxxxxxxxxxxxx>
>> To: Greg Williamson <gwilliamson39@xxxxxxxxx>
>> Cc: "pgsql-admin@xxxxxxxxxxxxxx" <pgsql-admin@xxxxxxxxxxxxxx>
>> Sent: Thursday, September 27, 2012 7:14 PM
>> Subject: Re: Database size stays constant but disk space keeps shrinking -- postgres 9.1
>>
>> Greg Williamson <gwilliamson39@xxxxxxxxx> writes:
>>>> Have you checked to see if there are any processes that have open handles to
>>>> deleted files (lsof -X | grep deleted).
>>
>>> lsof -X | grep deleted | wc -l
>>
>>> shows: 835 such files.
>>
>>> A couple:
>>> postgres 2540 postgres 50u REG 8,3   409600    93429 /var/lib/postgresql/9.1/main/base/2789200/11816 (deleted)
>>> postgres 2540 postgres 51u REG 8,3 18112512 49694570 /var/lib/postgresql/9.1/main/base/2789200/2791679 (deleted)
>>> <...>
>>
>> So, which processes are holding these open, and what are they doing
>> exactly?  Let's see output from ps and pg_stat_activity, maybe even
>> attach to them with gdb and get stack traces.
>>
>>> We've a planned restart scheduled soon which will let me find any
>>> scripts that might be keeping things open,
>>
>> A restart will destroy all the evidence, so let's not be in a hurry
>> to do that before we've identified what's happening.
>>
>> regards, tom lane
>
> Thanks for the suggestions -- I'll post back when I have more info. Many of
> these do not seem to have a link to any identifiable process that is still
> running, but some do and they have pointed me away from the hourly drop /
> rebuild, at least for now. Looks like the stats database may be the issue.
>
> Greg W.

I turned off the cronjob that did the hourly database create / drop and am still leaking disk space, but a bit slower -- only lost 2 gigs overnight.
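In case it's useful, this is roughly how I'm tying the lsof output back to live backends -- just a sketch, with the PID to be filled in from whatever lsof reports (on 9.1 the pg_stat_activity column is procpid, not pid):

    # count deleted-but-still-open files per backend PID (PID is lsof's second column)
    lsof -X | grep deleted | awk '{print $2}' | sort | uniq -c | sort -rn

    # look a suspect PID up in pg_stat_activity
    psql -U postgres -d postgres -c \
      "SELECT procpid, usename, datname, backend_start, current_query
         FROM pg_stat_activity WHERE procpid = <pid from lsof>;"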
While the create/drop process is running I see these data directories:

postgres@db11:~$ ls -lrt 9.1/main/base
total 200
drwx------ 2 postgres postgres     6 2012-09-21 16:36 pgsql_tmp
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 16387
drwx------ 2 postgres postgres 16384 2012-10-01 00:26 1418400
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 2047839
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 11946
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16449
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16392
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16402
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 11938
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 1
drwx------ 2 postgres postgres  8192 2012-10-01 08:17 16424
drwx------ 2 postgres postgres 32768 2012-10-01 19:20 3171846

When it is done (note the last directory is now gone):

postgres@db11:~$ ls -lrt 9.1/main/base
total 140
drwx------ 2 postgres postgres     6 2012-09-21 16:36 pgsql_tmp
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 16387
drwx------ 2 postgres postgres 16384 2012-10-01 00:26 1418400
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 2047839
drwx------ 2 postgres postgres  8192 2012-10-01 00:26 11946
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16449
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16392
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 16402
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 11938
drwx------ 2 postgres postgres  8192 2012-10-01 00:27 1
drwx------ 2 postgres postgres  8192 2012-10-01 08:17 16424

When I run lsof -X and grep for deleted files I see these 4 new entries added since the last database create/drop:

ase/3167420/3169915 (deleted)
postgres 21116 postgres 66u REG 8,3 19709952 136501576 /var/lib/postgresql/9.1/main/base/3171846/3174279 (deleted)
postgres 21116 postgres 67u REG 8,3 15450112 136501574 /var/lib/postgresql/9.1/main/base/3171846/3174278 (deleted)
postgres 21116 postgres 68u REG 8,3 28344320 136410873 /var/lib/postgresql/9.1/main/base/3171846/3172541 (deleted)
postgres 21116 postgres 69u REG 8,3 82452480 144333458 /var/lib/postgresql/9.1/main/base/3171846/3174341 (deleted)

All four are held by the repmgr connection:

root@db11:~# ps auxww | grep 21116
postgres 21116  0.0  0.1 100416 32332 ?     Ss   00:26   0:16 postgres: repmgr repmgr 199.9.xxx.yyy(45239) idle
root     25755  0.0  0.0   6440   840 pts/2 S+   19:38   0:00 grep --color=auto 21116

======

With the database create/drop suspended we still see a steady accumulation of open descriptors for deleted files, but at a slower rate:

< /dev/sda3   67G   28G   39G  42% /
---
> /dev/sda3   67G   29G   38G  44% /

Other than abandoning repmgr I don't see a solution. I've posted this to the repmgr discussion group but have had zero responses (and, frankly, am not holding my breath). If anyone has any suggestions I'm all ears.

Thanks for the bandwidth!

Greg W.
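P.S. One thing I may try before the next maintenance window, on the (unverified) assumption that repmgrd simply re-establishes its connection after losing a session: terminate the idle repmgr backend so the kernel can finally release the deleted files it is holding open. A rough sketch, using the PID and role name from the lsof/ps output above:

    -- on 9.1 the pg_stat_activity pid column is procpid
    SELECT procpid, usename, datname, backend_start, current_query
      FROM pg_stat_activity
     WHERE usename = 'repmgr';

    -- pg_terminate_backend() sends SIGTERM to just that backend,
    -- so the rest of the cluster should be untouched
    SELECT pg_terminate_backend(21116);

That would at least reclaim the space without a full restart, though it doesn't explain why repmgr hangs on to descriptors for dropped databases in the first place.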