Re: Mysterious DB reset

Israel Brewster <israel@xxxxxxxxxxxxx> · Thu, 6 Mar 2014 09:43:32 -0900

On Mar 6, 2014, at 9:03 AM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:

> On 03/06/2014 09:33 AM, Israel Brewster wrote:
>> For starters, this happened again this morning (no data prior to 4:45 am and sequence reset), so whatever is going on appears to be reoccurring. Also, I forgot to mention if it is significant: this is running on slackware liunux 14.0
>> 
>> On Mar 5, 2014, at 1:00 PM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
>> 
>>> On 03/05/2014 10:22 AM, Israel Brewster wrote:
>>>> My first thought was "Oh, I must have a typo in my
>>>> cleanup routine, such that it is deleting all records rather than only
>>>> those a week old, and it's just that no one has noticed until now". So I
>>>> looked at that, but changing the delete to a select appeared to produce
>>>> the proper results, in that no records were selected:
>>> 
>>> Well it would, if the records only go back to 4 AM this morning. In other words if no records exist before 4 AM today, no records exist before 7 days ago also or am I missing something?
>> 
>> If the delete is correct, you are absolutely right. My first theory, however, was that I made a typo, and the delete was deleting ALL records, in which case changing it to a select would select all records. As it did not, that seems to confirm the delete is correct, and therefore not the problem.
>> 
>>> A sequence is just a special table.
>>> 
>>> So what does SELECT * from the sequence show?
>> 
>> tracking=> SELECT * FROM data_id_seq;
>>  sequence_name | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called
>> ---------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
>>  data_id_seq   |       1184 |           1 |            1 | 9223372036854775807 |         1 |           1 |      16 | f         | t
>> 
>> 
>>> 
>>>> 
>>>> Also odd is that my cleanup script runs at 1am. I have records of there
>>>> being new data in the database up to 3:51am, but the oldest record
>>>> currently in the DB is from 4:45am (as specified by the default of now()
>>>> on the column). So I know records were added after my delete command
>>>> ran, but before this reset occurred.
>>> 
>>> I am not sure what you are calling the 'reset'?
>>> Did something happen between 3:51 AM and 4:45 AM?
>> 
>> Yes: All my data was deleted and the sequence reset to 1.
>> 
>>> Also not sure why you call the 4:45 AM record the oldest, when you say you can identify records from 3:51 AM?
>> 
>> As I mentioned, I archive the records to permanent storage. This archive process happens every hour (for various reasons). That is how I know we had records for 3:51 am: they exist in the permanent archive. However, they don't exist in the local database any more.
> 
> Well something is happening. See my notes on logging below to help track down the cause.

Yep.

> 
>> 
>>> 
>>>> 
>>>> So my question is, aside from someone going in and mucking about in the
>>>> wee hours of the morning, what could possibly cause this behavior? What
>>>> sort of event could cause all data to be deleted from the table, and the
>>>> sequence to be reset? Especially while there is an active connection?
>>>> Thanks for any ideas, however wild or off the wall :-)
>>> 
>>> What is in the Postgres/system logs for the time period(s) you mention?
>> 
>> The postgres log has a lot of errors in it, some of which MAY explain the issue. For example:
>> 
>> cp: cannot create regular file '/mnt/pglogs/000000010000000400000094': Permission denied
>> LOG:  archive command failed with exit code 1
>> DETAIL:  The failed archive command was: test ! -f /mnt/pglogs/000000010000000400000094 && cp pg_xlog/000000010000000400000094 /mnt/pglogs/000000010000000400000094
>> WARNING:  transaction log file "000000010000000400000094" could not be archived: too many failures
>> LOG:  received smart shutdown request
>> LOG:  autovacuum launcher shutting down
>> LOG:  shutting down
>> LOG:  database system is shut down
> 
> Might be good to explain your archive setup.

Ok, here goes: We have the primary system which receives the data and handles all requests for said data. There is also a hot standby server keep in sync with streaming replication. The WALs are archived to a NFS share on this machine.

Once an hour a python script runs that a) Selects all unsynced records from the postgresql db, b) stores a subset of them in our permanent archive, and c) marks the previously selected records as synced (UPDATE data SET syncd=true WHERE id in (...) )

Additionally, I have a) a script that runs at 8:00pm every evening that uses pg_dump to dump the contents of the database to a backup file, and b) a script that runs at 8:00 each morning that rsync's various config files and scripts (such as my data retrieval scripts) from the primary machine to a backup location on the secondary machine.

None of the scripts run anywhere near the apparent 4:40ish cutoff time for my data

Make sense? Probably not the best setup, but then that's what happens when you figure out stuff for yourself rather than having formal training :-) I'm DEFINITELY open to suggestions :-)

> 
>> 
>> However, there are no timestamps on any of the entries (can I fix that?), so I don't know if those are current entries, or from back before I got the mount for the logs working. At this time, the mount point IS working correctly, and from what I can tell so is the archive command. The latest entry is from yesterday (modify date on the log shows Mar 5, 9:21, when I was messing with it yesterday), however, so there are no entries from this morning when it happened again. I don't see anything of interest in the syslog or messages log.
> 
> Yes you can, timestamps and a lot more. For all the details go here:
> 
> http://www.postgresql.org/docs/9.2/interactive/runtime-config-logging.html
> 
> At the least I would:
> 
> log_statement = 'mod'
> 
> log_connections = on
> log_disconnections = on
> 
> log_line_prefix = '%u-%m-%x'

I'll get those in the config, and we'll see what happens tomorrow morning. Hopefully that will give more information. Thanks for the link and information!

> 
>> 
>>> 
>>>> 
>>>> -----------------------------------------------
>>>> Israel Brewster
>>>> Computer Support Technician II
>>>> Era Alaska
>>>> 5245 Airport Industrial Rd
>>>> Fairbanks, AK 99709
>>>> (907) 450-7250 x7293
>>>> -----------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Adrian Klaver
>>> adrian.klaver@xxxxxxxxxxx

-----------------------------------------------
Israel Brewster
Computer Support Technician II
Era Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x7293
-----------------------------------------------

BEGIN:VCARD
VERSION:3.0
N:Brewster;Israel;;;
FN:Israel Brewster
ORG:Frontier Flying Service;MIS
TITLE:PC Support Tech II
EMAIL;type=INTERNET;type=WORK;type=pref:israel@xxxxxxxxxxxxxxxxxx
TEL;type=WORK;type=pref:907-450-7293
item1.ADR;type=WORK;type=pref:;;5245 Airport Industrial Wy;Fairbanks;AK;99701;
item1.X-ABADR:us
CATEGORIES:General
X-ABUID:36305438-95EA-4410-91AB-45D16CABCDDC\:ABPerson
END:VCARD

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general