Environment: PostgreSQL 9.6.6 installed from the yum repository, on Oracle Linux 7 EL x64. Dell servers with hardware RAID 5.
I was testing our database backup system (based on pgBarman) and discovered that one base file is corrupt on our standby database server. The file is fine on the master server but is 0 bytes in size on the standby. Nothing on either the master or the standby indicates that the problem exists; replication is running fine.
Evidence:

On master server:
[root@server2 1106839]# find 6302536 -exec stat \{\} \;
  File: “6302536”
  Size: 16793600   Blocks: 32800   IO Block: 4096   regular file
Device: f902h/63746d   Inode: 10618465   Links: 1
Access: (0600/-rw-------)   Uid: ( 26/postgres)   Gid: ( 26/postgres)
Access: 2017-12-08 21:35:38.670841051 -0200
Modify: 2017-12-21 22:51:40.706074439 -0200
Change: 2017-12-21 22:51:40.706074439 -0200
 Birth: -
On standby server:
[root@server3 1106839]# find 6302536 -exec stat \{\} \;
  File: “6302536”
  Size: 0   Blocks: 0   IO Block: 4096   regular empty file
Device: f901h/63745d   Inode: 391519656   Links: 1
Access: (0600/-rw-------)   Uid: ( 26/postgres)   Gid: ( 26/postgres)
Access: 2017-12-09 15:50:47.469135640 -0200
Modify: 2017-12-09 15:50:47.469135640 -0200
Change: 2017-12-09 15:50:47.469135640 -0200
 Birth: -
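
The manual comparison above could be repeated for every file in the database directory. What follows is only a minimal sketch in Python, not part of the original setup: the function names `list_sizes` and `find_empty_on_standby` are hypothetical, and it assumes each server can produce a name-to-size listing that is then compared in one place. Zero-byte relation files can be legitimate (for example, empty tables), and a standby may lag slightly behind the master, so the check flags only files that are empty on the standby while non-empty on the master, which is the pattern seen here.

```python
import os


def list_sizes(db_dir):
    """Map each regular file name directly under db_dir to its size in bytes."""
    sizes = {}
    with os.scandir(db_dir) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def find_empty_on_standby(master_sizes, standby_sizes):
    """Return sorted names that are non-empty on master but 0 bytes on standby.

    Files missing from the standby listing are not reported here; they
    would need a separate check.
    """
    return sorted(
        name
        for name, size in master_sizes.items()
        if size > 0 and standby_sizes.get(name) == 0
    )
```

One possible usage is to run `list_sizes` against the `base/1106839` directory inside the data directory on each server, save the results (for example as JSON), and pass both mappings to `find_empty_on_standby`. Any name it reports is a candidate for closer inspection, not proof of corruption.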
After a long investigation, I discovered that executing a query on the standby server fails with:
< 2017-12-22 11:20:22.417 -02 > ERROR: could not read block 0 in file "base/1106839/6302536": read only 0 of 8192 bytes
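
The path in that error encodes the database OID (1106839) and the relation's filenode (6302536), which is enough to find out which table or index is affected. The sketch below is a hedged illustration, not a confirmed part of the poster's workflow: `parse_read_error` and `lookup_sql` are hypothetical helper names, and the regular expression assumes a main-fork file in the database's default tablespace (a path starting with `base/`). The generated SQL uses `pg_filenode_relation`, available in PostgreSQL 9.4 and later, where a tablespace argument of 0 means the database's default tablespace.

```python
import re

# Matches errors such as:
#   could not read block 0 in file "base/1106839/6302536": read only 0 of 8192 bytes
# An optional ".N" segment suffix is tolerated; fork suffixes (_fsm, _vm) are not.
ERROR_RE = re.compile(
    r'could not read block (?P<block>\d+) in file '
    r'"base/(?P<db_oid>\d+)/(?P<filenode>\d+)(?:\.\d+)?"'
)


def parse_read_error(line):
    """Extract block number, database OID and filenode from a log line, or None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def lookup_sql(info):
    """Build the two catalog queries that identify the database and relation.

    The second query must be run while connected to the database
    returned by the first one.
    """
    return [
        "SELECT datname FROM pg_database WHERE oid = {};".format(info["db_oid"]),
        "SELECT pg_filenode_relation(0, {});".format(info["filenode"]),
    ]
```

Running the generated queries on the master should show whether the damaged file belongs to a table or to an index. That distinction matters for recovery: an index can typically be rebuilt, while a table would need its data restored.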
The very same query on the master server works fine. And there is no replication error: everything is in sync between these two servers (I know, I'm beginning to be repetitive). I have about 30 servers with the same setup, and only this one has this flaw. The only difference is that this database is about 3 times larger than the others (about 90 GB in size). Master and standby have 23 ms of network lag, which does not seem to be a problem for the other databases on the same server.
Any advice?