On Mon, 2006-05-15 at 14:29 -0700, Jeff Frost wrote:
> On Mon, 15 May 2006, Simon Riggs wrote:
> > On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
> >> I've run into a problem with a PITR setup at a client. The problem is
> >> that whenever the CIFS NAS device that we're mounting at /mnt/pgbackup
> >> has problems
> >
> > What kind of problems?
>
> It becomes unwritable, for whatever reason CIFS shares become
> unwritable. It's a Windows 2003 NAS device and a reboot solves the
> problem, but it leaves no event logs on the Windows side of things, so
> it's difficult to determine the root cause.

You should be able to re-create this problem without the database being
involved. Just set up a driver program over the top of the archive script
so it runs in a tighter loop than the archiver would. If you still get
the Windows NAS error... well, I'll leave that to you.

> >> , it seems that the current client connection gets blocked and this
> >> eventually builds up to a "sorry, too many clients already" error.

Tell us more about what the blockage looks like. We may yet thank Windows
for finding a bug, but I'm not sure yet.

> > This sounds like the archiver keeps waking up and trying the command,
> > but it fails, yet that request is causing a resource leak on the NAS.
> > Eventually, the archiver's retry of the command fails. Or am I
> > misunderstanding your issues?
>
> That's possible. Does the archiver use a DB connection whenever it
> tries to run archive_command?

Not at all.

> If so, then that's almost certainly the problem. I suspect a faster
> timeout on the CIFS mount would fix the issue as well, but I didn't
> see any such options in the mount.cifs manpage.
>
> > The archiver is designed around the thought that *attempting* to
> > archive is a task that it can do indefinitely without a problem; it's
> > up to you to spot that the link is down.
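A minimal sketch of such a driver, under stated assumptions: the DEST
default, COUNT, and the fake_wal filenames are all made up here — point
DEST at the real /mnt/pgbackup mount, and replace the plain cp with the
actual archive script if archive_command calls one:

```shell
#!/bin/sh
# Tight-loop driver to hammer the backup mount the way the archiver
# would, but faster, so the CIFS failure can be reproduced without the
# database being involved.
# DEST defaults to a scratch directory so the sketch is runnable as-is;
# set DEST=/mnt/pgbackup (and raise COUNT) for the real test.
DEST="${DEST:-$(mktemp -d)}"
COUNT="${COUNT:-5}"

SRC="$(mktemp)"
dd if=/dev/zero of="$SRC" bs=1M count=16 2>/dev/null   # WAL segments are 16MB

i=0
while [ "$i" -lt "$COUNT" ]; do
    # Substitute the real archive script for this cp if you use one.
    if ! cp "$SRC" "$DEST/fake_wal_$i"; then
        echo "copy failed on iteration $i" >&2
        rm -f "$SRC"
        exit 1
    fi
    i=$((i + 1))
done
rm -f "$SRC"
echo "completed $COUNT iterations"
```

If the mount wedges under this loop alone, the problem is clearly in the
CIFS layer rather than in the archiver.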
> > We can put something in to make the retry period elongate, but you'd
> > need to put a reasonable case for how that would increase robustness.
>
> That all sounds perfectly reasonable. If the archiver is using up a
> connection for each archive_command issued, then I suspect that's our
> problem, as there were also lots of debug logs showing that the db was
> trying to archive several WAL files at near the same time, likely
> pushing us over our 100 connection limit.

Oh, you mean database clients cannot connect. I thought you meant you
were getting a CIFS client connection error from the archiver. That's
weird.

> If the archiver does not use up a connection, then I suppose I don't
> know what's actually going on unless postgres blocks the commit of the
> transaction which triggered the archive_command until the archive
> command finishes (or fails).

I think you need to show the database log covering the period in error.

Are you running out of disk space in the database directory?

Can you check again that pg_xlog and pg_xlog/archive_status are
definitely not on the NAS?

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
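[The two checks above can be scripted. This is a sketch only: the PGDATA
default is an assumption for a typical Linux install, and backing_fs is
a hypothetical helper name, not anything PostgreSQL ships.]

```shell
#!/bin/sh
# Sketch of the two checks: free space in the database directory, and
# which filesystem actually backs pg_xlog and pg_xlog/archive_status.
# PGDATA default is an assumption; set it to the real data directory.
PGDATA="${PGDATA:-/var/lib/pgsql/data}"

# Print the device and mount point backing a given path.
backing_fs() {
    df -P "$1" | tail -1 | awk '{print $1 " mounted at " $6}'
}

df -h "$PGDATA" 2>/dev/null              # running out of disk space?

for d in "$PGDATA/pg_xlog" "$PGDATA/pg_xlog/archive_status"; do
    if [ -e "$d" ]; then
        echo "$d: $(backing_fs "$d")"    # must NOT show the CIFS mount
    else
        echo "$d: not found -- check PGDATA"
    fi
done
```

If either line reports the CIFS device, WAL writes themselves go through
the flaky mount, which would explain backends blocking until connections
run out.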