Re: PG_DUMP very slow because of STDOUT ??

Andras Fabian <Fabian@xxxxxxxxxx> · Mon, 12 Jul 2010 13:03:49 +0000

This STDOUT issue gets even weirder. I have now set up our two new servers (identical hw/sw), which I needed to do anyway. Once PG was running, I set up the same test scenario as on our problematic servers and started the COPY-to-STDOUT experiment. And you know what? Both new servers are performing well. No hanging, and the 3 GByte test dump was written in around 3 minutes (as expected). To make things even more complicated ... I went back to our production servers. The first one, which I froze with oprofile this morning and had to REBOOT, is now performing well too! It needed 3 minutes for the test case ... WTF? BUT the second production server, which was not rebooted, is still behaving badly.
Then I tried to dig deeper (without killing a production server again) ... and compared the output of PS (first with the '-fax' parameter, then with '-axl'). I found something interesting:
- on all fast servers, the COPY process is in the state Rs ("runnable (on run queue)")
- on the still slow server, this process is in the state Ds ("uninterruptible sleep (usually IO)") in 9 out of 10 samples

Now, this "Ds" state seems to be something unhealthy, especially if the process is in it almost all the time, as far as my first reading on Google shows (and although it points to IO, there seems to be only very little IO, and IO-wait is minimal too). I also ran PS with "-axl", which gives the following line for our process:
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1  5551  2819  4201  20   0 5941068 201192 conges Ds ?          2:05 postgres: postgres musicload_cache [local] COPY

Now, as far as I understood from my Google searches, the WCHAN column shows where in the kernel my process is waiting. Here it says "conges". Can somebody tell me what "conges" means? Or do I have other options for getting more info out of the system (preferably without oprofile, as I already burned my hand on it :-)?
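
The WCHAN column in the "-axl" output is cut to the column width, so "conges" is only the start of a kernel symbol name. On Linux the untruncated name can usually be read from /proc/<pid>/wchan (a wider ps column such as "ps -o pid,stat,wchan:32,cmd -p <pid>" should show it as well). Below is a minimal Python sketch that repeats the "10 samples" experiment from above and counts state and wait channel together. It assumes a Linux /proc filesystem and sufficient permission to read the target process; the names parse_state and sample are made up for illustration.

```python
import sys
import time
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so take the first field after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def sample(pid: int, samples: int = 10, interval: float = 1.0) -> Counter:
    """Count the (state, wchan) pairs seen over a number of samples."""
    proc = Path("/proc") / str(pid)
    seen = Counter()
    for _ in range(samples):
        state = parse_state((proc / "stat").read_text())
        # /proc/<pid>/wchan holds the full kernel symbol the process is
        # sleeping in; it may read "0" when the process is not blocked
        # or when the kernel does not expose the value to this user.
        wchan = (proc / "wchan").read_text().strip() or "-"
        seen[(state, wchan)] += 1
        time.sleep(interval)
    return seen


# Usage (hypothetical script name): python sample_wchan.py 2819
if __name__ == "__main__" and len(sys.argv) > 1:
    for (state, wchan), count in sample(int(sys.argv[1])).most_common():
        print(f"{count:3d}  {state}  {wchan}")
```

If most samples show state D with the same wait channel, that symbol name is the thing to search for.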

And yes, I now see a reboot as a possible "fix", but that would not guarantee that the problem will not resurface. So, for the time being, I will leave my second production server as is ... so I can further narrow down the potential reasons for this strange STDOUT slowdown (especially if someone has a tip for me :-)

Andras Fabian

(In the meantime my "slow" server finished the COPY ... it took 46 minutes instead of the 3 minutes on the fast machines ... a slowdown by a factor of about 15.)

-----Original Message-----
From: Andras Fabian
Sent: Monday, 12 July 2010 10:45
To: 'Tom Lane'
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: RE: PG_DUMP very slow because of STDOUT ??

Hi Tom (or others),

Are there any recommended settings/ways to use oprofile in a situation like this? I got it working and have seen a first profile report, but then managed to completely freeze the server on a second try with different oprofile settings (the next tests will run against the newly installed, identical servers).

Andras Fabian

-----Original Message-----
From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx]
Sent: Friday, 9 July 2010 15:39
To: Andras Fabian
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re: PG_DUMP very slow because of STDOUT ??

Andras Fabian <Fabian@xxxxxxxxxx> writes:
> Now I ask, what's going on here ???? Why is COPY via STDOUT so much slower on our new machine?

Something weird about the network stack on the new machine, maybe.
Have you compared the transfer speeds for Unix-socket and TCP connections?
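
One way to run that comparison is to time the same COPY once over the default Unix socket and once over TCP. The following is only a rough Python sketch: it assumes psql is on the PATH, that both connection types are allowed without a password prompt, and that the table name passed in is a trusted identifier. The helper names psql_copy_cmd and time_copy are made up for illustration. Output goes to /dev/null so that only the connection type differs between the two runs.

```python
import subprocess
import sys
import time
from typing import List, Optional


def psql_copy_cmd(table: str, dbname: str, host: Optional[str] = None) -> List[str]:
    """Build a psql command line that streams a table via COPY TO STDOUT."""
    # -X skips ~/.psqlrc, -q suppresses informational messages.
    # The table name is inserted as-is, so pass only trusted identifiers.
    cmd = ["psql", "-X", "-q", "-d", dbname, "-c", f"COPY {table} TO STDOUT"]
    if host is not None:
        # With -h and a host name psql connects over TCP; without -h it
        # uses the default Unix-domain socket on Unix systems.
        cmd += ["-h", host]
    return cmd


def time_copy(cmd: List[str]) -> float:
    """Run the command, discard its output, and return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start


# Usage (hypothetical script name): python copy_timing.py <dbname> <table>
if __name__ == "__main__" and len(sys.argv) > 2:
    dbname, table = sys.argv[1], sys.argv[2]
    for label, host in (("unix socket", None), ("tcp", "localhost")):
        elapsed = time_copy(psql_copy_cmd(table, dbname, host))
        print(f"{label}: {elapsed:.1f} s")
```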

On a Red Hat box I would try using oprofile to see where the bottleneck
is ... don't know if that's available for Ubuntu.

			regards, tom lane
