Re: Incredibly slow restore times after 9.0>9.2 upgrade

Tomas Vondra <tv@xxxxxxxx> · Thu, 30 Oct 2014 20:08:36 +0100

On 29.10.2014 16:12, jmcdonagh wrote:
> Hi Tomas- thank you for your thoughtful response!
> 
> 
> Tomas Vondra wrote
>> On 28.10.2014 21:55, jmcdonagh wrote:
>>> Hi, we have a nightly job that restores current production data to
>>> the development databases in a 'warm spare' database so that if the
>>> developers need fresh data, it's ready during the day. When we moved
>>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>>> volumes, warmed them up, threw IOPS at them, pretty much all the
>>> standard stuff to get more disk performance.
>>
>> So, if I understand it correctly, you've been restoring into 9.0, then
>> you switched to 9.2 and it's much slower?
> 
> Yes- but since the move was done utilizing snapshots so the move
> involves new volumes, but I have created new volumes since then to
> rule out a single bad volume.

My advice would be to do some basic low-level performance tests to rule
this out. Use dd or (better) fio to test basic I/O performance, it's
much easier to spot issues that way.

> Tomas Vondra wrote
>> Is the 9.2 configured equally to 9.0? If you do something like this
>>
>>   SELECT name, setting
>>     FROM pg_settings
>>    WHERE source = 'configuration file';
>>
>> on both versions, what do you get?
> 
> I no longer have the 9.0 box up but we do track configuration via
> puppet and git. The only configuration change made for 9.2 is:
> 
> -#standard_conforming_strings = off
> +standard_conforming_strings = off

Compared to 9.0, I suppose? Anyway, post the non-default config values
at least for 9.2, please.

> Cause we have an old app that needs this setting on otherwise we'd
> spend a lot of time trying to fix it.

I doubt standard_conforming_strings has anything to do with the issues.

> Tomas Vondra wrote
>>> Here's the thing, the disk isn't saturated. The behavior I'm seeing 
>>> seems very odd to me; I'm seeing the source disk which holds the dump
>>> saturated by reads, which is great, but then I just see nothing being
>>> written to the postgres volume. Just nothing happening, then a
>>> small burst. There is no write queue backup on the destination disk
>>> either. if I look at pg_stat_activity I'll see something like:
>>>
>>> COPY salesforce_reconciliation (salesforce_id, email,
>>> advisor_salesforce_id, processed) FROM stdin
>>>
>>> and even for small tables, that seems to take a very long time even
>>> though the destination disk is almost at 0 utilization.
>>
>> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a
>> disk or something else? Or maybe network, because you're using EBS?
>>
>> What do you mean by 'utilization'? How do you measure that?
> 
> The bottleneck is I/O somehow. I say somehow, because I see iowait 
> averaging about 50% between two CPUs, but there is just no writes to 
> the destination EBS volume really happening, just reads from the
> disk where the source dump is located, then bursts of writes to the
> destination volume every so often. It's kind of puzzling. This is
> happening on multiple database servers, in multiple availability
> zones. Driving me bonkers.
> 
> What I mean by utilization is util% from iostat -m -x 1.

I find this rather contradictory. At one moment you say the disk isn't
saturated, the next moment you say you're I/O bound.

Also, iowait (as reported e.g. by 'top') is tricky to interpret
correctly, especially on multi-cpu systems (nice intro to the complexity
[1]). It's really difficult to interpret the 50% iowait without more
info about what's happening on the machine.

IMHO, the utilization (as reported by iotop) is much easier to
interpret, because it means '% of time the device was servicing
requests'. It has issues too, because 100% does not mean 'saturated'
(especially on RAID arrays that can service multiple requests in
parallel), but it's better than iowait.

If I had to guess based from your info, I'd bet you're CPU bound, so
there's very little idle time and about 50% of it is spent waiting for
I/O requests (hence the 50% iowait). But in total the amount of I/O is
very small, so %util is ~0.

Please, post a few lines of 'iostat -x -k 1' output. Samples from 'top'
and 'vmstat 1' would be handy too.

regards
Tomas

[1] http://veithen.blogspot.cz/2013/11/iowait-linux.html

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance