On Sun, Aug 30, 2009 at 7:38 PM, Greg Stark <gsstark@xxxxxxx> wrote:
> On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
>> 192k written
>> raid 10: six writes
>> raid 5: four writes, one read (but the read and one of the writes is
>> same physical location)
>>
>> now, by 'same physical' location, that may mean that the drive head
>> has to move if the data is not in cache.
>>
>> I realize that many raid 5 implementations tend to suck. That said,
>> raid 5 should offer higher theoretical performance for writing than
>> raid 10, both for sequential and random.
>
> In the above there are two problems.
>
> 1) 192kB is not a random access pattern. Any time you're writing a
> whole raid stripe or more, RAID5 can start performing reasonably,
> but that's not random, that's sequential i/o. The relevant random
> i/o pattern is writing 8kB chunks at random offsets into
> multi-terabyte storage which doesn't fit in cache.
>
> 2) It's not clear, but I think you're saying "but the read and one
> of the writes is same physical location" on the basis that this
> mitigates the costs. In fact it's the worst case. It means that
> after doing the read and calculating the parity block, the drive
> must spin a full rotation before being able to write it back out.
> So instead of an average latency of 1/2 of a rotation you have that
> plus a full rotation, or 3x as much latency before the write can be
> performed as without raid5.
>
> It's not a fault of the implementations; it's a fundamental problem
> with RAID5. Even a spectacular implementation of RAID5 will be
> awful for random access writes. The only saving grace some hardware
> implementations have is huge amounts of battery-backed cache, which
> means they can usually buffer all the writes for long enough that
> the access patterns no longer look random. If you buffer enough
> then you can hope you'll eventually overwrite the whole stripe and
> can write out the new parity without reading the old data. Or,
> failing that, you can perform the reads of the old data when it's
> convenient because you're reading nearby data, effectively turning
> it into sequential i/o.

I agree; that's good analysis. The main point I was making is that if
you have, say, a ten-disk raid 5, a random write doesn't involve all
ten disks, only two... a very common misconception. I also made
another mistake that you didn't catch: you need to read *both* the
data drive and the parity drive before writing, not just the parity
drive.

I wonder if flash SSDs are a better fit for raid 5, since reads are
much cheaper than writes and there is no rotational latency. (Also,
$/GB is different, and so are the failure cases.)

merlin
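
P.S. To make the arithmetic concrete, here is a rough Python sketch of
the cost model above for a single random 8kB write. The 7200 rpm
figure, the assumption that paired reads/writes land on different
spindles in parallel, and the omission of seek time and controller
cache are all mine; treat it as illustrative, not a benchmark:

# Rough model of one random 8kB write; ignores seek time and any
# controller caching. 7200 rpm is an assumed, illustrative figure.
ROTATION_MS = 60000.0 / 7200             # one full rotation: ~8.3 ms
AVG_ROT_LATENCY_MS = ROTATION_MS / 2     # avg wait for target sector

# RAID10: write the block to both disks of a mirror pair. The two
# writes hit different spindles, so they proceed in parallel.
raid10_ios = 2                           # 2 writes
raid10_latency_ms = AVG_ROT_LATENCY_MS   # ~4.2 ms

# RAID5 read-modify-write: read old data and old parity (on two
# different spindles, in parallel), XOR in the new data, then write
# both blocks back. Each write targets the sector just read, so the
# platter must complete a full extra rotation before the head can
# write it.
raid5_ios = 2 + 2                        # 2 reads + 2 writes
raid5_latency_ms = AVG_ROT_LATENCY_MS + ROTATION_MS   # ~12.5 ms

print("raid10: %d i/os, ~%.1f ms" % (raid10_ios, raid10_latency_ms))
print("raid5:  %d i/os, ~%.1f ms (%.0fx the latency)"
      % (raid5_ios, raid5_latency_ms,
         raid5_latency_ms / raid10_latency_ms))

On those assumptions, raid 5 costs twice the physical i/os and roughly
3x the rotational latency per random write, which is Greg's point (2).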