On 09/26/2018 07:15 PM, Vladimir Ryabtsev wrote:

>> Since you have a very big toast table, given you are using spinning
>> disks, I think that increasing the block size will bring benefits.
> But will it worsen caching? I will have lesser slots in cache. Also will
> it affect required storage space?

I think in your case it will not worsen the cache. You will have fewer
slots in the cache, but the total available cache will be unchanged
(half as many blocks of double the size).

It could affect storage space for the smaller records. Much depends on
which block size you choose and how your data is actually distributed
within the ranges you mentioned (e.g. the 10-20 KB range might lean
more toward the 10 KB side or more toward the 20 KB side).

Imagine you request a record of 24 KB and you are using 8 KB blocks.
That results in 3 different blocks being looked up, requested and
returned. Those 3 blocks might be scattered on disk, possibly causing
3 separate seeks. Having it all in one block avoids this problem. The
downside is that if you need to store only 8 KB of data, you will still
allocate 24 KB.

You say you do not do updates, so it might also be the case that when
you write the data all at once (24 KB in one go) it all lands in one
contiguous stripe. In that case a block size change will bring nothing.
This is very much data and usage driven.

Changing the block size is a painful thing, because IIRC you can only
set it when the server is built, so it means re-initializing the
cluster and reloading the data (see the sketch at the bottom of this
mail).

Similarly, if your RAID controller uses for instance 128 KB stripes,
each time you read one 8 KB block it will return a whole 128 KB chunk,
which is quite a waste of resources. If your 'slow' range is fragmented
here and there on the disk, not having proper alignment between
Postgres blocks, filesystem and RAID can worsen the problem by orders
of magnitude. This is very true on spinning disks, where the seek time
is noticeable.

Note that setting a very small block size has the opposite effect: you
might hit the IOPS limit of your hardware and create a bottleneck
(been there while benchmarking some new hardware).

But before going through all this, I would first try to reload the data
with dump+restore into a new machine and see how it behaves (also
sketched at the bottom).

Hope it helps.

regards,

fabio pardi


>>> consecutive runs with SAME parameters do NOT hit the disk, only the
>>> first one does, consequent ones read only from buffer cache.
>> I'm a bit confused.. every query you pasted contains 'read':
>> Buffers: shared hit=50 read=2378
>> and 'read' means you are reading from disk (or OS cache). Or not?
> Yes, sorry, it was just my misunderstanding of what is "consecutive". To
> make it clear: I iterate over all data in the table with one request and
> different parameters on each iteration (e.g. + 5000 both borders), in
> this case I get disk reads on each query run (many more reads on the
> "slow" range). But if I request data from an area queried previously, it
> reads from cache and does not hit disk (both ranges). E.g. iterating over
> 1M records with an empty cache takes ~11 minutes in the "fast" range and
> ~1 hour in the "slow" range, while the second time it takes only ~2
> minutes for both ranges (if I don't do drop_caches).
>
> Regards,
> Vlad
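
In case it is useful, here is a minimal sketch of what changing the
block size involves. The 16 KB value and the paths are only examples,
not a recommendation:

    -- current block size (fixed at build time, 8192 bytes by default)
    SHOW block_size;

    # rebuild the server with 16 KB blocks
    ./configure --with-blocksize=16 --prefix=/usr/local/pgsql-16k
    make && make install

    # a new cluster must be initialized with the new binaries,
    # then the data reloaded from a dump
    /usr/local/pgsql-16k/bin/initdb -D /var/lib/pgsql/data-16k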
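
And the dump+restore test I mentioned, roughly. The database name
'articles' is just a placeholder, and the number of restore jobs
depends on your hardware:

    # on the current machine: custom-format dump of the database
    pg_dump -Fc -d articles -f articles.dump

    # on the new machine: create the database and restore in parallel
    createdb articles
    pg_restore -d articles -j 4 articles.dump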
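
On the 'read' vs OS cache question in the quoted part: one way to tell
them apart is to enable I/O timing, so EXPLAIN reports how long the
'read' blocks actually took (very fast reads usually came from the OS
cache, slow ones from disk). Table and column names below are only
placeholders:

    -- requires superuser; adds "I/O Timings" to the BUFFERS output
    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM articles
      WHERE article_id BETWEEN 1000000 AND 1005000;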