On 09/26/2018 07:15 PM, Vladimir Ryabtsev wrote:

>> Since you have a very big toast table, given you are using spinning
>> disks, I think that increasing the block size will bring benefits.
> But will it worsen caching? I will have lesser slots in cache. Also will
> it affect required storage space?

I think in your case it will not worsen the cache. You will have fewer
slots in the cache, but the total available cache will be unchanged
(half as many blocks of double the size).

It could affect storage space for the smaller records. Much depends on
which block size you choose and how your data is actually distributed
within the ranges you mentioned (e.g. the 10-20 KB range might lean
more toward the 10 KB side or more toward the 20 KB side).

Imagine you request a record of 24 KB and you are using 8 KB blocks.
That results in 3 different blocks being looked up, requested and
returned. Those 3 blocks might be scattered on disk, possibly causing
3 separate seeks. Having it all in one block avoids this problem. The
downside is that if you need to store only 8 KB of data, you will still
allocate 24 KB.

You say you do not do updates, so it might also be the case that when
you write the data all at once (24 KB in one go) it all lands in one
contiguous stripe. In that case a block size change will bring nothing.
This is very much data and usage driven.

Changing the block size is a painful thing, because IIRC you can only
set it when the server is built, so it means re-initializing the
cluster and reloading the data (see the sketch at the bottom of this
mail).

Similarly, if your RAID controller uses for instance 128 KB stripes,
each time you read one 8 KB block it will return a whole 128 KB chunk,
which is quite a waste of resources. If your 'slow' range is fragmented
here and there on the disk, not having proper alignment between
Postgres blocks, filesystem and RAID can worsen the problem by orders
of magnitude. This is very true on spinning disks, where the seek time
is noticeable.

Note that setting a very small block size has the opposite effect: you
might hit the IOPS limit of your hardware and create a bottleneck
(been there while benchmarking some new hardware).

But before going through all this, I would first try to reload the data
with dump+restore into a new machine and see how it behaves (also
sketched at the bottom).

Hope it helps.

regards,

fabio pardi


>>> consecutive runs with SAME parameters do NOT hit the disk, only the
>>> first one does, consequent ones read only from buffer cache.
>> I'm a bit confused.. every query you pasted contains 'read':
>> Buffers: shared hit=50 read=2378
>> and 'read' means you are reading from disk (or OS cache). Or not?
> Yes, sorry, it was just my misunderstanding of what is "consecutive". To
> make it clear: I iterate over all data in the table with one request and
> different parameters on each iteration (e.g. + 5000 both borders), in
> this case I get disk reads on each query run (many more reads on the
> "slow" range). But if I request data from an area queried previously, it
> reads from cache and does not hit disk (both ranges). E.g. iterating over
> 1M records with an empty cache takes ~11 minutes in the "fast" range and
> ~1 hour in the "slow" range, while the second time it takes only ~2
> minutes for both ranges (if I don't do drop_caches).
>
> Regards,
> Vlad
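
In case it is useful, here is a minimal sketch of what changing the
block size involves. The 16 KB value and the paths are only examples,
not a recommendation:

    -- current block size (fixed at build time, 8192 bytes by default)
    SHOW block_size;

    # rebuild the server with 16 KB blocks
    ./configure --with-blocksize=16 --prefix=/usr/local/pgsql-16k
    make && make install

    # a new cluster must be initialized with the new binaries,
    # then the data reloaded from a dump
    /usr/local/pgsql-16k/bin/initdb -D /var/lib/pgsql/data-16k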
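
And the dump+restore test I mentioned, roughly. The database name
'articles' is just a placeholder, and the number of restore jobs
depends on your hardware:

    # on the current machine: custom-format dump of the database
    pg_dump -Fc -d articles -f articles.dump

    # on the new machine: create the database and restore in parallel
    createdb articles
    pg_restore -d articles -j 4 articles.dump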
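
On the 'read' vs OS cache question in the quoted part: one way to tell
them apart is to enable I/O timing, so EXPLAIN reports how long the
'read' blocks actually took (very fast reads usually came from the OS
cache, slow ones from disk). Table and column names below are only
placeholders:

    -- requires superuser; adds "I/O Timings" to the BUFFERS output
    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM articles
      WHERE article_id BETWEEN 1000000 AND 1005000;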