On Thu, Jun 10, 2021 at 7:23 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Wed, Jun 09, 2021 at 10:55:08PM -0500, Don Seiler wrote:
> On Wed, Jun 9, 2021, 21:03 P C <puravc@xxxxxxxxx> wrote:
>
> > I agree, it's confusing for many, and that confusion arises from the fact
> > that you usually talk about shared_buffers in MB or GB, whereas huge pages
> > have to be configured in units of 2MB. But once people understand that,
> > they realize it's pretty simple.
> >
> > Don, we have experienced the same thing, not just with Postgres but also
> > with Oracle. I haven't been able to get to the root of it, but what we
> > usually do is add another 100-200 pages, and that works for us. If the SGA
> > or shared_buffers is large, e.g. 96GB, then we add 250-500 pages. Those few
> > hundred MB may be wasted (because the moment you configure huge pages, the
> > operating system considers them used and will not use them for anything
> > else), but nowadays servers easily have 64 or 128 GB of RAM, and wasting
> > 500MB to 1GB does not really hurt.
>
> I don't have a problem with the math; I just wanted to know if it was
> possible to better estimate what the actual requirements would be at
> deployment time. My fallback will probably be to do what you did and just
> pad with an extra 512MB by default.
It's because the huge page allocation isn't just shared_buffers; it also
includes wal_buffers (a short sketch of how the -1 default resolves follows
the excerpt below):
| The amount of shared memory used for WAL data that has not yet been written to disk.
| The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, ...
.. and other stuff:
src/backend/storage/ipc/ipci.c
/*
 * Size of the Postgres shared-memory block is estimated via
 * moderately-accurate estimates for the big hogs, plus 100K for the
 * stuff that's too small to bother with estimating.
 *
 * We take some care during this phase to ensure that the total size
 * request doesn't overflow size_t. If this gets through, we don't
 * need to be so careful during the actual allocation phase.
 */
size = 100000;
size = add_size(size, PGSemaphoreShmemSize(numSemas));
size = add_size(size, SpinlockSemaSize());
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size());
size = add_size(size, BufferShmemSize());
size = add_size(size, LockShmemSize());
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
size = add_size(size, BackgroundWorkerShmemSize());
size = add_size(size, MultiXactShmemSize());
size = add_size(size, LWLockShmemSize());
size = add_size(size, ProcArrayShmemSize());
size = add_size(size, BackendStatusShmemSize());
size = add_size(size, SInvalShmemSize());
size = add_size(size, PMSignalShmemSize());
size = add_size(size, ProcSignalShmemSize());
size = add_size(size, CheckpointerShmemSize());
size = add_size(size, AutoVacuumShmemSize());
size = add_size(size, ReplicationSlotsShmemSize());
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, SnapMgrShmemSize());
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
/* freeze the addin request size and include it */
addin_request_allowed = false;
size = add_size(size, total_addin_request);
/* might as well round it off to a multiple of a typical page size */
size = add_size(size, 8192 - (size % 8192));
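As a side note on the wal_buffers line quoted above, here is a minimal Python
sketch of how the -1 default resolves, assuming the documented clamp between
64kB and one WAL segment (typically 16MB); the function name is mine:

def default_wal_buffers_mb(shared_buffers_mb, wal_segment_mb=16):
    # wal_buffers = -1 means 1/32nd of shared_buffers, but no less than
    # 64kB and no more than one WAL segment (typically 16MB).
    return max(64 / 1024, min(shared_buffers_mb / 32, wal_segment_mb))

# e.g. default_wal_buffers_mb(2048) -> 16; any shared_buffers of 512MB or
# more hits the 16MB cap.
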
BTW, I think it'd be nice if this were a NOTICE:
| elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", allocsize);
Great detail. I did some trial and error around just a few variables (shared_buffers, wal_buffers, max_connections) and came up with a formula that seems to be "good enough" for at least a rough default estimate.
The pseudo-code is basically:
ceiling((shared_buffers + 200 + (25 * shared_buffers/1024) + 10*(max_connections-100)/200 + wal_buffers-16)/2)
This assumes that all values are in MB and, obviously, that wal_buffers is set explicitly rather than left at the default of -1. I decided to default wal_buffers to 16MB in our environments, since that's what -1 should resolve to, per the description in the documentation, for instances with shared_buffers at the sizes we deploy.
The formula did come up a little short (by 2MB) when I had a low shared_buffers value of 2GB; raising that starting 200 value to something like 250 would take care of that. Otherwise it held up in the limited testing I did against the different values we see across our production deployments. Please let me know what you folks think. I know I'm ignoring a lot of other factors, especially given what Justin recently shared.
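If it helps anyone play with the numbers, here is that pseudo-code as a small
Python sketch (my own transcription of the formula above, all sizes in MB;
the parameter names are mine):

import math

def estimate_hugepages(shared_buffers_mb, max_connections=100, wal_buffers_mb=16):
    # Estimated shared memory footprint in MB, per the pseudo-code above:
    # shared_buffers plus a flat 200MB pad, ~25MB per GB of shared_buffers,
    # a small bump per extra connection, and any wal_buffers beyond 16MB.
    size_mb = (shared_buffers_mb
               + 200
               + 25 * shared_buffers_mb / 1024
               + 10 * (max_connections - 100) / 200
               + (wal_buffers_mb - 16))
    return math.ceil(size_mb / 2)   # number of 2MB huge pages

# e.g. estimate_hugepages(8192) -> 4296 pages for shared_buffers = 8GB
# with defaults for the rest.
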
The remaining trick for me now is to calculate this in Chef, since our shared_buffers and wal_buffers attributes are strings with the unit ("MB") in them rather than plain numeric values. I'm thinking of changing those attributes to be purely numeric and assume/require MB, to make the calculations easier.
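The recipe itself will be Ruby, but the parsing step is trivial either way;
purely as an illustration (the helper name is mine), in Python it would be
something like:

def mb_value(attr):
    # Turn an attribute string like "8192MB" into an integer number of MB;
    # a bare number is passed through unchanged.
    s = str(attr).strip().upper()
    return int(s[:-2]) if s.endswith("MB") else int(s)

# e.g. mb_value("8192MB") -> 8192
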
Don Seiler
www.seiler.us