On 26/09/16 05:50, Greg Spiegelberg wrote:
Hey all,
Obviously, everyone who has been around PostgreSQL, or almost any RDBMS, for any length of time has said not to have millions of tables. I too long believed it, until recently.
AWS d2.8xlarge instance running 9.5 is my test rig, using XFS on EBS (io1)
for PGDATA. Over the weekend, I created 8M tables with 16M indexes on
those tables. Table creation initially took 0.018031 seconds, averaged
0.027467 seconds, and after tossing out the outliers (qty 5) the maximum
creation time found was 0.66139 seconds. Total time: 30 hours, 31 minutes
and 8.435049 seconds. Tables were created by a single process. Do note
that table creation is done via a plpgsql function, as there are other
housekeeping tasks necessary, though minimal.
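To give a sense of the shape of that function (a simplified sketch only; the parent, column, and suffix names below are illustrative, and the real function does the extra housekeeping), each call boils down to a dynamic CREATE TABLE plus the two indexes that account for the 16M index total:

  CREATE OR REPLACE FUNCTION create_data_table(p_suffix text)
  RETURNS void
  LANGUAGE plpgsql AS
  $$
  BEGIN
    -- one empty child per suffix; columns are inherited from the parent
    EXECUTE format('CREATE TABLE %I () INHERITS (parent_table)',
                   'data_' || p_suffix);
    -- two indexes per table, hence 16M indexes for 8M tables
    EXECUTE format('CREATE INDEX ON %I (id)', 'data_' || p_suffix);
    EXECUTE format('CREATE INDEX ON %I (created_at)', 'data_' || p_suffix);
  END;
  $$;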
No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48MB
max_stack_depth = 4MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on its own file system
There are still some obvious problems. General DBA functions such as
VACUUM and ANALYZE should not be run; each will run forever and
cause much grief. Backups are problematic in the traditional pg_dump
and PITR space. Large JOINs via VIEW, SELECT, or table inheritance
(which I am abusing in my test case) are no-nos. A system or database
crash could potentially take hours to days to recover from. There are
likely other issues ahead.
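To make the inheritance point concrete (the child table names here are purely illustrative): anything planned through the parent has to consider every one of the 8M children, while a fetch aimed at a single known child stays an ordinary one-table index lookup:

  -- through the parent: the planner must look at every child; out of the question with 8M of them
  SELECT * FROM parent_table WHERE id = 42;

  -- straight at one child: a plain single-table index scan
  SELECT * FROM data_000042 WHERE id = 42;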
You may wonder, "Why is Greg attempting such a thing?" I looked at
DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face
it, it's antiquated, and don't get me started on "Hadoop". I looked at
many others, and ultimately each vendor's recommended use was to
have one table for all data. That overcomes the millions-of-tables
problem, right?
The problem with the "one big table" solution is that I anticipate 1,200
trillion records. Random access is expected, and the customer expects
<30ms reads for a single-record fetch.
No data is loaded... yet. Table and index creation only. I am
interested in everyone's opinions, including tests I might perform. If
you had this setup, what would you capture / analyze? I have a job
running that is preparing data. I did this on a much smaller scale (50k
tables), and data load via a function allowed close to 6,000
records/second. The schema has been simplified since, and the last test
reached just over 20,000 records/second with 300k tables.
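The loader is a plpgsql function as well; stripped down (and again with illustrative names and an illustrative key-to-table mapping), it routes each record to its child table with a dynamic INSERT:

  CREATE OR REPLACE FUNCTION load_record(p_key bigint, p_ts timestamptz, p_val numeric)
  RETURNS void
  LANGUAGE plpgsql AS
  $$
  BEGIN
    -- pick the child from the key (modulo the table count, e.g. 300k in the last test)
    -- and insert; $1..$3 are bound via USING rather than spliced into the string
    EXECUTE format('INSERT INTO %I (id, created_at, val) VALUES ($1, $2, $3)',
                   'data_' || lpad((p_key % 300000)::text, 6, '0'))
      USING p_key, p_ts, p_val;
  END;
  $$;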
I'm not looking for alternatives yet, but for input on my test. Takers?
I can't promise immediate feedback, but I will do my best to respond with
results.
TIA,
-Greg
Hi Greg.
This is a problem (creating a large number of tables; really large
indeed) that we researched at my company a while ago. You might want to
read about it: https://www.pgcon.org/2013/schedule/events/595.en.html
Cheers,
Álvaro
--
Álvaro Hernández Tortosa
-----------
8Kdata