On 1/2/24 11:23 PM, arun chirappurath wrote:
> Do we have any open source tools which can be used to create sample data
> at scale from our postgres databases?
> Which considers data distribution and randomness

I would suggest using the most common tools whenever possible, because then if you want to discuss results with other people (for example on these mailing lists), you're working with data sets that are widely and well understood.

The most common tool for PostgreSQL is pgbench, which creates a TPC-B-like schema that you can scale to any size: always the same [small] number of tables/columns and the same uniform data distribution, with relationships between the tables so you can create FKs if needed.

My second favorite tool is sysbench. Any number of tables, easily scaled to any size, with a standardized schema that has a small number of columns and no relationships/FKs. The data distribution is uniformly random; however, on the query side it supports a number of different distribution models, not just uniform random, as well as queries that process ranges of rows.

The other tool I'm intrigued by these days is benchbase from CMU. It can do TPC-C and a bunch of other schemas/workloads, and you can scale the data sizes. If you're just looking at data generation and you're going to build your own workloads, benchbase has a lot of different schemas available out of the box.

You can always hand-roll your schema and data with scripts & SQL, but the more complex and bespoke your performance test schema is, the more work and explaining it takes to get people to engage in a discussion, since they need to take time to understand how the test is engineered. For very narrowly targeted reproductions this is usually the right approach, with a very simple schema and workload, but it's not common for general performance testing.

(I've tacked some rough quick-start command sketches onto the end of this mail.)

-Jeremy

--
http://about.me/jeremy_schneider
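
P.S. A few rough quick-start sketches, in case they help. The database name "mydb", user names, and all the sizes below are placeholders, and the flags are from memory, so double-check the docs for the versions you're running.

For pgbench, initialization and a basic run look something like:

  # create the TPC-B-like schema at scale factor 100 (~10 million pgbench_accounts rows);
  # --foreign-keys adds the FK constraints between the standard tables
  pgbench -i -s 100 --foreign-keys mydb

  # run the default workload: 10 clients, 2 worker threads, 60 seconds
  pgbench -c 10 -j 2 -T 60 mydb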
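
For sysbench with the bundled oltp_read_write script against PostgreSQL (connection options are placeholders; --rand-type is what switches the query-side access pattern away from uniform):

  # load 16 tables of 1M rows each
  sysbench oltp_read_write --db-driver=pgsql --pgsql-db=mydb --pgsql-user=me \
      --tables=16 --table-size=1000000 prepare

  # run for 60 seconds with a skewed (pareto) access pattern instead of uniform
  sysbench oltp_read_write --db-driver=pgsql --pgsql-db=mydb --pgsql-user=me \
      --tables=16 --table-size=1000000 --rand-type=pareto --time=60 run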
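
For benchbase, after building the PostgreSQL target, the general pattern is something like the following (the sample config path/name may differ between versions):

  # create the schema, load the data, and execute the TPC-C workload
  java -jar benchbase.jar -b tpcc -c config/postgres/sample_tpcc_config.xml \
      --create=true --load=true --execute=true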
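
And if you do hand-roll data, generate_series plus random() goes a long way. This sketch (table and column names made up for the example) loads 10M rows with a deliberately non-uniform user_id distribution:

  psql mydb -c "
    CREATE TABLE demo_events (id bigint, user_id int, amount numeric, created_at timestamptz);
    INSERT INTO demo_events
    SELECT g,
           (random() * random() * 100000)::int,   -- product of two uniforms: skewed toward small ids
           round((random() * 1000)::numeric, 2),
           now() - random() * interval '365 days'
    FROM generate_series(1, 10000000) AS g;
  "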