Re: Design strategy for table with many attributes

Guyren Howe <guyren@xxxxxxxxx> · Thu, 4 Jul 2024 16:29:13 -0700

Ultimately, the way you should store the data depends on how you will use it. When you retrieve these values, what are the different ways you’ll be using them?

Normalised representations are more flexible, and the pragmatic, performance-based consideration is that all the values in a row are always effectively retrieved together. So if you reasonably often have a query that only accesses the creation date and transaction id, then it will be pretty slow if you are also always retrieving 500 other columns you don’t need.

So: you might often pull all the values *other* than the attributes (creation date, creator, job information, whatever) together. This argues that those values should be in one table, and the attributes in another.

Will you usually be using *all* of the attributes for a particular transaction together in the same operation? It might make sense to store them in eg an array in that case. But this significantly reduces the speed of accessing particular attributes separately.

It is likely that you will want to slice things by particular named attribute across many transactions. This argues for the more normalised form, as does the general principle of doing things in the way that is most general and flexible.

When considering how the data will be used, please consider not only the developers of your current application(s), but also eg data analysts, managers, future applications etc.

The less specific you can be about how you want to use the data, the more you should lean toward fully normalising.

Note also that you can store your data in a normalised and maximally flexible form, but also use triggers or batch jobs to gather various permutations of the data for specific purposes. If you really do, say, both have some major part of your project that uses all the attributes on a given transaction together, but you also have other uses, you may want to store both the normalised/attribute table and the “all the values together” version.

Even if you want to store “all the values together”, it may well be better to use an array, JSON or HStore, rather than having a super-wide table. JSON would eg let you enumerate all the column names (for example) and employ Postgres’s really nice JSON query features.

> On Jul 4, 2024, at 12:37, Lok P <loknath.73@xxxxxxxxx> wrote:
> 
> Hello,
> In one of the applications we are getting transactions in messages/events format and also in files and then they are getting parsed and stored into the relational database. The number of attributes/columns each transaction has is ~900+. Logically they are part of one single transaction and should be stored in one table as one single row. There will be ~500million such transactions each day coming into the system. And there will be approx ~10K peak write TPS and 5K read TPS in target state. This system has a postgres database as a "source of truth" or OLTP store. And then data moves to snowflakes for the olap store.
> 
> Initially when the system was designed the number of attributes per transaction was <100 but slowly the business wants to keep/persist other attributes too in the current system and the number of columns keep growing.
> 
> However, as worked with some database systems , we get few suggestions from DBA's to not have many columns in a single table. For example in oracle they say not to go beyond ~255 columns as then row chaining and row migration type of things are going to hunt us. Also we are afraid concurrent DMLS on the table may cause this as a contention point. So I wanted to understand , in such a situation what would be the best design approach we should use irrespective of databases? Or say, what is the maximum number of columns per table we should restrict? Should we break the single transaction into multiple tables like one main table and other addenda tables with the same primary key to join and fetch the results wherever necessary?
> 
> Regards
> Lok