Re: How To: A large [2D] matrix, 100,000+ rows/columns

Alban Hertroys <haramrae@xxxxxxxxx> · Fri, 9 Jun 2023 08:59:25 +0200

> On 9 Jun 2023, at 04:17, Pat Trainor <pat.trainor@xxxxxxxxx> wrote:

(…)

> Imagine something akin to stocks, where you have a row for every stock, and a column for every stock. Except where the same stock is the row & col, a number is at each X-Y (row/column), and that is the big picture. I need to have a very large matrix to maintain & query, and if not (1,600 column limit), then how could such data be broken down to work?

If your matrix contains values that are all of the same type, as matrices usually do, then a matrix can be described as the Carthesian product of rows and columns, with values connecting those.

For rows and columns you could enumerate them using generate_series() or a pair of recursive CTEs, or you could put them into their own table.
For the values (or cells), a tuple of (row, column, value) would be sufficient.

Then in the end, the matrix would be a presentation of the left joins of the Carthesian product of rows and columns with your cell values. The left joins are to account for missing cell values (empty cells), or you could explicitly add tuples for those with an ‘empty’ value.

For presentation, I would use something like Python Pandas and the xlsxwriter.

Data-entry is going to be a bit of a pain if you cannot automate it, and it’s not going to be very space-efficient, but it does fit the relational model this way and it would be easy to expand the matrix in either direction.

> By wanting postgresql as a solution, am I round-hole, square-pegging myself?

I expect that there are solutions that were explicitly designed for handling (large) matrices and that those would perhaps perform better.

> I don't mind keeping, say, a 1,500 column max per table, but then adding new items (stocks in the above analogy) might make it difficult to keep track of things... 

That’s also a possibility, but that sort of pushes the column lookups down to the catalog level and induces overhead on all other catalog lookups as well. It’s not a huge amount though.

An alternative approach would be a table of rows (or columns) with each an array of values, especially since you don’t expect many updates. That would be a first attempt at optimisation if the pure relational model doesn’t work out.

> Hoping someone has tackled this before, and performance isn't too big a concern, as the data changes seldom.

Not in practice, AFAIR, but I was thinking of a solution like this for small matrices (Sudoku’s, I hate the things, but I need some solutions to a few to help me with a number of test scripts).

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.