Re: Dealing with ordered hierarchies

Alban Hertroys <haramrae@xxxxxxxxx> · Mon, 24 Jul 2017 12:32:59 +0200

> On 24 Jul 2017, at 9:02, Tim Uckun <timuckun@xxxxxxxxx> wrote:
> 
> I have read many articles about dealing with hierarchies in postgres including nested sets, ltree, materialized paths, using arrays as parentage,  CTEs etc but nobody talks about the following scenario.
> 
> Say I have a hierarchy like this
> 
> 1
> 1.1
> 1.1.1
> 1.1.2
> 1.2
> 1.3
> 2
> 2.1
> 
> In this hierarchy the order is very important and I want to run frequent(ish) re-ordering of both subsets and entire trees and even more frequent inserts.

Since they're hierarchies, the order is already in the structure of the data. Do you really need to add it to the data or would it suffice to add it to the query result?

If that's the case, you only need a simple ordering number per branch, like 1, 2, 3, etc. The full path (ie. '1.1.3') gets generated in the query.

I regularly generate structures like your above example using recursive CTE's. The "path" helps to get the results in the correct order for starters (although you're in for a surprise if any of your levels go past 9 in the above). It's great how you can "trickle" all kinds of calculations through the hierarchy using CTE's.

Something like this should help to get you started (untested, I usually do this in Oracle, which has several peculiarities):

with recursive hierarchy (parent, node, sequence_number, path) as (
	select null, node, sequence_number, sequence_number::text from table
	union all
	select h.node, t.node, t.sequence_number, h.path || '.' || t.sequence_number::text
	  from table t
	  join hierarchy h on (t.parent = h.node)
)
select node, path
  from hierarchy

Where the table "table" has fields:
	parent	-- parent node
	node	-- actual node
	sequence_number	-- Order of sequence of this node within its parent branch

You may need to add a surrogate key if your parent/child combinations are otherwise not unique. That would then also be the way to address a node directly (otherwise it would be (parent, node)).

For the sequence_number I'd probably just use an actual sequence generator with a large enough gap to prevent problems with reordering items later on (increment by 10 for example). You will also want to pad the sequence numbers in the "path" column with leading zeroes (otherwise 10 sorts between 1 and 2, etc.), enough that you won't run out of numbers per level.

If you require your sequence numbers to be subsequent in the result: You can add a field with such numbering based on the existing sequence_numbers, by using a windowing function in each branch of the union - it's down to a fairly basic row numbering problem at this point.

> Scenario 1: I want to insert a child into the 1.1 subtree.  The next item should be 1.1.3 and I can't figure out any other way to do this other than to subquery the children and to figure out the max child ID, add one to it which is a race condition waiting to happen.

You would first need to determine which node is the parent node by traversing the hierarchy up to the point of insertion and use the (parent, node) or surrogate key fields to append under. Similar to using '1.1', really.

> Scenario 2: I now decide the recently inserted item is the second most important so I reset the ID to 1.1.2 and then increment 1.1.2 (and possibly everything below).  Again this is both prone to race conditions and involves a heavy update.

No need to bother with that (much) with the above approach. And if you do run out of gaps, you can fairly simply update all the sequence numbers under the same parent without causing concurrency issues and without requiring locks/synchronisation.

> Is there a better way to deal with this or is the complexity unavoidable?

I think it's better, but I don't think its ideal. It's fairly complicated to understand, for one thing, which can cause problems for maintenance (I have colleagues who don't dare to touch my queries, for example).

> I should state that like most database reads will be much more frequent than writes and inserts will be more frequent than updates (re-ordering)

More of the logic (and thus system load) gets moved to the read-side of things, that's probably a drawback, but most of it is just keeping state and counting. I don't expect that to be all that much.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general