Re: Puzzline CROSS JOIN when doing RECURSIVE CTE

Alban Hertroys <haramrae@xxxxxxxxx> · Mon, 18 Apr 2022 21:14:11 +0200

> On 18 Apr 2022, at 14:51, Pól Ua Laoínecháin <linehanp@xxxxxx> wrote:
> 
> Hi Alban, and many thanks for your input.
> 
>> My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a tree), unless perhaps you intended to actually traverse the HTML document hierarchy?
> 
> This is basically an exercise on my part.
> 
> The question that I'm trying to answer  is here:
> 
> https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql
> 
> I've already answered it in 3 different ways - but I was trying to do
> it with RCTEs in order to improve my comprehension of them.
> 
> So, basically, I want to pick out a subsection of text from a "passage".
> 
> So then, I wanted to establish a true/false state for the lines that I
> want and don't want, going through line by line. I know that the RCTE
> is  a very contrived way of doing this, but it's for learning really.

Considering that you’re already looking at the elements of a parsed DOM tree, the exercise boils down to traversing that tree. Due to how xmlparse() is implemented, you probably already get them in the right order even when not using an explicit order by. That is, if you’re looking for a DFT (depth first traversal) as opposed to a BFT (breadth first).

One of the difficulties here is that there are some CDATA sections involved with more XML in them. My guess is that that’s the data that you’re actually after, but that’s just a matter of entering the document with the correct path I suppose?

> I wonder if you could be so kind as to give me  a "skeleton" RCTE for
> this - I've been staring at this for hours - and it's not that I'm
> lazy or haven't studied RCTEs - I wrote this RCTE
> 
> https://stackoverflow.com/a/71674990/470530
> 
> recently, so it's not as if I'm completely ignorant of RCTEs - I'm
> just stuck in a rut. Any help would be appreciated.

You would first need to determine the root node(s). Those are the ones w/o parents, or you may have some other way of determining those.

Next is finding all nodes that have an earlier node as their parent.
You could go an extra step here with preserving the order of the siblings in the document, by numbering nodes (directly) under the same parent.
I usually build an ltree structure with that information, while traversing the tree - that gets you an ltree with entries (1, 1.1, 1.1.1, 1.1.2, 1.2.1, etc) that you then can use for the final order by, for example.

In case you didn’t know, ltree is a module you can install. I find it still very useful in tree traversals. The one drawback I see is that for these scenario’s you’d ideally want an ltree based on integers, such that 10 sorts after 9 instead of between 1 and 2. Padding enough zeroes before the ltree text items is a bit of an extra hassle that I’d prefer to do without.

I haven’t actually looked at what DOM navigation functions exist for PG, so this is more or less pseudo code. Worse, my local copy of PG was compiled w/o XML support, so I don’t know what kind of result the query from that SO article produces. But then again, I don’t really know what you’re after anyway, so...

This is basically how I would go about it.

with recursive
-- First we need to get the DOM-tree parsed (this is not actually recursive)
domtree as (
	select node
	  from xmlparse(document(‘<root>...</root>'))
),
-- Next we can traverse it
cte (node, hierarchy, n) as (
	select node, 1::text::ltree, 1
	  from domtree
	 where parent(node) is null

	union all

	select node, cte.hierarchy || (cte.n+1)::text::ltree, n+1
	  from domtree t
	  join cte on parent(t.node) = cte.node
)
select *
  from cte
 order by hierarchy;

Function parent() is made-up. It would return the parent node of a node, so that there is some way to connect the different parts in the hierarchy. I guess xpath() could fulfil that purpose, but I have no way of testing that hypothesis.

I hope that’s a good enough starting point for you?

Alban Hertroys
--
There is always an exception to always.