Re: Trees: integer[] outperformed by ltree

Merlin Moncure <mmoncure@xxxxxxxxx> · Wed, 6 Nov 2013 08:36:09 -0600

On Tue, Nov 5, 2013 at 6:27 PM, Jan Walter <john@xxxxxxxxxxxxxxxx> wrote:
> On 5.11.2013 23:19, Merlin Moncure wrote:
> On Tue, Nov 5, 2013 at 3:52 PM, Jan Walter <john@xxxxxxxxxxxxxxxx> wrote:
> On 5.11.2013 20:51, Merlin Moncure wrote:
> On Tue, Nov 5, 2013 at 11:30 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Tue, Nov 5, 2013 at 6:25 AM, Jan Walter <john@xxxxxxxxxxxxxxxx> wrote:
>
> I am in a need of a very robust (esp. fast in read, non-blocking in update)
> tree structure storage (95% trees are of depth <4, current max. is 12). We
> have 10k-100k trees now, millions in the future.
> I made many tests, benchmarks of usual operations, and after all,
> materialized path ('1.5.3' path notation) seems most promising.
>
> materialized path approaches tend to be ideal if the tree remains
> relatively static and is not too deep.  The downside with matpath is
> that if a you have to move a node around in the tree, then all the
> child elements paths' have to be expensively updated.  I bring this up
> as it relates to your 'non-blocking in update' requirement: in matpath
> an update to parent can update an unbounded number of records.
>
>
> Thanks for your remark.
> Materialized path is still better in updates than nested sets we are using
> currently.
> Although adjacency lists with recursive CTEs were initially my favorite
> substitution (smallest lock space for node relocation), whey are completely
> failing in e.g. order by path (depth) task (150s vs. 31ms via integer[]),
> and twice slower in simple descendant-based tasks.
> I am yet considering it (if I moved e.g. ordering to application server
> level), and still need to rewrite couple of more sophisticated scenarios to
> CTEs to be absolutely sure if it fails; anyway MP seems more promising.
> I also tend to have the tree structure completely independent on other data
> belonging to nodes.
>
> Or did you have any other model in your mind?
>
>
> hm, why do you need gist/gin for the int[] formulation?  what are your
> lookup requirements?  with int[] you can typically do contains with
> simple btree.
>
>
> I do not think so, i.e. neither my tests nor
> http://www.postgresql.org/docs/current/static/indexes-types.html showed it
> works for <@ or @>. Using it for <= operator puts btree index to query plan
> in some (not all) scenarios. Still it needs to be accompanied with <@, and
> the performance goes in more complex scenarios down.
>
> 'Start with' I was mentioning, would be ideal.
>
> This is trivial especially if you don't have to deal with null edge
> case.  If you wan to search btree'd array column for all elements
> starting with [1,2,3],
>
> SELECT * FROM foo
>   WHERE
>     id >= array[1,2,3]
>     AND id < array[1,2,4];
>
>
> You meant path >= array[1,2,3], right?

right -- I think you get the idea.

> Great for descendants. Almost the same performance (btree vs. gin) in simple
> cases.
> For ancestors, ids can be retrieved directly using unnest. I have to check
> how fast that is in real-life situations, as well as if all my cases can be
> solved without <@ operators (e.g. for descendants of filtered nodes, array
> manipulation has to be used).

That's the key.  Basically it comes down to this.  If all your
searches are anchored from root, then you can get away with simple
btree.  By 'anchored from root',  mean you never query like this:
path = [*,*,3,4,5];  -- where * are wild cards.

The above is only indexable by GIST/GIN.  If you need to do that and
your dataset is large i'd be looking at ltree.

The reason why this works is that mat path exploits a property of a
tree such that root based searches boil down to a a range of node with
unambiguous endpoints on an ordered table.  Celko's 'nested sets'
approach (which although it gets points for very basic SQL
requirements I don't recommend it due to very poor scalability) also
exploits this.

Another way to do matpath is via strings; then you can do prefix
operations with the SQL LIKE operator (path LIKE '1.2.3.%').  Yet
another way (which I've never tried but plan to eventually is via
numerics: see http://arxiv.org/html/cs/0401014).

> Still I do not understand the behavior of gin index, and I read your
> recommendation as don't use it.

I'm not saying that at all.  My point is though is that if your
requirements are satisfied by btree then use that.  GIST/GIN support a
much broader array of operations but that support does come at a
price.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance