Search Postgresql Archives

Modeling combinations (options and dependencies)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello. I am struggling for the longest time about data I can't fit into a satisfying model. I thought about giving up about this project many times...
Context: it's for a web server so speed is of the essence. It's an amateur project so expect cheap server. I am also more a novice than a seasoned user so please have mercy on me.

Here we go, I hope I will be able to correctly explain my problem because chatgpt failed spectacularly to suggest anything even remotely worthwhile (yes, pretty desperate move, I know, but I am a desperate man).

-------------------------------------
DATA
-------------------------------------

-You have objects that are linked to many tables describing their properties, such as an object__tag table.
-These objects are sold in releases.
-Thing is, objects can be made of objects. I call them superobjects when they assume that container role (as objects, they can be part of other superobjects). Multiple releases can be linked to one superobject. Yes, it's a parts scenario.
-Objects can be shuffled in all sorts of unpredictable ways so it's useless to think that if one object is contained inside some superobject A, if that superobject A will be contained inside another superobject B, superobject B will inherit all objects from superobject A. In fact, my attempt at a solution will instantiate objects to link them to superobjects.
-In a superobject, multiple objects can be grouped together into a set of options. One option can be made of multiple objects (majority of the time, just one). You can choose how many options minimum and maximum (aka a range) you can choose from a set (majority of the time, just one). It's possible to have a NONE choice aka you don't choose any option.
-Any object can have 0 (common), 1 (common), or more (very rare) dependencies.

That's pretty much all.
It looks like a hierarchy but it is not really one. It looks like a graph but probably not one either.
It's made of multiple concepts that, alone, are difficult enough already, but now they are all grouped together and it's going well over my head.
In the end, I think it's about combinations with a bit of hierarchy to make everything more painful...

Here is a simple "tree" representation of that data. Think of "( )" as radio buttons. Empty linebreaks are separating sets of options. X indicates mandatory nodes (unless their parent is not chosen). "(n-m)" represents the min-max range for a set of options.

(x) Object 0 (it's the superobject)

(1-1) Object A1
(1-1) Object A2
(1-1) Object A3

(x) Object B1

(1-1) Object C1
│ (x) Object D1
│
│ (0-1) Object E1
│ (0-1) Object E2 + Object E3
│
│ (1-2) Object F1
│ (1-2) Object F2 {requires A2}
│
(1-1) Object C2
│ (x) Object G1
│
│ (x) Object E1

As you can see, a user can choose between three options A, E (with the second option being made of two objects and the third option making this set optional), two options C, F. If they choose object C2, they gets objects G1 and E1 by default.
Choosing object F2 forces you to have both objects C1 and A2. It's ugly to represent constraints like that but impossible to draw in 2D properly.

The reason why I said that it wasn't really a hierarchy is that you have three groups at "depth 1": object 0, object B, and objects C with their children. These three groups are all equal to each others and their position could be swapped with each other. Same with the children of objects C. Groups D, E, F, could be swapped with each other.

When you buy a release, you have to go from the top to the bottom and go through every branch until you reach its end.
Here, you could buy objects 0, A2, B1, C1, D1, E2, E3, F2.
Or 0, A1, B1, C1, D1, F1.
Or 0, A3, B1, C2, G1, E1.

Superobjects don't have much depth to begin with, depth 6 max.
Sets of options per superobject won't go probably above 10, but options themselves can go pretty high like 50.
Most objects don't have dependencies to tell you. The average superobject is made of a bunch of mandatory objects and few sets of options with few options (half of them probably being if you want the object or not).
I know it's vague but the data is unpredictable with lot of exceptions and that's why I need flexibility and can't hardcode anything.

-------------------------------------
PURPOSE
-------------------------------------

I don't really care at the moment as of how to render that tree above as it doesn't appear to be a big challenge. What I care is building a performant search engine to find superobjects (and releases above them) given columns of objects (subqueries) selected through joining objects with their related tableS such as object__tag. You have to find superobjects you can buy, while respecting all those annoying constraints. So, if you search for a superobject containing E2 and G1, you cannot find superobject 0 because it's on different branches! Same, if you search for A1 and A2, you cannot find superobject 0 because they are in the same set of options and choosing one prevent you from choosing the other!

-------------------------------------
BAD SOLUTION 1
-------------------------------------

My first thought was to create a "hierarchy". You put the three groups at depth 1 one after the other. The order doesn't matter. What matters is to have all possibilities the one after the other. One path of such tree would be 0-A1-B1-C1-D1-E2-E3-F1. It would be a tree representing all possible combinations.

The problem is that such a tree grows exponentially.
If you have 3 options, then 4, then 5, the tree will have 60 paths. If you add another one of 3 it's 180. I estimated that you have some freak superobjects that could spawn more than 1000 possibilities. It's insane just to check few sets of options! I just need some freak exceptions I never imagined would exist to clog my database, meaning I would have to limit the number of combinations a superobject could have to be on the safe side.
And how do you even begin being efficient if it's to throw at such a hierarchy four columns of object.id to find paths that hit a combination?? If I have to check recursively hundreds of paths that do not even have a predictable order just to find a list of superobjects, I don't see my website becoming very performant.

I read a quite a "bit" on hierarchies but couldn't find anything useful. Most stuff is focused on basic cases such as finding parents or children, shortest path, subtrees, but never combinations.

I also tried to flatten all of that and dived into ltree, arrays and jsonb, thinking I could use somehow that interesting containment property, but it was before finally getting that you can't even throw columns at such types.

-------------------------------------
(NOT COMPLETELY?) BAD SOLUTION 2
-------------------------------------

It's the one I am going for at the moment (given it's the only one I found, on my own...) but it's a wasteful solution. It takes a lot of place and every time you want to throw another column of objects to the query, it becomes an order of magnitude slower because the progressions of relations (and therefore joins) follows a triangle sequence (1, 3, 6, 10, 15, 21, etc.) as you will see.

I hesitated before explaining my solution because I don't want to divert the discussion about trying to build over it if it's to ignore a better solution, but it's better to show that I did some homework before asking for a solution.

CREATE TABLE object (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
CONSTRAINT object__pk PRIMARY KEY (id)
);

CREATE TABLE superobject (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
object_id integer NOT NULL,
CONSTRAINT superobject_pk PRIMARY KEY (id)
);
ALTER TABLE superobject ADD CONSTRAINT superobject__object_id_fk FOREIGN KEY (object_id) REFERENCES object (id);

CREATE TABLE release (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
superobject_id integer NOT NULL,
CONSTRAINT release_pk PRIMARY KEY (id)
);
ALTER TABLE release ADD CONSTRAINT release__superobject_id_fk FOREIGN KEY (superobject_id) REFERENCES superobject (id);

CREATE TABLE instance (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY ,
object_id integer NOT NULL,
CONSTRAINT instance__pk PRIMARY KEY (id,object_id)
);
ALTER TABLE instance ADD CONSTRAINT instance__object_id_fk FOREIGN KEY (object_id) REFERENCES object (id);

CREATE TABLE pairing (
superobject_id integer NOT NULL,
instance_a integer NOT NULL,
instance_b integer NOT NULL,
CONSTRAINT pairing__pk PRIMARY KEY (superobject_id,instance_a,instance_b)
);
ALTER TABLE pairing ADD CONSTRAINT pairing__superobject_id_fk FOREIGN KEY (superobject_id) REFERENCES superobject (id);
ALTER TABLE pairing ADD CONSTRAINT pairing__instance_a_fk FOREIGN KEY (instance_a) REFERENCES instance (id);
ALTER TABLE pairing ADD CONSTRAINT pairing__instance_b_fk FOREIGN KEY (instance_b) REFERENCES instance (id);

Here is a visual representation of my schema if it can help but be aware that it doesn't match exactly what I wrote above as it's a stripped down version:
https://i.imgur.com/AUau65W.png

In the table pairing, you will add for every object which object it is linked to, no matter the depth.
If one has no dependency, it will have a row for every object.
If one is an option, it won't have a row linking it to its fellow options or any of their sub-options.

Example (instances a and b are replaced by the names of the objects for clarity):
(sorry, can't format that properly in a mail)

superobject_instance_a_instance_b
0___________0__________A1
0___________0__________A2
0___________0__________A3
0___________0__________B1
0___________0__________C1
0___________0__________D1
0___________0__________E1
0___________0__________E2
0___________0__________E3
0___________0__________F1
0___________0__________F2
0___________0__________C2
0___________0__________G1
0___________0__________E1
0___________A1_________0
0___________A1_________B1
0___________A1_________C1
0___________A1_________D1
0___________A1_________E1
0___________A1_________E2
0___________A1_________E3
0___________A1_________F1
0___________A1_________F2
0___________A1_________C2
0___________A1_________G1
0___________A1_________E1
0___________A2_________0
0___________A2_________B1
0___________A2_________C1
0___________A2_________D1
0___________A2_________E1
0___________A2_________E2
0___________A2_________E3
0___________A2_________F1
0___________A2_________F2
0___________A2_________C2
0___________A2_________G1
0___________A2_________E1
0___________A3_________0
...
0___________B1_________0
...
0___________C2_________0
0___________C2_________A1
0___________C2_________A2
0___________C2_________A3
0___________C2_________B1
0___________C2_________G1
0___________C2_________E1
0___________C1_________0
0___________C1_________A1
0___________C1_________A2
0___________C1_________A3
0___________C1_________B1
0___________C1_________D1
0___________C1_________E1
0___________C1_________E2
0___________C1_________E3
0___________C1_________F1
0___________C1_________F2
0___________D1_________0
...
etc.

As you can see, it is super wasteful as you have the same relation in both directions. 0-A1 and A1-0 are identical given that this table has two columns semantically identical. You could search in one (instance_a -> instance_b) or the other direction (instance_b -> instance_a) and it would be the same. I asked on dba.stackexchange (https://dba.stackexchange.com/questions/326860/bidirectional-self-join-table) if you could split the table in half but I don't think it's possible.

The benefit is that it doesn't grow exponentially. Sure, it's closer to a quartesian product (total_of_objects * total_of_objects-1 when no set of options involved) but at least it stays "reasonable".

Instantiating is mandatory because if you don't make every object of a tree unique, you will get collisions. With paths A-B-C, E-B-D, D will be linked to A through A-B and B-D even though they are not. Instantiating B will prevent that.

As for the queries:

--FIND SUPEROBJECTS GIVEN 1 OBJECT (super fast)
SELECT superobject_id FROM pairing
WHERE instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
GROUP BY superobject_id;

--FIND SUPEROBJECTS GIVEN 2 OBJECTS (super fast)
SELECT superobject_id FROM pairing
WHERE instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
GROUP BY superobject_id;

--FIND SUPEROBJECTS GIVEN 3 OBJECTS (fast?)
SELECT superobject_id
FROM (SELECT p1.superobject_id, p1.instance_b AS b, p2.instance_b AS c FROM pairing p1
JOIN pairing p2 ON p1.superobject_id = p2.superobject_id and p1.instance_a = p2.instance_a
WHERE p1.instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND p1.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
AND p2.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <third_criterion>)
intersect
SELECT superobject_id, instance_a AS b, instance_b AS c FROM pairing) AS results
GROUP BY superobject_id

It's basically a triangle.
<->A<->B<->C<->

--FIND SUPEROBJECTS GIVEN 4 OBJECTS (uh oh?)
SELECT first_pass.superobject_id FROM (SELECT p1.superobject_id, p1.instance_b AS b, p2.instance_b AS c, p3.instance_b AS d FROM pairing p1
JOIN pairing p2 ON p1.superobject_id = p2.superobject_id and p1.instance_a = p2.instance_a
JOIN pairing p3 ON p1.superobject_id = p3.superobject_id and p1.instance_a = p3.instance_a
WHERE p1.instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND p1.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
AND p2.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <third_criterion>)
AND p3.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <fourth_criterion>)) AS first_pass
JOIN pairing p4 ON first_pass.superobject_id = p4.superobject_id and first_pass.b = p4.instance_a and first_pass.c = p4.instance_b
JOIN pairing p5 ON first_pass.superobject_id = p5.superobject_id and first_pass.b = p5.instance_a and first_pass.d = p5.instance_b
JOIN pairing p6 ON first_pass.superobject_id = p6.superobject_id and first_pass.c = p6.instance_a and first_pass.d = p6.instance_b
GROUP BY first_pass.superobject_id

Here you have a square and its two diagonales.
I am not sure it's the best way to do it.

--FIND SUPEROBJECTS GIVEN 5 OBJECTS (???)
something something NINE JOINS
A pentacle and it's pentagram of joins...

As you can see, it is already quite the beast just to find a superobject made of 4 subqueries. It doesn't look sustainable and is quite infuriating to have to build such intricate polygons from small instance_a-instance_b pieces when the majority of superobjects will be flat.

Anyway, I hope to find some solace in your hands. ヾ(_ _*)
Feel free to ask any question if something is not clear.






[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux