Re: SQL - planet redundant data

"Jim C. Nasby" <jnasby@xxxxxxxxxxxxx> · Sat, 17 Sep 2005 19:31:11 -0500

On Sun, Sep 11, 2005 at 11:00:02PM -0800, Poul Jensen wrote:
> Tom Lane wrote:
> 
> >No, tableoid is sort of a virtual column ... it doesn't exist on disk.
> >When you query it you get a value fetched from the internal data
> >structure representing the table.
> > 
> >
> So virtual columns are possible - THIS is a way to clear redundant data! 
> Is it possible for a user to create a virtual column? If not, this would 
> make a big improvement.
> 
> What I really need are "partial virtual columns". I'm imagining an 
> alternative version of VACUUM ANALYZE that could do the following:
> 1) Order the rows in the table so that for each column, identical values 
> are placed next to each other for as far as possible (the row order that 
> optimizes one column will probably not be optimal for other columns).
> 2) For each column, identify the stretches that contain only one 
> distinct value. Save that value together with ID of start and end row 
> and delete stretch.
> It is not obvious how to do a perfect optimization process in 1), at 
> least not to me - I'm sure a skilled mathematician would know exactly 
> how to do it. But here's a simple approach that would get us part of the 
> way:
> 1.1) Grab the column w. most redundancy (fewest distinct values) and 
> sort it into groups according to the distinct values.
> 1.2) For each of these groups, grab the column w. next most redundancy 
> and sort into groups according to the distinct values.
> And so on. Stop whenever groups become so small that there's nothing to 
> gain.
> Such an analysis would make it much less expensive to combine 
> same-schema tables, and having everything in the same table is really 
> convenient. It would obviously save a lot of storage space, but I 
> imagine it would enable more efficient queries too - having to check 
> just 3 values instead of the thousands (or even millions) they may 
> replace must give a considerable gain.
> 
> 'What is the big benefit of not having ordered rows? I imagine it could 
> be a disadvantage for dynamic databases, but for a static database like 
> mine which won't be modified, except for maybe adding new data once a 
> year, I imagine an optimization including row ordering could be highly 
> beneficial.

Oracle supports something akin to this. On an Index Organized Table you
can tell it to pull part of the index key out of individual rows. IE:

CREATE TABLE (a, b, c, d, e, f)
    PRIMARY KEY(a, b, c, d)
    INDEX ORGANIZED COMBINE(a, b)
;

In this case, every time the combination of (a, b) changes, Oracle
stores a special record of some kind that indicates the change, and it
doesn't store a or b with each row. (Note that I'm guessing on syntax,
it's been a while since I've used Oracle).

> ##############################
> 
> Jim C. Nasby wrote:
> 
> >What you seem to be looking for is a form of partitioning. PostgreSQL
> >doesn't currently support partitioning of this form, but there's work in
> >progress to change that.
> > 
> >
> Any idea how far out in the future this is? Would it make the 
> optimization process described above (reply to Tom Lane) obsolete? Well, 
> maybe my ideas about an optimal solution just illustrate lack of 
> knowledge about SQL, but I'm hoping somebody can see what I'm trying to 
> suggest.

Well, the firslt important step is slated for 8.1; effective partition
elimination. You can search for more info, especially in the BizGres
archives. Next step is creating actual partitioning syntax to make
managing partitions easier.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@xxxxxxxxxxxxx
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq