Re: arrays of floating point numbers / linear algebra operations into the DB

Ted Byers <r.ted.byers@xxxxxxxxxx> · Sat, 2 Feb 2008 00:47:45 -0500 (EST)

--- Webb Sprague <webb.sprague@xxxxxxxxx> wrote:
> > >>>> ...linear algebra ...
> > >>> ... matrices and vectors .
> > >> ...Especially if some GIST or similar index could efficiently search
> > >> for vectors "close" to other vectors...
> > >
I see a potential problem here, in terms of how one
defines "close" or similitude.  I think, though,
practical answers can be found in examples of applying
quantitative methods in some subdisciplines of
biology.
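
As a rough illustration of why the definition of "close" matters, here is a minimal Python sketch (not anything from the thread or from PostgreSQL itself) in which the nearest neighbour of a query vector changes depending on whether Euclidean or cosine distance is used. The vectors and names are made up for the example.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors; ignores magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical data: one candidate points the same way as the query but
# is much longer; the other is nearby but points in a different direction.
query = (1.0, 1.0)
candidates = {"scaled_copy": (10.0, 10.0), "near_point": (1.0, 2.0)}

nearest_euclid = min(candidates, key=lambda k: euclidean(query, candidates[k]))
nearest_cosine = min(candidates, key=lambda k: cosine_distance(query, candidates[k]))

print(nearest_euclid, nearest_cosine)
```

The two measures disagree on which candidate is "closest", so any index built for similarity search has to commit to one definition up front.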

> > > Hmm.  If I get some more interest on this list (I need just one LAPACK
> > > / BLAS hacker...), I will apply for a pgFoundry project and appoint
> > > myself head of the peanut gallery...
> >
Someone pointed to the potential utility of pl/R.  I
would be interested at least in learning about your
assessment of the two (postgis and pl/r).  Alas, I
don't have decent data I could use to experiment with
either (except possibly for time series analysis,
which is a completely different kettle of fish).
> 
> > and deal with a big database doing lots of similarity-based searches (a
> > 6'2" guy with light brown hair being similar to a 6'1" guy with dark
> > blond hair) - and am experimenting with modeling some of the data as
> > vectors in postgres.
> 
> Well,  I bet a good linear algebra library would help.  A lot. :)
> 

If you're looking at similarity, and some practicality
in the USE of quantitative procedures, you may want to
look into the biogeography and numerical taxonomy
literature, and to a lesser extent quantitative plant
ecology.  All three subdisciplines of biology have
decades of experience, and copious literature, looking
at similarity measures, and in my experience that
literature is much more practical or pragmatic than
the 'pure' biostatistics literature, and infinitely
more practical than any theoretical statistical or
mathematical literature I
have seen (trust me, I have a bookcase full of this
"stuff").
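
One measure from the numerical taxonomy literature that handles mixed numeric and categorical descriptors is Gower's general similarity coefficient. The sketch below is a simplified, unweighted version applied to the height and hair colour example quoted above; the records, the field names, and the assumed height range of 24 inches are all invented for illustration.

```python
def gower_similarity(a, b, numeric_ranges):
    """Unweighted Gower-style similarity between two records (dicts)."""
    scores = []
    for key in a:
        if key in numeric_ranges:
            # Numeric descriptor: 1 minus the range-scaled absolute difference.
            scores.append(1.0 - abs(a[key] - b[key]) / numeric_ranges[key])
        else:
            # Categorical descriptor: strict match (1) or mismatch (0).
            scores.append(1.0 if a[key] == b[key] else 0.0)
    return sum(scores) / len(scores)

# Hypothetical records; heights in inches, range assumed for the population.
ranges = {"height_in": 24.0}
p1 = {"height_in": 74, "hair": "light brown"}
p2 = {"height_in": 73, "hair": "dark blond"}
p3 = {"height_in": 62, "hair": "light brown"}

s12 = gower_similarity(p1, p2, ranges)
s13 = gower_similarity(p1, p3, ranges)
print(round(s12, 3), round(s13, 3))
```

With strict match/mismatch scoring, light brown and dark blond count as entirely different, so the 6'2" and 6'1" pair scores lower than the pair that differs by a foot in height but shares a hair colour. A graded or ordered coding of hair colour would change that, which is exactly the kind of practical choice that literature spends its time on.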

A good linear algebra library would be useful, but
there are a lot of nonlinear analyses that would be of
interest; and there are nonparametric, yet
quantitative approaches that are of considerable
interest in assessing similarity.

I don't know of work that applies things like
discriminant function analysis, cluster analysis, or
any of the many ordination analyses one might consider
to searches in a database, but then I haven't looked
at the question since I graduated.  I am interested in
the question, though, and would be interested in
hearing about your experience with it.

If I can manage the time, I hope to start a project
where I can store description data for specimens of
plants and animals, use analyses including but not
limited to ordination, clustering, discriminant
functions, canonical correlation, to create a
structure for comparing them, and for identifying new
specimens, or at a minimum, if the specimen is truly
something unknown, learn what known specimens or
groups thereof it is most similar to, and how it is
different.
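
To make the identification step concrete, here is a deliberately simple Python sketch that uses a nearest-centroid rule as a stand-in. It is not discriminant analysis or ordination; it only shows the shape of the lookup: find the known group a new specimen is most similar to, and report how it differs. The group names and measurements are hypothetical, and the descriptors are assumed to be already on comparable scales.

```python
import math
from collections import defaultdict

def centroids(specimens):
    """Mean descriptor vector per group; specimens is a list of (group, vector)."""
    sums = {}
    counts = defaultdict(int)
    for group, vec in specimens:
        if group not in sums:
            sums[group] = [0.0] * len(vec)
        sums[group] = [s + v for s, v in zip(sums[group], vec)]
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

def identify(vec, group_centroids):
    """Return (nearest group, distance to it, per-descriptor differences)."""
    def dist(c):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec, c)))
    best = min(group_centroids, key=lambda g: dist(group_centroids[g]))
    diffs = [x - y for x, y in zip(vec, group_centroids[best])]
    return best, dist(group_centroids[best]), diffs

# Hypothetical known specimens with two measured descriptors each.
known = [
    ("species_a", (5.0, 3.0)),
    ("species_a", (5.2, 3.2)),
    ("species_b", (7.0, 1.0)),
    ("species_b", (7.2, 1.2)),
]
group, distance, diffs = identify((5.3, 2.9), centroids(known))
print(group, round(distance, 3), [round(d, 2) for d in diffs])
```

A large distance to even the nearest centroid would be one crude signal that the specimen is something unknown, while the per-descriptor differences say how it departs from the closest known group.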

I have managed to install pl/r, but I haven't had the
time to figure out how best to analyze data stored in
the database using it.  The data I do have changes
daily, and some of the tables are well over 100MB, so
I am a bit worried about how well it can handle such
an amount of data, and how long it would take.

Cheers,

Ted  
