Tommy Pham wrote:
> On Wed, Mar 31, 2010 at 11:01 AM, Nathan Rixham <nrixham@xxxxxxxxx> wrote:
>> Tommy Pham wrote:
>>> On Wed, Mar 31, 2010 at 4:06 AM, Nathan Rixham <nrixham@xxxxxxxxx> wrote:
>>>> Tommy Pham wrote:
>>>>> As for spatial data types, I've never found much use for them outside
>>>>> of scientific-related work. For example: if using point as a PK, and
>>>>> if MySQL stores it the same way as PostgreSQL does (16 bytes), how is
>>>>> that any different - performance wise - than using a UUID and storing
>>>>> it as binary(16) in MySQL or uniqueidentifier (16 bytes) in PostgreSQL?
>>>>>
>>>> it's all about the indexing (R-Tree)
>>>>
>>>> http://en.wikipedia.org/wiki/R-tree
>>>>
>>> I can see where the performance difference between a B-Tree and an
>>> R-Tree would be for the same field size, but I've yet to see a
>>> real-life application of it. Case in point: if using point for GPS
>>> coordinates, wouldn't it still be a lot better to use Lon (float) and
>>> Lat (float), which is 2 fields of 4 bytes each, vs 1 field of 16 bytes?
>>> The index would still be faster (8 bytes less) as a B-Tree, right? Not
>>> to mention the smaller row & DB size.
>> wish I still had an application to point you at; and I can't assert in
>> any other way just how much faster it is, especially over large
>> datasets - regardless of the number of rows, that style of index
>> doesn't slow (spatial) queries down.
>
> If I could see a PoC w/o having to see the actual setup, then I would
> probably understand your point better.
>
>> if index byte size is important then rule out spatial indexes (which
>> are rather a lot bigger).
>>
>
> Since spatial indexes are bigger, wouldn't that work against them, due
> to disk access having to read more?

depends on how seriously you take a solution tbh; if you just wanna run
a huge dataset on a sata drive w/ 2GB of RAM then you're screwed whatever
the approach (unless you turn the dataset into many fine-grained
resources, heavily cached and only accessed on a need-to-have basis), but
even then there are many limitations.

one approach is to throw more RAM at the situation; you are definitely
better off loading a huge dataset into one box with massive amounts of
RAM than splitting it over multiple boxes. [1]

[1] http://rickonrails.wordpress.com/2009/03/30/big-ole-mysql-spatial-table-optimization-tricks/

if you're really serious then you want to be on a box with a tonne of RAM
and PCI Express solid-state storage devices, which give you throughput of
700MB/s and higher; prices aren't that bad comparatively, and one machine
w/ an $800 card added works much better than 2-3 servers (although you
still need to cover replication and high availability).

http://www.fusionio.com/ioxtreme/

wandering off course a bit maybe, but these are situations we need to
consider. I don't know how big you think big is, but realistically AWS
EC2 instances coupled with the various "big data" / NoSQL (non-RDBMS)
approaches are the way to go (see the many, many recent discussions on
the subject around the net).
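re the PoC you asked about: below is a rough sketch of the two approaches
being compared, nothing more - the table and column names are hypothetical,
the MySQL 5.x syntax is from memory (spatial indexes need MyISAM there) and
I haven't run it, so treat it as an illustration rather than a benchmark.

<?php
// rough sketch only - hypothetical table/column names and connection
// details; MySQL 5.x syntax from memory, untested
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// approach 1: POINT column + R-tree (SPATIAL) index
$pdo->exec("CREATE TABLE places (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    loc POINT NOT NULL,
    SPATIAL INDEX (loc)
) ENGINE=MyISAM");

$pdo->exec("INSERT INTO places (loc)
            VALUES (GeomFromText('POINT(-0.12 51.50)'))");

// bounding-box lookup; the R-tree prunes on both dimensions at once
$ids = $pdo->query("SELECT id FROM places
                    WHERE MBRContains(
                        GeomFromText('POLYGON((-1 50, 1 50, 1 52, -1 52, -1 50))'),
                        loc)")->fetchAll(PDO::FETCH_COLUMN);

// approach 2: two floats + composite B-tree index
$pdo->exec("CREATE TABLE places_bt (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    lon FLOAT NOT NULL,
    lat FLOAT NOT NULL,
    KEY lon_lat (lon, lat)
) ENGINE=InnoDB");

// the B-tree only narrows on lon here; lat is filtered within that range
$stmt = $pdo->prepare("SELECT id FROM places_bt
                       WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?");
$stmt->execute(array(-1, 1, 50, 52));
$ids_bt = $stmt->fetchAll(PDO::FETCH_COLUMN);

the point of the spatial version is that the R-tree index prunes on both
dimensions at once, whereas the composite B-tree can only narrow on its
leading column and has to filter the other within that range; that's where
the gap shows up once the table gets genuinely large.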
>> regardless though, it was just an idea I was throwing at you, not
>> suggesting it's the best approach, but worth consideration depending on
>> your priorities.
>>
>
> Thanks for your idea. I appreciate it very much. I'm told that I
> over-analyze things at times ... lol ... so please don't take it the
> wrong way :)

ty for clarifying, and deal - I'm the same :)

>>> As for calculating the distance between 2 'points', it would be
>>> simpler for querying purposes to use 'point'. ATM, I don't have a need
>>> to do any such calculations. If I do use 'point' for a PK, I'd run
>>> into problems with scaling and migration later.
>> the points don't matter tbh; or should i say specifying an x and a y
>> doesn't matter, because you can simply get by a single id, or get a
>> range of ids, very quickly; the other value comes into play when you
>> want temporal queries.
>>
>> again though, i reiterate: only useful if you want speed and don't care
>> about index byte size - and certainly not forcing it down your neck.
>>
>> re scaling: unsure; theoretically you have 15-digit precision, which is
>> a mysql DOUBLE/REAL, and that's for each value in the pair, so an
>> almost incomprehensible number of rows is possible before a collision.
>>
>> all in all: there are many good reasons why I've only used this on
>> occasion (specifically when dealing with 1 million+ rows in an RDBMS
>> table); and many more good reasons why i stick to non-rdbms databases
>> and use http uri's + eav model data only nowadays.
>>
>> The latter i would advocate, the former not so much, unless you are
>> stuck w/ mysql / postgres - in which case you can't get faster. [1]
>>
>> [1] will get some numbers to prove it when i have time.
>>
>> Nathan
>>
>
> Using HTTP URI's + EAV model is something I'll definitely look into.
>

good good; i can assure you it's the future; all the major corps and
governments are moving to this (age-old) model under the banners of
linked data, odata and, partially, gdata - and meanwhile the development
community is slowly getting there, with the focus moving to kv databases,
which are just one step away from the full-on eav / triple approach.

Many regards,

Nathan

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php