tedd wrote:
At 8:52 PM +0100 5/17/09, Nathan Rixham wrote:
Semantics already are the next big thing, and have been for a year or
three. Google acquired the leading semantic analysis software many
years ago and has been using it ever since; likewise Yahoo and all
the majors. Further, we've all had open access to basic scripts
like the Yahoo Term Extraction service for years, and more recently
(well, maybe 2+ years) we've had access to Open Calais from Reuters,
which will extract some great semantics from any content.
If you've never seen it, the best starting point is probably
http://viewer.opencalais.com/
Nathan:
You are always doing this to me -- you're as bad as Rob (but I can
usually understand Rob). You guys make my head hurt. It would be nice if
I could learn something and that was the end of it. But noooo -- every
time I think I've learned something, you people point out my ignorance
and keep dragging me back in to learn more -- when will it end?
(rhetorical) </rant>.
From what I see, the link you provided will create tool-tips for terms
and phrases found in text you provide. For example, if you have
"web-standards" in your text, then it will show a tool-tip of "Industry
Term: web standards", which is kind of redundant and obvious, don't you
think?
Your text can also contain the terms "accessibility", "compliance", and
even "W3C" but none of those will be identified. So, what's the big deal?
Has this three year old "state-of-the-art" technology advanced so far
that it can identify "web standards" but fails on "accessibility",
"compliance", and "W3C"?
I don't see the point -- please enlighten me.
Pretty sure Yahoo (and maybe Google) have been parsing RDF semantic
data embedded inside comments in XHTML documents for a couple of years
now; even the adding of "tags" generated by semantic extraction is
commonplace now and makes a big difference to SEO.
I can understand XML and maybe everyone will agree on a common namespace
for these "Industry Terms" someday, but I do not see the connection
between this and SEO. Do you think that because Google *may* be doing
this in some fashion that you can duplicate their efforts and gain PR
for your site? If so, I think the effort you expend may exceed just
attending to content and letting Google do its thing. But, I'm simple
that way -- I would rather walk around the mountain than move it.
If, however, you mean document structure semantics, such as using h* tags
throughout the document in the correct places, then this is even older
and everybody should be doing it - hell, that's what an HTML document is!
That's not what I was talking about. I'm not talking about html tags but
rather simple semantic divs for things like header, footer, content and
such. It would be nice if everyone *was* doing it, but that's not the case.
In any event, your semantic thing appears more interesting and I suspect
there's more to follow. Just wait a moment, while I empty my head of
useless childhood memories and await the onslaught of new things to
consider. :-)
With Open Calais you'll find that it is more tailored to extracting
business-centric information such as company names, people's names,
places, addresses, telephone numbers, quotes, hot topics, events,
commercial products etc., rather than generic terms. The document viewer
I linked you to, however, did not display the full RDF info returned,
which can be pretty impressive - often you can run an article through it
and it'll be able to tell you that person x said such-and-such on date y
at place z.
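To give you a feel for it, here's a minimal sketch of the call from
PHP - the endpoint, parameter names and paramsXML shape are from my
memory of the REST docs, so treat them as assumptions and check the
current documentation before relying on them:

<?php
// minimal sketch of an Open Calais request - endpoint, parameter
// names and paramsXML format are assumptions from memory of the docs
$articleText = 'Prime Minister Gordon Brown said on Monday in London...';

$paramsXML = '<c:params xmlns:c="http://s.opencalais.com/1/pred/">'
           . '<c:processingDirectives c:contentType="text/txt"'
           . ' c:outputFormat="xml/rdf"/></c:params>';

$ch = curl_init('http://api.opencalais.com/enlighten/rest/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'licenseID' => 'YOUR_API_KEY', // placeholder - use your own key
    'content'   => $articleText,
    'paramsXML' => $paramsXML,
)));
$rdf = curl_exec($ch); // RDF/XML naming people, places, events, quotes...
curl_close($ch);
?>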
Then consider Yahoo Term Extraction, which extracts generic and common
terms (keywords, key phrases) from bodies of text.
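The call itself is equally trivial - a rough sketch, assuming the V1
endpoint I remember (verify it, the service has moved around over the
years):

<?php
// rough sketch of a Yahoo Term Extraction request - the V1 endpoint
// below is from memory, so double-check it
$articleText = 'Your blog post body goes here...';

$ch = curl_init('http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'appid'   => 'YOUR_APP_ID', // placeholder
    'context' => $articleText,
)));
$response = curl_exec($ch);
curl_close($ch);

// the response is a simple XML ResultSet; grab each <Result> term
preg_match_all('#<Result>(.*?)</Result>#s', $response, $matches);
$terms = $matches[1]; // e.g. "web standards", "open calais"
?>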
Now let's say you created a simple blog without any categories or tags or
anything. You could simply run all your content through Open Calais
and Yahoo Term Extraction and use the returned values to correlate
related articles and automatically tag all your posts. You could also
preg_replace X% of the semantic extracts and terms found in each article
into links to other articles on the site.
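The correlation step really is as dumb as it sounds - a minimal sketch,
assuming you've already stored each article's extracted tags:

<?php
// sketch: rank other articles by how many extracted tags they share
// $articles is assumed to be array(articleId => array of tag strings)
function related_articles($id, array $articles, $limit = 5)
{
    $scores = array();
    foreach ($articles as $otherId => $tags) {
        if ($otherId == $id) {
            continue;
        }
        $overlap = count(array_intersect($articles[$id], $tags));
        if ($overlap > 0) {
            $scores[$otherId] = $overlap;
        }
    }
    arsort($scores); // best matches first
    return array_slice(array_keys($scores), 0, $limit);
}
?>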
Further, you can re-inject the terms back in to the content in cunning
but useful ways and further optimise your output. Couple this with the
output of the tags on the page and the titles of the related articles
and, well, you end up with what is effectively a perfectly optimised
page and an auto-associating, context-aware website.
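The re-injection can be as simple as one preg_replace per term - a
sketch, where the term-to-URL map is whatever your own extraction and
correlation steps produced:

<?php
// sketch: link the first occurrence of each known term to a related
// article; $termToUrl comes from your own extraction/correlation step
function link_terms($html, array $termToUrl)
{
    foreach ($termToUrl as $term => $url) {
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        $html = preg_replace($pattern,
                             '<a href="' . $url . '">$0</a>',
                             $html, 1); // first occurrence only
    }
    return $html;
}

echo link_terms('More on web standards soon.',
                array('web standards' => '/articles/web-standards'));
?>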
I built a few versions of systems to do this in the IM sector a
couple of years ago - well, I spent 2005-2008 doing pretty much
exclusively this, seeing how far one could take it, and the results
were rather astonishing.
Here are some rough details on implementations I made:
Related affiliate products:
One spider crawled affiliate sites such as ClickBank, then went on to
the "landing page" of the affiliate offer, extracted the main content,
split it into chunks (anchor text, paragraphs, h* tags, titles etc.),
then ran each through the analysers. This created a database of chunks
of text and well-written link text that was associated with a tonne of
semantic data.
Then another spider ran over all "our" websites' content, did the same
thing, then linked up the semantics and injected affiliate links as
titles, URLs, links on images etc. (even the images were auto-found
using the data). Now this created "real" links, like a human would make,
where half a sentence or more in the correct place was linked, and it
looked like a human had done it - not just a half-related advert in a
block at the side of the page.
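The matching itself was nothing exotic - a sketch of the idea, with a
hypothetical structure for the stored chunks:

<?php
// sketch: pick the stored affiliate link text whose semantic tags
// best overlap the page being processed; the chunk structure here
// is hypothetical
function best_affiliate_chunk(array $pageTags, array $chunks)
{
    $best = null;
    $bestScore = 0;
    foreach ($chunks as $chunk) { // each: array('anchor', 'url', 'tags')
        $score = count(array_intersect($pageTags, $chunk['tags']));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $chunk;
        }
    }
    return $best; // null if nothing overlapped at all
}
?>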
Another (rather naughty, but legal) one:
This was to take the Six Apart Atom stream, which publishes every post
(written on TypePad, LiveJournal, Vox and a tonne of other sites) the
instant it is posted - parse this stream in realtime, extract the
article, run it through the semantic analysers and a couple of other
tools, "enhance" with SEO and ads, then publish, ping and post - all
within a split second. This was very, very funny, because the site(s)
were publishing the content before the bots could get to the original
source, so they were considered to be the original source of all this
content - and Six Apart's terms completely allowed it. To further
enhance it, I had several hundred blogs, each with set "topics" and
terms, so the content got published on a site which was about whatever
the user had written about. Each site that launched was bringing in
circa 1,500 unique visitors a day within 3 weeks, and peaked around
3,000.
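For the curious, the reading side was roughly this shape - a toy
sketch; the host and path are from memory and the stream is long gone,
and process_post() is a hypothetical stand-in for the whole analysis
and publish pipeline:

<?php
// toy sketch of reading the (now dead) Six Apart update stream;
// host/path are from memory, process_post() is a stand-in for the
// semantic analysis + publish pipeline
$fp = fsockopen('updates.sixapart.com', 80, $errno, $errstr, 30);
if (!$fp) {
    die("connect failed: $errstr");
}
fwrite($fp, "GET /atom-stream.xml HTTP/1.0\r\n"
          . "Host: updates.sixapart.com\r\n\r\n");

$buffer = '';
while (!feof($fp)) {
    $buffer .= fread($fp, 8192);
    // the stream is an endless series of <entry>...</entry> blocks
    while (preg_match('#<entry[\s>].*?</entry>#s', $buffer, $m)) {
        $entry = @simplexml_load_string($m[0]);
        if ($entry !== false) {
            process_post((string) $entry->title, (string) $entry->content);
        }
        $buffer = substr($buffer, strpos($buffer, $m[0]) + strlen($m[0]));
    }
}
fclose($fp);
?>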
I'll stop now, as I have work to do and I'm going on a bit - there is a
whole lot you can do with this technology. One thing I didn't touch on
is that you can embed semantic RDF data in to XHTML pages to provide
context to your text, and this does make a big difference.
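For example, something along these lines - RDFa-style attributes with
the Dublin Core vocabulary, which is just one way of doing it:

<?php
// sketch: emitting XHTML with RDFa-style attributes so crawlers get
// explicit semantics; Dublin Core here is just an example vocabulary
echo <<<XHTML
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h2 property="dc:title">Semantics and SEO</h2>
  <p>Written by <span property="dc:creator">Nathan Rixham</span>
     on <span property="dc:date">2009-05-17</span>.</p>
</div>
XHTML;
?>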
Anyhow - just some random food for thought. If you want any "real" info
rather than a ramble, just let me know and I can sort you out some very
interesting links and give you some source to play with. I've actually
got the 3-year system sitting doing nothing - it ate all my savings
making it and I had to get back to paid work; I never completed it 100%
(but it is production-runnable, there was just more I wanted to do to
it) - might have to get back on with it at some point soon!
How are the childhood memories?
regards!
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php