Re: CSS & tables

tedd wrote:
At 8:52 PM +0100 5/17/09, Nathan Rixham wrote:
Semantics already are the next big thing, and have been for a year or three. Google acquired the leading semantic analysis software many years ago and has been using it ever since; likewise Yahoo and all the majors. Further, we've all had open access to basic scripts like the Yahoo Term Extraction service for years, and more recently (well, maybe 2+ years) we've had access to Open Calais from Reuters, which will extract some great semantics from any content.

If you've never seen it, the best starting point is probably http://viewer.opencalais.com/

Nathan:

You are always doing this to me -- you're as bad as Rob (but I can usually understand Rob). You guys make my head hurt. It would be nice if I could learn something and that was the end of it. But noooo -- every time I think I've learned something, you people point out my ignorance and keep dragging me back in to learn more -- when will it end? (rhetorical) </rant>.

From what I see, the link you provided will create tool-tips for terms and phrases found in text you provide. For example, if you have "web-standards" in your text, then it will show a tool-tip of "Industry Term: web standards", which is kind of redundant and obvious, don't you think?

Your text can also contain the terms "accessibility", "compliance", and even "W3C" but none of those will be identified. So, what's the big deal?

Has this three-year-old "state-of-the-art" technology advanced so far that it can identify "web standards" but fails on "accessibility", "compliance", and "W3C"?

I don't see the point -- please enlighten me.


Pretty sure Yahoo (and maybe Google) have been parsing RDF semantic data embedded inside comments in XHTML documents for a couple of years now. Even the adding of "tags" generated by semantic extraction is commonplace now and makes a big difference to SEO.

I can understand XML, and maybe everyone will agree on a common namespace for these "Industry Terms" someday, but I do not see the connection between this and SEO. Do you think that because Google *may* be doing this in some fashion, you can duplicate their efforts and gain PR for your site? If so, I think the effort you expend may exceed just attending to content and letting Google do its thing. But I'm simple that way -- I would rather walk around the mountain than move it.

If, however, you mean document-structure semantics, such as using h* tags throughout the document in the correct places, then this is even older and everybody should be doing it - hell, that's what an HTML document is!

That's not what I was talking about. I'm not talking about HTML tags, but rather simple semantic divs for things like header, footer, content and such. It would be nice if everyone *was* doing it, but that's not the case.

In any event, your semantic thing appears more interesting and I suspect there's more to follow. Just wait a moment while I empty my head of useless childhood memories and await the onslaught of new things to consider. :-)


With Open Calais you'll find that it is more tailored to extracting business-centric information - company names, people's names, places, addresses, telephone numbers, quotes, hot topics, events, commercial products, etc. - rather than generic terms. The document viewer I linked you to, however, does not display the full RDF info returned, which can be pretty impressive - often you can run an article through it and it'll be able to tell you that person X said such and such on date Y at place Z.

Then consider Yahoo Term Extraction, which extracts generic and common terms (keywords, key phrases) from bodies of text.

Now let's say you created a simple blog without any categories or tags or anything. You could simply run all your content through Open Calais and Yahoo Term Extraction and use the returned values to correlate related articles and automatically tag all your posts. You could also preg_replace X% of the semantic extracts and terms found in each article into links to other articles on the site.
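
Roughly, the tagging pass could look like this - note the endpoint URL and response shape below are made-up stand-ins; Open Calais and Yahoo Term Extraction each have their own request parameters and response formats, so treat this strictly as a sketch:

<?php
// Sketch: post an article body to a term-extraction service and keep
// whatever terms come back as tags. Endpoint and XML response format
// are hypothetical placeholders, not any real service's API.
function extract_terms($text)
{
    $ch = curl_init('https://extractor.example.com/terms'); // hypothetical
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('context' => $text));
    $xml = curl_exec($ch);
    curl_close($ch);

    $terms = array();
    foreach (simplexml_load_string($xml)->Result as $result) {
        $terms[] = strtolower(trim((string) $result));
    }
    return $terms;
}

// Tag every post, then call two posts "related" when their tag
// sets overlap enough (the threshold of 2 is arbitrary).
$tags = array();
foreach ($posts as $id => $post) {
    $tags[$id] = extract_terms($post['body']);
}

function posts_related($tags_a, $tags_b)
{
    return count(array_intersect($tags_a, $tags_b)) >= 2;
}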

Further, you can re-inject the terms back into the content in cunning but useful ways and further optimise your output. Couple this with the output of the tags on the page and the titles of the related articles and, well, you end up with what is effectively a perfectly optimised page and an auto-associating, context-aware website.
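
The re-injection itself is just a careful preg_replace - something along these lines, where $term_to_url maps an extracted term to its best related article (the helper name and the link cap are my own invention):

<?php
// Sketch: wrap the first occurrence of each extracted term in a link
// to its best-matching related article. Only link a handful of terms
// so the page still reads naturally. (A real version would also avoid
// matching inside existing tags and attributes.)
function inject_links($html, $term_to_url, $max_links = 5)
{
    $done = 0;
    foreach ($term_to_url as $term => $url) {
        if ($done >= $max_links) {
            break;
        }
        $pattern     = '/\b' . preg_quote($term, '/') . '\b/i';
        $replacement = '<a href="' . htmlspecialchars($url) . '">$0</a>';
        $html        = preg_replace($pattern, $replacement, $html, 1, $count);
        $done       += $count;
    }
    return $html;
}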

I built a few versions of systems to do this in the IM sector a couple of years ago - well, I spent 2005-2008 doing pretty much exclusively this and seeing how far one could take it, and the results were rather astonishing.

Here are some rough details on implementations I made:

Related affiliate products
One spider crawled affiliate sites such as ClickBank, went on to the "landing page" of the affiliate offer, extracted the main content, split it into chunks (anchor text, paragraphs, h* tags, titles, etc.), then ran each chunk through the analysers. This created a database of chunks of text and well-written link text, each associated with a tonne of semantic data.
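
The chunking part is straightforward with DOMDocument - something like this (the chunk format is just illustrative):

<?php
// Sketch: pull the title, headings, paragraphs and anchor text out of
// a fetched landing page so each chunk can be analysed separately.
// DOMDocument copes with real-world tag soup once libxml errors are
// suppressed.
function chunk_page($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate broken markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $chunks = array();
    foreach (array('title', 'h1', 'h2', 'h3', 'p', 'a') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                $chunks[] = array('type' => $tag, 'text' => $text);
            }
        }
    }
    return $chunks;
}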

Then another spider ran over all "our" websites' content and did the same thing; it then linked up the semantics and injected affiliate links as titles, URLs, links on images, etc. (even the images were auto-found using the data). This created "real" links, like a human would make, where half a sentence or more in the correct place was linked and it looked like a human had done it - not just a half-related advert in a block at the side of the page.
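
The linking-up step then reduces to scoring chunks against each other by shared terms - crudely, and purely as a sketch of the idea:

<?php
// Sketch: score each affiliate chunk against the terms extracted from
// a piece of our own content and return the best match, which carries
// the landing-page URL and ready-made link text. A real version would
// presumably weight term types (names, products, places) differently.
function best_affiliate_match($our_terms, $affiliate_chunks)
{
    $best       = null;
    $best_score = 0;
    foreach ($affiliate_chunks as $chunk) {
        $score = count(array_intersect($our_terms, $chunk['terms']));
        if ($score > $best_score) {
            $best_score = $score;
            $best       = $chunk;
        }
    }
    return $best;
}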

Another (rather naughty, but legal) one
was to take the Six Apart Atom stream, which publishes every post (written on TypePad, LiveJournal, Vox and a tonne of other sites) the instant it is posted; parse this stream in realtime; extract the article; run it through the semantic analysers and a couple of other tools; "enhance" it with SEO and ads; then publish, ping and post - all within a split second. This was very, very funny: because the sites were publishing so much content before the bots could get to the original source, they were considered to be the original source of all this content - and Six Apart's terms completely allowed it. To take it further, I had several hundred blogs, each with set "topics" and terms, so the content got published on a site which was about whatever the user had written about. Each site that launched was bringing in circa 1,500 unique visitors a day within 3 weeks, and peaked around 3,000.
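
Pulling the useful bits out of one Atom <entry> off the stream is the easy part - splitting the firehose into individual entries is left out here, and SimpleXML handles the Atom namespace fine when the whole entry shares it:

<?php
// Sketch: extract title, permalink and plain-text content from a
// single Atom <entry> so it can go straight into the analysers.
function parse_atom_entry($entry_xml)
{
    $entry = simplexml_load_string($entry_xml);

    $link = '';
    foreach ($entry->link as $l) {
        if ((string) $l['rel'] === 'alternate' || $link === '') {
            $link = (string) $l['href'];
        }
    }

    return array(
        'title'   => (string) $entry->title,
        'link'    => $link,
        'content' => strip_tags((string) $entry->content),
    );
}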

I'll stop now as I have work to do and I'm going on - there is a whole lot you can do with this technology. One thing I didn't touch on is that you can embed semantic RDF data into XHTML pages to provide context to your text, and this does make a big difference.
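
As a taster, embedding that context can be as simple as RDFa attributes on the markup you already output - e.g. with the Dublin Core vocabulary (the attributes are standard RDFa; the wrapper function is mine, purely for illustration):

<?php
// Sketch: annotate an article header with RDFa so crawlers that parse
// embedded RDF get explicit title/author/date context.
function rdfa_article_header($title, $author, $date)
{
    return '<div xmlns:dc="http://purl.org/dc/elements/1.1/">'
         . '<h1 property="dc:title">' . htmlspecialchars($title) . '</h1>'
         . '<p>By <span property="dc:creator">' . htmlspecialchars($author) . '</span>'
         . ' on <span property="dc:date">' . htmlspecialchars($date) . '</span></p>'
         . '</div>';
}

echo rdfa_article_header('My article', 'A. Author', '2009-05-17');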

Anyhow - just some random food for thought. If you want any "real" info rather than a ramble, just let me know and I can sort you out some very interesting links and give you some source to play with. I've actually got the three-year system sitting doing nothing - it ate all my savings making it and I had to get back to paid work, so I never completed it 100% (it is production-runnable, there's just more I wanted to do to it) - might have to get back on with it at some point soon!

How are the childhood memories?

Regards!

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

