tedd wrote:
At 8:52 PM +0100 5/17/09, Nathan Rixham wrote:
Semantics already are the next big thing, and have been for a year or
three. Google acquired the leading semantic analysis software many
years ago and has been using it ever since; likewise Yahoo and all
the majors. Further, we've all had open access to basic scripts
like the Yahoo Term Extraction service for years, and more recently
(well, maybe 2+ years) we've had access to Open Calais from Reuters,
which will extract some great semantics from any content.
If you've never seen it, the best starting point is probably
http://viewer.opencalais.com/
Nathan:
You are always doing this to me -- you're as bad as Rob (but I can
usually understand Rob). You guys make my head hurt. It would be nice if
I could learn something and that was the end of it. But noooo -- every
time I think I've learned something, you people point out my ignorance
and keep dragging me back in to learn more -- when will it end?
(rhetorical) </rant>.
From what I see, the link you provided will create tool-tips for terms
and phrases found in text you provide. For example, if you have
"web-standards" in your text, then it will show a tool-tip of "Industry
Term: web standards", which is kind of redundant and obvious, don't you
think?
Your text can also contain the terms "accessibility", "compliance", and
even "W3C" but none of those will be identified. So, what's the big deal?
Has this three year old "state-of-the-art" technology advanced so far
that it can identify "web standards" but fails on "accessibility",
"compliance", and "W3C"?
I don't see the point -- please enlighten me.
Pretty sure Yahoo (and maybe Google) have been parsing RDF semantic
data embedded inside comments in XHTML documents for a couple of years
now; even the adding of "tags" generated by semantic extraction is
commonplace now and makes a big difference to SEO.
I can understand XML and maybe everyone will agree on a common namespace
for these "Industry Terms" someday, but I do not see the connection
between this and SEO. Do you think that because Google *may* be doing
this in some fashion that you can duplicate their efforts and gain PR
for your site? If so, I think the effort you expend may exceed just
attending to content and letting Google do its thing. But, I'm simple
that way -- I would rather walk around the mountain than move it.
If, however, you mean document structure semantics, such as using h* tags
throughout the document in the correct places, then this is even older
and everybody should be doing it - hell, that's what an HTML document is!
That's not what I was talking about. I'm not talking about html tags but
rather simple semantic divs for things like header, footer, content and
such. It would be nice if everyone *was* doing it, but that's not the case.
In any event, your semantic thing appears more interesting and I suspect
there's more to follow. Just wait a moment, while I empty my head of
useless childhood memories and await the onslaught of new things to
consider. :-)
With Open Calais you'll find that it is more tailored to extracting
business-centric information such as company names, people's names,
places, addresses, telephone numbers, quotes, hot topics, events,
commercial products etc., rather than generic terms. The document viewer
I linked you to, however, did not display the full RDF info returned,
which can be pretty impressive - often you can run an article through it
and it'll be able to tell you that person x said such-and-such on date y
at place z.
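To give you a feel for it, here's a minimal sketch of the call from
PHP - the endpoint, parameter names and paramsXML shape are from my
memory of the REST docs, so treat them as assumptions and check the
current documentation before relying on them:

<?php
// minimal sketch of an Open Calais request - endpoint, parameter
// names and paramsXML format are assumptions from memory of the docs
$articleText = 'Prime Minister Gordon Brown said on Monday in London...';

$paramsXML = '<c:params xmlns:c="http://s.opencalais.com/1/pred/">'
           . '<c:processingDirectives c:contentType="text/txt"'
           . ' c:outputFormat="xml/rdf"/></c:params>';

$ch = curl_init('http://api.opencalais.com/enlighten/rest/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'licenseID' => 'YOUR_API_KEY', // placeholder - use your own key
    'content'   => $articleText,
    'paramsXML' => $paramsXML,
)));
$rdf = curl_exec($ch); // RDF/XML naming people, places, events, quotes...
curl_close($ch);
?>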
Then consider Yahoo Term Extraction, which extracts generic and common
terms (keywords, key phrases) from bodies of text.
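The call itself is equally trivial - a rough sketch, assuming the V1
endpoint I remember (verify it, the service has moved around over the
years):

<?php
// rough sketch of a Yahoo Term Extraction request - the V1 endpoint
// below is from memory, so double-check it
$articleText = 'Your blog post body goes here...';

$ch = curl_init('http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'appid'   => 'YOUR_APP_ID', // placeholder
    'context' => $articleText,
)));
$response = curl_exec($ch);
curl_close($ch);

// the response is a simple XML ResultSet; grab each <Result> term
preg_match_all('#<Result>(.*?)</Result>#s', $response, $matches);
$terms = $matches[1]; // e.g. "web standards", "open calais"
?>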
Now let's say you created a simple blog without any categories or tags or
anything. You could simply run all your content through Open Calais
and Yahoo Term Extraction and use the returned values to correlate
related articles and automatically tag all your posts. You could also
preg_replace X% of the semantic extracts and terms found in each article
into links to other articles on the site.
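The correlation step really is as dumb as it sounds - a minimal sketch,
assuming you've already stored each article's extracted tags:

<?php
// sketch: rank other articles by how many extracted tags they share
// $articles is assumed to be array(articleId => array of tag strings)
function related_articles($id, array $articles, $limit = 5)
{
    $scores = array();
    foreach ($articles as $otherId => $tags) {
        if ($otherId == $id) {
            continue;
        }
        $overlap = count(array_intersect($articles[$id], $tags));
        if ($overlap > 0) {
            $scores[$otherId] = $overlap;
        }
    }
    arsort($scores); // best matches first
    return array_slice(array_keys($scores), 0, $limit);
}
?>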
Further, you can re-inject the terms back in to the content in cunning
but useful ways and further optimise your output. Couple this with the
output of the tags on the page and the titles of the related articles
and, well, you end up with what is effectively a perfectly optimised
page and an auto-associating, context-aware website.
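The re-injection can be as simple as one preg_replace per term - a
sketch, where the term-to-URL map is whatever your own extraction and
correlation steps produced:

<?php
// sketch: link the first occurrence of each known term to a related
// article; $termToUrl comes from your own extraction/correlation step
function link_terms($html, array $termToUrl)
{
    foreach ($termToUrl as $term => $url) {
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        $html = preg_replace($pattern,
                             '<a href="' . $url . '">$0</a>',
                             $html, 1); // first occurrence only
    }
    return $html;
}

echo link_terms('More on web standards soon.',
                array('web standards' => '/articles/web-standards'));
?>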
I built a few versions of systems to do this in the IM sector a
couple of years ago - well, I spent 2005-2008 doing pretty much
exclusively this, seeing how far one could take it, and the results
were rather astonishing.
Here are some rough details on implementations I made:
Related affiliate products:
One spider crawled affiliate sites such as ClickBank, then went on to
the "landing page" of the affiliate offer, extracted the main content,
split it into chunks (anchor text, paragraphs, h* tags, titles etc.),
then ran each through the analysers. This created a database of chunks
of text and well-written link text that was associated with a tonne of
semantic data.
Then another spider ran over all "our" websites' content, did the same
thing, then linked up the semantics and injected affiliate links as
titles, URLs, links on images etc. (even the images were auto-found
using the data). Now this created "real" links, like a human would make,
where half a sentence or more in the correct place was linked, and it
looked like a human had done it - not just a half-related advert in a
block at the side of the page.
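The matching itself was nothing exotic - a sketch of the idea, with a
hypothetical structure for the stored chunks:

<?php
// sketch: pick the stored affiliate link text whose semantic tags
// best overlap the page being processed; the chunk structure here
// is hypothetical
function best_affiliate_chunk(array $pageTags, array $chunks)
{
    $best = null;
    $bestScore = 0;
    foreach ($chunks as $chunk) { // each: array('anchor', 'url', 'tags')
        $score = count(array_intersect($pageTags, $chunk['tags']));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $chunk;
        }
    }
    return $best; // null if nothing overlapped at all
}
?>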
Another (rather naughty, but legal) one:
This was to take the Six Apart Atom stream, which publishes every post
(written on TypePad, LiveJournal, Vox and a tonne of other sites) the
instant it is posted - parse this stream in realtime, extract the
article, run it through the semantic analysers and a couple of other
tools, "enhance" with SEO and ads, then publish, ping and post - all
within a split second. This was very, very funny, because the site(s)
were publishing the content before the bots could get to the original
source, so they were considered to be the original source of all this
content - and Six Apart's terms completely allowed it. To further
enhance it, I had several hundred blogs, each with set "topics" and
terms, so the content got published on a site which was about whatever
the user had written about. Each site that launched was bringing in
circa 1,500 unique visitors a day within 3 weeks, and peaked around
3,000.
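For the curious, the reading side was roughly this shape - a toy
sketch; the host and path are from memory and the stream is long gone,
and process_post() is a hypothetical stand-in for the whole analysis
and publish pipeline:

<?php
// toy sketch of reading the (now dead) Six Apart update stream;
// host/path are from memory, process_post() is a stand-in for the
// semantic analysis + publish pipeline
$fp = fsockopen('updates.sixapart.com', 80, $errno, $errstr, 30);
if (!$fp) {
    die("connect failed: $errstr");
}
fwrite($fp, "GET /atom-stream.xml HTTP/1.0\r\n"
          . "Host: updates.sixapart.com\r\n\r\n");

$buffer = '';
while (!feof($fp)) {
    $buffer .= fread($fp, 8192);
    // the stream is an endless series of <entry>...</entry> blocks
    while (preg_match('#<entry[\s>].*?</entry>#s', $buffer, $m)) {
        $entry = @simplexml_load_string($m[0]);
        if ($entry !== false) {
            process_post((string) $entry->title, (string) $entry->content);
        }
        $buffer = substr($buffer, strpos($buffer, $m[0]) + strlen($m[0]));
    }
}
fclose($fp);
?>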
I'll stop now, as I have work to do and I'm going on a bit - there is a
whole lot you can do with this technology. One thing I didn't touch on
is that you can embed semantic RDF data in to XHTML pages to provide
context to your text, and this does make a big difference.
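For example, something along these lines - RDFa-style attributes with
the Dublin Core vocabulary, which is just one way of doing it:

<?php
// sketch: emitting XHTML with RDFa-style attributes so crawlers get
// explicit semantics; Dublin Core here is just an example vocabulary
echo <<<XHTML
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h2 property="dc:title">Semantics and SEO</h2>
  <p>Written by <span property="dc:creator">Nathan Rixham</span>
     on <span property="dc:date">2009-05-17</span>.</p>
</div>
XHTML;
?>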
Anyhow - just some random food for thought. If you want any "real" info
rather than a ramble, just let me know and I can sort you out some very
interesting links and give you some source to play with. I've actually
got the 3-year system sitting doing nothing - it ate all my savings
making it and I had to get back to paid work; I never completed it 100%
(but it is production-runnable, there was just more I wanted to do to
it) - might have to get back on with it at some point soon!
How are the childhood memories?
regards!
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php