I have mostly been lurking on these lists over the past year and I have
learned a lot from the posts, so thanks very much to the free Java
contributors for your part in the MKSearch project. A formal
announcement follows, but I thought members of these lists (Mark
Wielaard especially) may also be interested in some screen shots of our
beta search engine running on Fedora Core 4 with Tomcat 5 in Firefox.

https://svn.mkdoc.com/mksearch/doc/design/screenshots/

Best regards,

Phil Shaw


MKSearch beta 1 release announcement

MKDoc Ltd. would like to announce the first beta release of MKSearch,
under the GNU General Public Licence. Source and pre-compiled binary
downloads are available from the project Web site.

http://www.mksearch.mkdoc.org/downloads/

MKSearch is a metadata search engine that indexes structured metadata
in Web documents, not free text in the document body. The data
acquisition system:

 * Conforms to the Dublin Core metadata in HTML recommendations [1]
 * Supports other application profiles, such as the UK e-Government
   Metadata Standard [2]
 * Indexes native RDF formats, including RSS 1.0

The MKSearch system has five major components:

1. A Web crawler based on JSpider [3]
   * Multi-threaded processing
   * Per-site throttle, user agent, depth and linking rules
   * Respects the robots.txt exclusion policy
   * Extensible plug-in based content handling

2. An HTML document validator and formatter based on JTidy [4]
   * Cleans up and corrects HTML syntax errors
   * Converts HTML to XHTML

3. A set of custom indexers based on the Simple API for XML (SAX)
   * Extracts metadata from HTML meta and link elements
   * Converts metadata to RDF triple statements
   * Configurable application profiles

4. An RDF storage and query system based on Sesame [5]
   * XML/RDF file-based storage
   * Database storage using PostgreSQL or MySQL
   * Sophisticated Sesame RDF Query Language (SeRQL) queries
   * Scope for more semantically rich queries with inferencing

5. A public query interface, provided through a standard servlet
   container
   * Simple, expandable query builder form
   * Configurable application profile-based presentation
   * Wildcard query handling
   * Phrase searches
   * Paged HTML results
   * Standing RSS results

The two main elements of the MKSearch system can be used independently.
The data acquisition system can be used to gather large quantities of
metadata from the Web and store it as RDF. The query system can be used
to provide a typical search engine-style interface to existing RDF
content.

The MKSearch beta 1 distribution includes sample configurations that
crawl a Web site and create:

 * A mirror of the site on the local file system in valid XHTML
 * An RDF N-Triple record for each page on the local file system
 * UK e-Government metadata in a Sesame file-based repository (XML/RDF)

This distribution also includes a demonstration of the MKSearch query
interface, in the form of a Web Application Archive (WAR) that can be
deployed directly to an existing servlet container. The sample search
content is from an index of the MKSearch project Web site on 2 November
2005. See the site documentation below:

http://www.mksearch.mkdoc.org/documentation/tomcat-on-fc4/
http://www.mksearch.mkdoc.org/howto/
http://www.mksearch.mkdoc.org/plans/beta-1-release-tasks/mksearch-beta-1-release-notes/

System requirements and licence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MKSearch is written in the Java programming language and is designed to
run on any platform that supports a Java environment equivalent to the
Sun Java 2 language specification. The system has specifically been
designed, developed and tested to run on GNU/Linux systems using the
GNU Compiler for Java (GCJ) [6] and Apache Tomcat 5 servlet container,
as available on Fedora Core 4 [7]. This provision means that MKSearch
can be built and run on software systems that are entirely open source
and free from proprietary licensing.
The system has been tested extensively using the Sun Java SDK 1.5 on
Microsoft Windows 2000. JUnit test suites for the MKSearch code base
cover 99% of all code branches.

If you have any comments or questions about the MKSearch system, please
join us on the project mailing list.

http://www.email-lists.org/mailman/listinfo/mksearch-dev

References
~~~~~~~~~~

[1] http://dublincore.org/documents/2003/11/30/dcq-html/
[2] http://www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=805
[3] http://j-spider.sourceforge.net/
[4] http://jtidy.sourceforge.net/
[5] http://www.openrdf.org/
[6] http://gcc.gnu.org/java/
[7] http://fedora.redhat.com/

-- 
MKSearch (beta) http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.