On Wed, 2006-04-12 at 05:47 -0700, Mike Stankovic wrote:
> --- Paul <subsolar@xxxxxxxxxxxx> wrote:
>
> > On Tue, 2006-04-11 at 06:55 -0700, Mike Stankovic wrote:
> > > I've got about 10,000 docs I'd like to devise a search/index for.
> > > I found a perl script called Perlfect that can do that on an old
> > > P3 but at the astronomical time of 7 hours. Another script
> > > (cgi/perl) at hotscripts can do the same but allows the
> > > "rm -rf /" exploit. DoH!?
> > >
> > > Is there anything perl/flatfile that can search/index faster?
> > > This is a nice job for an aging P3 in the corner so php/MySQL is
> > > not an option. Don't suggest beagle/windows solutions as this is
> > > a CentOS 4.3 system.
> >
> > Well at work we have an archive of ~ 12K PDFs that engineering
> > uses for process documentation and I use Swish-e
> > (http://swish-e.org/) to index it so that they can search it. The
> > server it sits on is a PIII 733 with 512MB RAM and it takes about
> > 90 minutes to re-index them every night.
> >
> > It works well for us as it allows AND & OR operators, searches for
> > phrases and other fairly advanced features.
> >
> > The main limitation is that you need a filter to convert whatever
> > the document is to one of the following: text, html or xml so it
> > can be indexed.
> >
> > Regards,
> > Paul Berger
> >
> > > __________________________________________________
> > > Improve the mailing list by performing a simple search
> > > before posting and reading the faq/etiquette.
> > > Thank you!!
> > >
> > > _______________________________________________
> > > CentOS mailing list
> > > CentOS@xxxxxxxxxx
> > > http://lists.centos.org/mailman/listinfo/centos
>
> Yes Swish-e is in dag's repo and appears to be supported upstream
> very well. I was right about htsearch; it is one of the components
> of htdig (also available in rpm format).
>
> Does it have issues with charsets that are not Latin-1 (ISO-8859-1)
> or plain 7bit ASCII ?

I don't know off hand ... I found the following in the Swish-e FAQ...

http://swish-e.org/devel/devel_docs/swish-faq.html

How do I index non-English words?

Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
character set, and includes many non-English letters (and symbols). As
long as they are listed in WordCharacters they will be indexed.

Actually, you probably can index any 8-bit character set, as long as
you don't mix character sets in the same index and don't use libxml2
for parsing (see below).

The TranslateCharacters directive (SWISH-CONFIG) can translate
characters while indexing and searching. You may specify the mapping
of one character to another character with the TranslateCharacters
directive.

TranslateCharacters :ascii7: is a predefined set of characters that
will translate eight-bit characters to ascii7 characters. Using the
:ascii7: rule will, for example, translate "Ääç" to "aac". This means:
searching "Çelik", "çelik" or "celik" will all match the same word.

Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
Latin-1 when indexing. In cases where a string cannot be converted
from UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters),
the string will be sent to Swish-e in UTF-8 encoding. This will result
in some words being indexed incorrectly. Setting ParserWarningLevel to
1 or more will display warnings when UTF-8 to 8859-1 conversion fails.
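
To make the FAQ passage quoted above a bit more concrete, here is a rough Python sketch of the two ideas it describes: checking whether text fits in ISO 8859-1 at all, and folding accented letters down to plain ASCII so that variants of a word match. This is only an illustration and not Swish-e code. The function names are made up for this example, and the folding relies on Unicode decomposition, which is an assumption on my part and only approximates whatever table the :ascii7: rule actually uses.

```python
import unicodedata


def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1.

    Hypothetical helper: a quick way to spot documents that would fail
    the UTF-8 to Latin-1 conversion the FAQ mentions.
    """
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def fold_to_ascii(word: str) -> str:
    """Approximate an :ascii7:-style fold (hypothetical helper).

    Decomposes each character into base letter plus combining marks,
    drops the marks, then lower-cases. Letters with no decomposition
    (for example "ß" or "ø") pass through unchanged, so this is not a
    complete 8-bit to 7-bit mapping.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


if __name__ == "__main__":
    for sample in ("Çelik", "çelik", "celik"):
        print(sample, "->", fold_to_ascii(sample))
    # U+0141 (L with stroke) is outside ISO 8859-1.
    print(fits_latin1("Çelik"), fits_latin1("\u0141\u00f3d\u017a"))
```

Running something like fits_latin1 over the extracted text of each document before indexing would give a rough idea of how many files fall outside Latin-1 and are therefore likely to be indexed incorrectly.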