Re: Fedora search

On 02.11.2013 02:32, Michael Cronenworth wrote:
This will be my last mailing on this topic as I will not contribute or
use this feature in Fedora, but this reply warranted clarification.

On 11/01/2013 06:14 PM, Alek Paunov wrote:
Another simple answer: CSE is a low quality search - no facets, no (real)
content age restriction. The same is valid also for every other
service/application which is solely based on generic web pages crawling.

CSE is as full blown as a Google Appliance. More advanced than anything
you can write in Perl/Python/Ruby in a month. Site restrictions, keyword
restrictions, (real) age restrictions, autocomplete help, synonyms,
image search, all of which are provided through a XML API.[1]


Indeed. Don't get me wrong - I like the CSE service for what it is good at. It seems I was not clear enough in my English - sorry!

Nobody is able to write a good, modern index in a month - lucene/solr, xapian, etc. have all evolved over many years. Our task is the proper deployment of one of them, or a combination of them, not inventing a new one.

Why e.g. solr instead of CSE or dpsearch (which is open source and is also mentioned in the old tickets)?

Granularity: With CSE/dpsearch the indexed content unit is a crawled and automatically processed Web document (I say Web document rather than HTML page because CSE handles many types). It is not a single BZ comment, not a change comment in a spec file, not a Git commit. Or, in the reverse direction: it is an email, not a thread (because we do not yet have an archive page that displays the whole thread). I.e. there is no concept of a document and its subdocuments (which is where most of our content belongs).

Attributes: You cannot attach custom scalar/category attributes (the basis of faceted search) to the FTS-indexed units.

Please correct me if I am wrong about CSE on any of the above.
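To make the granularity and attribute points concrete, here is a minimal sketch in plain Python. It is not solr, xapian or CSE code; the `Unit` class, the `search` and `facet_counts` helpers, and the sample data are all hypothetical and exist only for illustration. It shows subdocuments (BZ comments) indexed as their own units with a link to the parent document, plus named attributes that can be filtered and counted as facets.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Unit:
    """One indexed unit: a bug, a bug comment, a Git commit, an email, ..."""
    unit_id: str
    kind: str                                   # e.g. "bug", "bug_comment"
    text: str                                   # full-text searchable body
    attrs: dict = field(default_factory=dict)   # scalar/category attributes
    parent_id: Optional[str] = None             # subdocument -> document link


def search(units, term, **filters):
    """Naive substring match on text, restricted by exact attribute values."""
    term = term.lower()
    return [
        u for u in units
        if term in u.text.lower()
        and all(u.attrs.get(k) == v for k, v in filters.items())
    ]


def facet_counts(hits, attr):
    """Count hits per value of one attribute (a facet)."""
    return Counter(u.attrs[attr] for u in hits if attr in u.attrs)


# Hypothetical sample data, not real Fedora content.
UNITS = [
    Unit("bug-1", "bug", "Installer crashes on boot",
         {"component": "anaconda", "status": "NEW"}),
    Unit("bug-1#c1", "bug_comment", "Crash reproduced with the latest image",
         {"component": "anaconda", "author": "alice"}, parent_id="bug-1"),
    Unit("bug-1#c2", "bug_comment", "Cannot reproduce the crash here",
         {"component": "anaconda", "author": "bob"}, parent_id="bug-1"),
    Unit("commit-1", "git_commit", "Fix crash in spec file scriptlet",
         {"component": "kernel", "author": "alice"}),
]

hits = search(UNITS, "crash")
by_component = facet_counts(hits, "component")
by_author = facet_counts(hits, "author")
alice_hits = search(UNITS, "crash", author="alice")
```

With a page-level index, all three `bug-1` units would collapse into one crawled page, and neither the `author` filter nor the per-component counts would be available.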

Fedora has data sources (bugs, wikis, mails, packages, docs, etc.), not just sitemaps/pages, and they all talk about the same things (common topic hierarchies, common tag hierarchies, common authors). Together they form a highly interlinked virtual knowledge base.

We should start indexing the sources in their native structure now, so that some happy day we can upgrade to full-blown semantic search (when available), which is actually what we badly need.

In our case, we are the owners of the content, we know how it is
structured, we know where are the feeds with the pure content changes,
we can explicitly feed the indexes with all named attributes of the
content nodes and later use them.

But you don't know how other people on the web find and link to Fedora
pages to provide accurate page ranking.


Personas: 1. Active Fedora contributor, 2. Fedora contributor, 3. Power Fedora user/sysadmin, 4. Fedora user, 5. Potential Fedora user, 6. IT journalist.

IMHO, at least for 1-3, ordering the results by recursive link-rank valuation (Google page ranking) is more of a problem than an advantage.

For 4 (also important) the relevant sets are probably the docs, part of the wiki, ask.fp.o and maybe users@. I don't know - the top results by stackoverflow's own 'relevance' for a set of keywords are not always the same as Google's with site:stackoverflow.com in the query ...

For 5-6 Google page ranking is probably the best, but they will use Google instead of search.fp.o anyway (at least initially; later, their more concrete queries would look more like those of 3-4).

Kind Regards,
Alek

--
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct




