Fedora search

Sat Nov 2 04:57:54 UTC 2013

On 02.11.2013 02:32, Michael Cronenworth wrote:
> This will be my last mailing on this topic as I will not contribute or
> use this feature in Fedora, but this reply warranted clarification.
>
> On 11/01/2013 06:14 PM, Alek Paunov wrote:
>> Another simple answer: CSE is a low quality search - no facets, no (real)
>> content age restriction. The same is valid also for every other
>> service/application which is solely based on generic web pages crawling.
>
> CSE is as full blown as a Google Appliance. More advanced than anything
> you can write in Perl/Python/Ruby in a month. Site restrictions, keyword
> restrictions, (real) age restrictions, autocomplete help, synonyms,
> image search, all of which are provided through a XML API.[1]
>

Indeed. Don't get me wrong - I like the CSE service for what it is good
for. It seems that I was not clear enough in my English - sorry!

Nobody is able to write a good, modern index in a month - lucene/solr,
xapian, etc. have all evolved over many long years. Our task is a proper
deployment of one of them, or a combination, not inventing a new one.

Why e.g. solr instead of CSE or dpsearch (which is open source, and is
also mentioned in the old tickets)?

Granularity: With CSE/dpsearch the indexed content unit is a crawled and
automatically processed Web document (I say Web document instead of HTML
page, because CSE handles many types). Not a single BZ comment. Not a
change comment in a spec file. Not a Git commit. Or in the reverse
direction: an email, not a thread (because we do not yet have an archive
page displaying the whole thread). I.e. there is no concept of documents
and subdocuments (to which most of our content belongs).
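
To illustrate the granularity point, here is a minimal Python sketch of
indexing a bug as a parent document plus one subdocument per comment.
All names (`Unit`, `flatten_bug`, the `bz-1000` id) are hypothetical and
invented for this example; this is not Solr's or CSE's API, and the
substring match only stands in for a real full-text index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """One indexable unit: a whole document or a subdocument."""
    unit_id: str
    text: str
    parent_id: Optional[str] = None  # None for a top-level document


def flatten_bug(bug_id: str, summary: str, comments: List[str]) -> List[Unit]:
    """Turn one bug into a parent unit plus one unit per comment."""
    units = [Unit(unit_id=bug_id, text=summary)]
    for n, body in enumerate(comments, start=1):
        units.append(Unit(unit_id=f"{bug_id}#c{n}", text=body, parent_id=bug_id))
    return units


def search(units: List[Unit], word: str) -> List[str]:
    """Naive case-insensitive substring match over individual units."""
    word = word.lower()
    return [u.unit_id for u in units if word in u.text.lower()]


# Hypothetical sample data.
units = flatten_bug(
    "bz-1000",
    "Installer crashes on boot",
    ["Happens with LVM only", "Fixed in the updated build"],
)
print(search(units, "lvm"))
```

A hit then points at the single comment, and its `parent_id` still
leads back to the whole bug, which a page-level crawl cannot give us.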

Attributes: You cannot attach custom scalar/category attributes (the
basis of faceted search) to the FTS indexed units.
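
As a rough illustration of why the attributes matter, a Python sketch of
faceted search over units that carry custom attributes. The sample
units, the `source` and `year` attribute names and the `faceted_search`
helper are all assumptions made up for this example, not an existing
API; a real deployment would use the facet support of the chosen index.

```python
from collections import Counter

# Hypothetical indexed units, each carrying custom attributes.
units = [
    {"id": "bz-1#c1", "text": "yum fails to resolve deps",
     "source": "bugzilla", "year": 2013},
    {"id": "wiki-7", "text": "How to debug yum",
     "source": "wiki", "year": 2012},
    {"id": "ml-42", "text": "yum plugin question",
     "source": "mail", "year": 2013},
    {"id": "bz-2#c3", "text": "kernel oops on resume",
     "source": "bugzilla", "year": 2013},
]


def faceted_search(units, word, facet, **filters):
    """Return matching units and the hit count per value of `facet`.

    `filters` are exact-match restrictions on attributes, e.g. year=2013.
    """
    word = word.lower()
    hits = [
        u for u in units
        if word in u["text"].lower()
        and all(u.get(key) == value for key, value in filters.items())
    ]
    return hits, Counter(u[facet] for u in hits)


hits, counts = faceted_search(units, "yum", "source", year=2013)
print([u["id"] for u in hits])
print(dict(counts))
```

The per-source counts are what the user would click on to narrow the
results; without attributes attached to the units there is nothing to
count.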

Please correct me if I am wrong about CSE with some of the above.

Fedora has data sources (bugs, wikis, mails, packages, docs, etc.), not
just sitemaps/pages, and they all talk about the same things (common
topic hierarchies, common tag hierarchies, common authors). They form a
highly interlinked virtual knowledge base.

We should start indexing the sources in their native structure now, to
be able to upgrade some happy day to full-blown semantic search (when
available), which is actually what we badly need.

>> In our case, we are the owners of the content, we know how it is
>> structured, we
>> know where are the feeds with the pure content changes, we can
>> explicitly feed
>> the indexes with all named attributes of the content nodes and later
>> use them.
>
> But you don't know how other people on the web find and link to Fedora
> pages to provide accurate page ranking.
>

Personas: 1. Active Fedora contributor, 2. Fedora contributor, 3. Power 
Fedora user/sysadmin, 4. Fedora user, 5. Potential Fedora user, 6. IT 
journalist.

IMHO, at least for 1-3, ordering the results by recursive link-rank
valuation (Google page ranking) is more of an issue than an advantage.

For 4 (also important) the relevant sets are probably: the docs, part of
the wiki, ask.fp.o and maybe users at . I don't know - the top results
of stackoverflow's own 'relevance' ordering for a set of keywords are
not always the same as Google's with site:stackoverflow.com in the
query ...

For 5-6 Google page ranking is probably the best, but they will use
Google instead of search.fp.o anyway (at least initially; later their
more concrete queries would be more like those of 3-4).

Kind Regards,
Alek