alex at declera.com
Sat Nov 2 04:57:54 UTC 2013
On 02.11.2013 02:32, Michael Cronenworth wrote:
> This will be my last mailing on this topic as I will not contribute or
> use this feature in Fedora, but this reply warranted clarification.
> On 11/01/2013 06:14 PM, Alek Paunov wrote:
>> Another simple answer: CSE is a low quality search - no facets, no (real)
>> content age restriction. The same is valid also for every other
>> service/application which is solely based on generic web pages crawling.
> CSE is as full blown as a Google Appliance. More advanced than anything
> you can write in Perl/Python/Ruby in a month. Site restrictions, keyword
> restrictions, (real) age restrictions, autocomplete help, synonyms,
> image search, all of which are provided through a XML API.
Indeed. Don't get me wrong - I like the CSE service for what it is good at.
It seems I was not clear enough in my English - sorry!
Nobody can write a good, modern index in a month - lucene/solr,
xapian, etc. have all evolved over many years. Our task is a proper
deployment of one of them, or a combination of them, not inventing a new one.
Why e.g. solr instead of CSE or dpsearch (which is opensource, and also
mentioned in the old tickets)?
Granularity: With CSE/dpsearch the indexed content unit is a crawled and
automatically processed Web document (I say Web document instead of HTML
page, because CSE handles many types). Not a single BZ comment. Not a change
comment in a spec file. Not a Git commit. Or in the reverse direction:
an email, not a thread (because we do not yet have an archive page
displaying the whole thread). I.e. there is no concept of documents and
subdocuments (to which most of our content belongs).
Attributes: You cannot attach custom scalar/category attributes (the
basis of faceted search) to the FTS indexed units.
Please correct me if I am wrong about CSE in any of the above.
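
To make the granularity and attribute points a bit more concrete, here is a
minimal Python sketch. It is not Solr, CSE or any real API - just plain data
structures, and every name and sample value in it (Unit, search,
facet_counts, the bug and mail contents, the authors) is hypothetical. It
only shows what "documents with subdocuments" plus "named attributes" buy
us: hits at the level of a single comment or mail, and facet counts over
them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One indexed unit: a whole document or one of its sub-documents."""
    unit_id: str
    text: str
    attrs: dict = field(default_factory=dict)      # scalar/category attributes
    children: list = field(default_factory=list)   # sub-documents


def walk(unit):
    """Yield the unit itself, then all of its sub-documents, depth first."""
    yield unit
    for child in unit.children:
        yield from walk(child)


def search(root_units, keyword, **filters):
    """Return units whose text contains keyword and whose attrs match filters."""
    hits = []
    for root in root_units:
        for unit in walk(root):
            if keyword.lower() not in unit.text.lower():
                continue
            if all(unit.attrs.get(k) == v for k, v in filters.items()):
                hits.append(unit)
    return hits


def facet_counts(units, attr):
    """Count hits per value of one attribute (a facet)."""
    return Counter(u.attrs[attr] for u in units if attr in u.attrs)


# Hypothetical sample data: a bug with two comments, a thread with one mail.
bug = Unit(
    "bug-1", "Installer crashes on custom partitioning",
    attrs={"source": "bugzilla", "component": "anaconda"},
    children=[
        Unit("bug-1/c1", "Crash reproduced with LVM layout",
             attrs={"source": "bugzilla", "author": "alice", "year": 2013}),
        Unit("bug-1/c2", "Fixed in the next build, no crash anymore",
             attrs={"source": "bugzilla", "author": "bob", "year": 2013}),
    ],
)
thread = Unit(
    "mail-1", "Search for Fedora sites",
    attrs={"source": "mail"},
    children=[
        Unit("mail-1/m1", "A crash report is not a Web page",
             attrs={"source": "mail", "author": "alice", "year": 2013}),
    ],
)

hits = search([bug, thread], "crash")                  # comment-level hits
by_source = facet_counts(hits, "source")               # facet: data source
by_author = facet_counts(hits, "author")               # facet: author
alice_only = search([bug, thread], "crash", author="alice")
```

A page-level crawler would see at most one bug page and one mail page
here; with sub-documents and attributes the same query can be narrowed by
source, author or year.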
Fedora has data sources (bugs, wikis, mails, packages, docs, etc.), not
just sitemaps/pages, and they all talk about the same things (common topic
hierarchies, common tag hierarchies, common authors). They form a highly
interlinked virtual knowledge base.
We should start indexing the sources in their native structure now, to be
able to upgrade some happy day to full-blown semantic search (when
available), which is actually what we badly need.
>> In our case, we are the owners of the content, we know how it is
>> structured, we know where are the feeds with the pure content changes,
>> we can explicitly feed the indexes with all named attributes of the
>> content nodes and later use them.
> But you don't know how other people on the web find and link to Fedora
> pages to provide accurate page ranking.
Personas: 1. Active Fedora contributor, 2. Fedora contributor, 3. Power
Fedora user/sysadmin, 4. Fedora user, 5. Potential Fedora user, 6. IT
IMHO, at least for 1-3, ordering the results by recursive link-rank
valuation (Google page ranking) is more of a problem than an advantage.
For 4 (also important) the relevant sets are probably: the docs, part of
the wiki, ask.fp.o and maybe users at . I don't know - the top results by
stackoverflow's own 'relevance' for a set of keywords are not always the
same as Google's with site:stackoverflow.com in the query ...
For 5-6 Google page ranking is probably the best, but they will use
Google instead of search.fp.o anyway (at least initially; later their
more concrete queries would be more like those of 3-4).