Searching again...

Eric H. Christensen sparks at fedoraproject.org
Thu Feb 2 22:14:47 UTC 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Feb 02, 2012 at 08:44:18PM +0000, Robert 'Bob' Jensen wrote:
> 
> ----- "Kevin Fenzi" <kevin at scrye.com> wrote:
> 
> > So, I got to looking at search engines again the other day. In
> > particular the horrible horrible mediawiki one we are using on the
> > wiki. 
> > 
> > This pointed me to sphinx. 
> > 
> > - There is a mediawiki sphinx plugin. (needs packaging)
> > - sphinx is c++ and already packaged. 
> > - sphinx uses mysql directly to index the database contents. 
> > - You can pass other data into it via an xml format. This could be a
> >   pain for any non wiki setups. 
> > 
> > It was noted that the new tagger application uses xapian as its
> > search engine.
> > 
> > - xapian is also c++
> > - xapian has a web crawler/indexer (omega) that could index our other
> >   stuff more easily than sphinx. 
> > - There's no mediawiki plugin for xapian, but we could point the wiki
> >   search box to a site wide search using xapian. 
> > 
> > So, there's tradeoffs either way. 
> > 
> > Would anyone care to lead an effort to test these two? 
> > xapian would probably be easy to test from anywhere. 
> > sphinx might require some access to our mediawiki database, but you
> > could also just setup a new mediawiki, the plugin and sphinx and see
> > how it works there. 
> > 
> > If no one steps up I can look at doing it next week. ;) 
> > 
> 
> My concern has always been the wiki content search being horrible as Kevin also mentioned. For me sphinx sounds like the best tool for that job out of the box from the description provided. I have a couple concerns that we need to be sure to test with xapian being a crawler. 
> 
> - Will this work for pages on the wiki that are already hard to find because they are not linked to from anywhere? 
> - Are we sure it will work on docs.fp.o and its JavaScript navigation menu?
> 
> I am willing to help out testing if another can take the lead on it. 
> 
> -- Bob

When we were discussing sphinx the other day I seem to remember something about it being able to read DocBook (or am I just misremembering the entire conversation?).  That could be interesting for docs.fp.o.  Docs.fp.o has a fallback mode for the JavaScript, with a document index of sorts that could be helpful for crawling.

The web crawling functionality sounds interesting but, like Bob noted, if wiki pages aren't linked then they may never be found.  Do we know exactly what the mw plugin does in sphinx?

I like the idea of having a site-wide search feature as you don't know if the answer you seek is in a document, the wiki, or a webpage.

Depending on how badly my day job keeps me moving this weekend, I could possibly test one or the other.  I think I'd like to look at xapian just to see how well it indexes the wiki.

- --Eric

- --------------------------------------------------
Eric H Christensen        eric at christensenplace.us
"Sparks"                  sparks at fedoraproject.org
   .... .. .-.. .-.. ---   .-- --- .-. .-.. -..
097C 82C3 52DF C64A 50C2  E3A3 8076 ABDE 024B B3D1
- --------------------------------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQIcBAEBAgAGBQJPKwrXAAoJEIB2q94CS7PRnXsQAOFNvAjmzeizDwqDMNNfVuo2
NVj1Z7YFk21/qXOmkllR03zkqFaPdbTVOIrfoJNF5wiK0CNi1quJZqTx+WsyUE9/
1faVuoBXqhvbyBoC/MY7zb+++JjJ3E3KhIhU13pcoL1+Kch/WiyQncdnBDFUyz9M
4nUUK7DzUnKHHEivLkUpWt1EMOfswmsZy9NUNWuUsAOB/w6ytVCLHD/0UyPJmmUT
hwKMyvWtAvRkYRWSempvSKzgFabfV0OFOPTFS+rbySXDnKlTPPpgxTsOsbAIaZiI
rb5p+9BeLGCTC47ytOjdBQ7sZHHymiNzF5am9rCs5fY7GGi1VBLFOWmQmnNRTGno
h7bC1QujCsFn3larYxCbTac/kjOnlhWi9SqkS/sTo3lYp/ysO5Gbp8vULJi0/xrl
mwI4+5DChcMW19qTi1Q+jG/jZ7yhgbVcQ/DcEbX4r666kGDUWtSgvSDtP8UuaXrn
BDcLDapqJW6KHUlEMhNspCMk2vYNaMmjIXGQ7/ysSWpCdvSTIYxrKtidZgehqsFS
RqizqQgfcOm7KGV4VjTH5ReQbsBxoy1euHMQSwlN/3B1usY7T2yDS+BdMX4ZVVjp
Sop6r6Js/TSUdTLZ7d/Fi0zO8nhumjjJk8P2VpjOHBP3UdfAKbFmfSIeBnsexcUX
q/oYxMdGu3DBSzWmrfvT
=fHr2
-----END PGP SIGNATURE-----

