Code Search for Fedora

Mon Dec 1 19:16:22 UTC 2014

On Tue, 18 Nov 2014 13:00:22 -0800
Michael Stapelberg <michael+fedora at stapelberg.ch> wrote:

> Hey,
> 
> Recently I’ve been talking to Hannes (cc'ed) about whether Fedora
> would be interested in having the equivalent of
> http://codesearch.debian.net/¹
> 
> The project came to life as my Bachelor of Science Thesis² and aims to
> provide fast regular expression search over a big corpus, in this case
> 140 GB of source code of all software included in the Debian main
> distribution (as opposed to non-free or contrib, which we excluded
> because of licensing concerns). It is based on the work Russ Cox
> published, which in turn resembles the work he did on Google Code
> Search when he was an intern there in 2006.
> 
> So, what’s this discussion about?
> 
> What I’m offering is setting up/running a public version of Code
> Search for Fedora. It needs to be public because I want the open
> source community as a whole to profit from it, and also I’m told you
> have somewhat comparable tools internally anyway :).

Thanks for starting the conversation.

<snip>

> I feel like this email is long enough already, so I’ll just ask a
> general: what do you think? Do you need any more information? Please
> just ask, and keep me CC'ed, since I’m not subscribed to this list.

I'm a little late to the discussion, but I think that code search
sounds like a cool idea if we can find the human/machine resources to
do it. I've only glanced through all the docs so far but I have a couple
of concerns (some of which have already been raised). I hope this
doesn't sound like I'm completely against the idea, though - I wouldn't
have spent the time to go through your thesis and respond to the
discussion if that were the case.

Tim

Single points of (human) failure
--------------------------------

Kevin already brought this up, but I'm a little worried about supporting
a large, complex application like this with only one person familiar
with it and few people around who know the language its core is written
in. Speaking as someone who has been the single point of failure in an
application deployment before, I'd strongly suggest finding someone to
help. Finding out that something went down while you were on vacation
and that nobody else can fix it is not fun :)

Node Failure Behavior
---------------------

It's not clear to me from the docs I went through how node failure is
handled. I don't see any explicit mention of it, so I'm assuming that
each index shard exists as a single copy with no replica. How does the
system handle the failure of one of the index nodes? My first guess is
that results from that shard would simply be missing from queries, but
I haven't gotten into the actual code yet.
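
To make the question concrete, here is a rough sketch of the behavior I
would hope for: the frontend fans a query out to every shard, merges
whatever comes back, and reports which shards did not answer instead of
silently dropping their results. This is only an illustration in Python,
not a description of how DCS works; the names `query_shards` and
`dead_shard`, the shard names, and the result format are all made up.

```python
from concurrent.futures import ThreadPoolExecutor


def query_shards(shards, pattern):
    """Fan a query out to every shard and merge whatever comes back.

    `shards` maps a shard name to a callable that takes the pattern and
    returns a list of matches. A callable that raises stands in for a
    dead index node (hypothetical setup, for illustration only).
    """
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = {name: pool.submit(search, pattern)
                   for name, search in shards.items()}
        for name, future in futures.items():
            try:
                results.extend(future.result())
            except Exception:
                # The node is down: keep going, but remember which one.
                failed.append(name)
    return sorted(results), sorted(failed)


def dead_shard(pattern):
    raise ConnectionError("index node unreachable")


shards = {
    "shard-0": lambda p: ["pkg-a/main.c:12"],
    "shard-1": dead_shard,
    "shard-2": lambda p: ["pkg-c/util.c:7"],
}
matches, missing = query_shards(shards, "foo.*bar")
```

With `missing` available, the frontend could at least tell the user that
the result list is incomplete.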

Code resiliency
---------------

I think this concern is lessened a bit since the code has been running
in production, but it sounds like the base code for indexing was meant
somewhat as a "proof of concept" or for small-scale deployments. It
sounds like you've made quite a few enhancements on top of the released
Google codesearch but tried to leave the core code alone as much as
possible. Have you seen many problems in the index nodes for DCS?

Indexing
--------

Have there been any complaints/comments about your chosen update delta
of 3 days? You assert that 3 days is a good balance between indexing
load and keeping fresh code in the index but I don't see a
justification in your thesis. How did you come to the conclusion that 3
days was an optimal choice?

Is the indexing process automated or does it need to be kicked off by a
human? The way it's described in your thesis ("after verifying that no
human mistake was made by confirming that Debian Code Search still
delivers results ..."), it sounds somewhat manual.

Is there downtime when updating the inverted trigram index? If so, how
long is it, and does it happen on every re-index? It sounds like the
resources of the index nodes would be almost 100% utilized after
indexing, leaving nothing spare to handle the re-indexing load. Am I
misunderstanding something in the architecture?
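
For reference, the pattern I have in mind for avoiding downtime is to
build the new index next to the live one and then swap it in with an
atomic rename. The sketch below is an assumption about one possible
approach, not a claim about what DCS does; `publish_index`, the file
name `index.idx`, and the `build` callback are hypothetical. Note that
it briefly needs disk space for two copies of the index, which is
exactly the kind of headroom I am asking about.

```python
import os
import tempfile


def publish_index(index_dir, build):
    """Build a fresh index beside the live one, then swap it in."""
    live = os.path.join(index_dir, "index.idx")
    fd, tmp = tempfile.mkstemp(dir=index_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            build(out)              # slow part; the live file is untouched
            out.flush()
            os.fsync(out.fileno())
        # On POSIX, renaming within one filesystem is atomic, so readers
        # see either the old index or the new one, never a partial file.
        os.replace(tmp, live)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return live


index_dir = tempfile.mkdtemp()
publish_index(index_dir, lambda out: out.write(b"generation 1"))
live = publish_index(index_dir, lambda out: out.write(b"generation 2"))
with open(live, "rb") as f:
    current = f.read()
```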

Ranking
-------

Is the result ranking code compatible with non-Debian sources? We don't
have an equivalent to popcon, and I assume that the reverse dependency
factor would need different code for Fedora than for Debian. Or is this
already part of the modifications you were planning?
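
As an illustration of what "compatible" could mean here, the ranking
could take its per-package signals as plain inputs so that each
distribution supplies its own data. The sketch below is hypothetical:
the function `make_ranker`, the equal weights, the max-based
normalization, and the numbers are all assumptions of mine and are not
taken from your thesis.

```python
def make_ranker(popularity, reverse_deps, w_pop=0.5, w_rdep=0.5):
    """Return a function that scores a package in the range [0, 1].

    `popularity` and `reverse_deps` map a package name to a raw count.
    The distribution supplies them: popcon installs on Debian, some
    other usage figure on Fedora (which one is an open question).
    """
    max_pop = max(popularity.values(), default=0) or 1
    max_rdep = max(reverse_deps.values(), default=0) or 1

    def score(package):
        # Packages with no data score 0 on that signal.
        pop = popularity.get(package, 0) / max_pop
        rdep = reverse_deps.get(package, 0) / max_rdep
        return w_pop * pop + w_rdep * rdep

    return score


fedora_score = make_ranker(
    popularity={"glibc": 1000, "zsh": 250},   # made-up numbers
    reverse_deps={"glibc": 400, "zsh": 4},    # made-up numbers
)
ranked = sorted(["zsh", "glibc", "unknown-pkg"],
                key=fedora_score, reverse=True)
```

If the existing code already separates the signal sources from the
scoring like this, then only the data providers would need to change.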
