Code Search for Fedora

Michael Stapelberg michael+fedora at stapelberg.ch
Tue Nov 18 21:24:02 UTC 2014


Thanks for your quick reply!

On Tue, Nov 18, 2014 at 1:16 PM, Kevin Fenzi <kevin at scrye.com> wrote:
> On Tue, 18 Nov 2014 13:00:22 -0800
> Michael Stapelberg <michael+fedora at stapelberg.ch> wrote:
>
>> Hey,
>
> Greetings.
>
>> Recently I’ve been talking to Hannes (cc'ed) about whether Fedora
>> would be interested in having the equivalent of
>> http://codesearch.debian.net/¹
>>
>> The project came to life as my Bachelor of Science Thesis² and aims to
>> provide fast regular expression search over a big corpus, in this case
>> 140 GB of source code of all software included in the Debian main
>> distribution (as opposed to non-free or contrib, which we excluded
>> because of licensing concerns). It is based on the work Russ Cox
>> published, which in turn resembles the work he did on Google Code
>> Search when he was an intern there in 2006.
>>
>> So, what’s this discussion about?
>>
>> What I’m offering is setting up/running a public version of Code
>> Search for Fedora. It needs to be public because I want the open
>> source community as a whole to profit from it, and also I’m told you have
>> somewhat comparable tools internally anyway :).
>
> We have talked about a code search type application several times in
> the past, but never got as far as coding.
>
> Some things to note about our infrastructure:
>
> Everything we use must be under a free license:
> https://fedoraproject.org/wiki/Infrastructure_Licensing
> (which I don't think will be a problem, just noting it. ;)
Yep, that’s certainly the case. See
https://github.com/Debian/dcs/blob/master/LICENSE

>
> We have a process for bringing up new applications, called "Request For
> Resources":
> https://fedoraproject.org/wiki/Request_For_Resources?rd=Infrastructure/RFR
>
> Through this process we make sure there's more than one person that
> knows how the application works and can fix it, it's monitored right,
> etc.
I’ve only had a very quick glance so far, but the general idea sounds
reasonable. I’m not sure who’d want to work with me on the project,
but perhaps we can find someone who’s interested.

>
>>
>> My motivation comes from multiple places:
>>
>> 1) I’m fairly sure Fedora packages a slightly different set of
>> software than Debian, so running both DCS (Debian Code Search) and FCS
>> (Fedora Code Search) would enlarge the amount of searchable software.
>
> Probably true. Also, possibly differing versions...
>
>> 2) I’m interested in my work having a positive effect on the world (or
>> at least the open source community), and running multiple instances of
>> Code Search reduces its dependency on any single distribution, thereby
>> increasing its reliability and scope.
>
> Reasonable.
>
>> 3) Last but not least, I intend to try Fedora on one of my computers
>> to broaden my horizons. I figured getting in contact with some of you
>> while working on this project may be a good way to set foot in the
>> community and see whether I like it around here.
>
> Welcome. :) Hope you like it
>
>> In terms of what I’d need in order to make this project a success,
>> there are some hardware requirements (aside from, of course, time and
>> motivation):
>>
>> The in-memory index and searchable source code can be sharded across an
>> almost arbitrary number of different computers, which is necessary to
>> some extent because the index of a single shard is limited to
>> < 2 GB. At the moment, we are running 6 different
>> index-backend VMs, each serving 1.8G in-memory indexes and about 40G
>> of source code (including partial indexes). In order to grep through
>> the source quickly, the source is stored on local SSDs (as opposed to
>> a network block storage volume, or even regular HDDs).
>
> We currently don't have any SSDs. ;(
>
>> In addition to the actual data, we also need a web frontend to serve
>> and combine this data, and we have one more VM which scrapes
>> monitoring information and shows nice graphs about how the whole
>> system behaves.
>>
>> So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G
>> of RAM (for 2G of index + 2G page cache for grepping files) and 40G
>> SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and
>> also an SSD for caching entire query results. The monitoring VM needs
>> just one core and 2G of RAM.
>>
>> Does that sound reasonable and feasible? I’m not sure what kind of
>> hardware you have available for projects like this one, and currently
>> we’re sponsored by Rackspace because Debian doesn’t have that sort of
>> hardware easily available.
>
> Well, we don't have any virthosts with SSDs currently, so that could
> be a hangup. We do have virthosts and memory/SAS disks.
That’s a bummer. How many IOPS do your SAS disks provide? Is there any
chance that you could get some SSDs in the near to mid term future?
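
To make the numbers above easier to compare against your virthosts, here
is a rough Go sketch that tallies the setup I described. The names (vm,
shardsNeeded, totals) are made up for illustration and are not part of
the dcs code base. The 1800 MB per-shard figure approximates the 1.8G
indexes mentioned above, and since I did not state the size of the web
frontend’s SSD, it is counted as 0 here.

```go
package main

import "fmt"

// vm describes one class of virtual machine from the setup quoted above.
// All field names are hypothetical and exist only for this sketch.
type vm struct {
	name  string
	count int
	cores int
	ramGB int
	ssdGB int // 0 where the email gives no size
}

// shardsNeeded returns how many shards are required so that each shard's
// index stays at or below perShardMB (ceiling division).
func shardsNeeded(totalIndexMB, perShardMB int) int {
	return (totalIndexMB + perShardMB - 1) / perShardMB
}

// totals sums cores, RAM and SSD space over all VMs in the fleet.
func totals(fleet []vm) (cores, ramGB, ssdGB int) {
	for _, v := range fleet {
		cores += v.count * v.cores
		ramGB += v.count * v.ramGB
		ssdGB += v.count * v.ssdGB
	}
	return cores, ramGB, ssdGB
}

func main() {
	fleet := []vm{
		{name: "index-backend", count: 6, cores: 4, ramGB: 4, ssdGB: 40},
		{name: "web-frontend", count: 1, cores: 4, ramGB: 2, ssdGB: 0},
		{name: "monitoring", count: 1, cores: 1, ramGB: 2, ssdGB: 0},
	}

	cores, ram, ssd := totals(fleet)
	fmt.Printf("cores=%d ram=%dG ssd=%dG (frontend cache SSD not included)\n",
		cores, ram, ssd)

	// 6 shards of roughly 1.8G each give about 10800 MB of index in total.
	fmt.Printf("shards=%d\n", shardsNeeded(6*1800, 1800))
}
```

A bigger corpus would simply mean more shards of the same shape, so the
per-VM requirements should stay roughly the same.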

>
>> I feel like this email is long enough already, so I’ll just ask a
>> general: what do you think? Do you need any more information? Please
>> just ask, and keep me CC'ed, since I’m not subscribed to this list.
>
> I think before we go looking into hardware requirements, we should
> discuss the software? What's it written in? Is there a bunch of people
> who work on it? Or just you?
It’s written in Go, and mostly I’m working on it, with a few random
contributions from other people from time to time.

>
> We would want it packaged up as rpms for deployment, preferably for
> epel7 (to work on rhel7 hosts).
Yeah, I’ve heard about that, and it shouldn’t be a problem, I think. I
assume the Go compiler is in EPEL7.

>
> Would you be open to changes in code/architecture to meet our setup
> better?
Of course, yeah.

