Systematically crawling Fedoraproject.org repositories

Ben St. John jbstjohn at google.com
Thu Sep 2 16:16:16 UTC 2010


On Thu, Sep 2, 2010 at 5:34 PM, Mike McGrath <mmcgrath at redhat.com> wrote:
> On Thu, 2 Sep 2010, Pascal Minnerup wrote:
>
>> Dear Fedora team,
>>
>> We on the Google Code Search project (www.google.com/codesearch) want to improve the quality of our index, and as part of that, would like to systematically crawl the Fedora
>> git repositories of fedoraproject.org, which we consider one of the major hosts of open source. Our crawlers use bandwidth throttling that should ensure that we don't
>> overstress your web servers.
>>
>> 1. Is it okay for you if we systematically crawl your git repositories for new source code?
>>
>> 2. How would you recommend we get the repository directories? Our current approach would be to get the git repositories of recently updated packages from this page:
>> http://pkgs.fedoraproject.org/gitweb/?o=age.
>>
>> 3. Are there any particular times or actions we should _avoid_?
>>
>> 4. Is there any particular person we should talk to in the future?
>>
>> An answer to these questions would be very helpful in improving the presence of Fedora code files in Code Search. We look forward to hearing from you.
>>
>
> Thanks for contacting us. We really don't know how it would all react,
> but I'm OK with it, provided we can contact you to change things later if
> things do go south.
>
>        -Mike

Of course! We'll try to give you a heads-up the first time we crawl
it, so if you do notice anything strange, you'll know who to blame!

Thanks,
Ben

Ben St. John
jbstjohn at google.com

Tel: +49 (0) 89 83 930-9054
Fax: +49 (0) 89 83 930-9001
Google Germany GmbH
Dienerstr. 12
80331 München

AG Hamburg, HRB 86891  |  Sitz der Gesellschaft: Hamburg
Geschäftsführer: Nikesh Arora, John Herlihy, Graham Law, Lloyd Martin,
Kent Walker


