docs.fp.o Search [Was: Re: CMS]

brett lentz wakko666 at gmail.com
Mon Jun 7 16:44:40 UTC 2010


On Mon, Jun 7, 2010 at 8:20 AM, Paul W. Frields <stickster at gmail.com> wrote:
> Including Infrastructure team gurus on this email thread for
> additional expert advice and assistance. :-) Setting reply-to docs@
> list.
>
> On Sat, Jun 05, 2010 at 05:55:05AM -0400, Eric Sparks Christensen wrote:
>> On 06/05/2010 03:14 AM, Ruediger Landmann wrote:
>> > One of the new features of Publican 2.0 that I haven't mentioned yet is
>> > that it creates an XML sitemap for search engine bots to crawl. You can
>> > find d.fp.o.'s sitemap here:
>> >
>> > http://docs.fedoraproject.org/Sitemap
>>
>> Awesome.
>>
>> >
>> > I've fed this to Google, Yahoo, and Bing, and they're all slowly
>> > re-indexing the site. The map now contains a little over 2,000 URLs and
>> > at the time of writing, Google has crawled about 350 of them.
>>
>> I know that Google has some algorithm that figures out how often your
>> site changes and then crawls more or less frequently.  Not sure if we
>> could work with Google on scheduling this more or less around the time
>> of a release.  Of course I'm guessing that we won't have this big
>> re-structuring next time, either.
>>
>> >
>> > The dilemma we face is the decision of when to turn off the 404
>> > redirect. For the sake of all the existing links scattered around the
>> > net (both on the Fedora Project site and off it), we'd want to postpone
>> > this as far as possible. On the other hand, any bot attempting to verify
>> > that link gets a page served up and probably concludes that the link is
>> > valid; I suspect that if these links 404ed, they'd start to evaporate
>> > from search results.
>>
>> Isn't there a type of redirect (302?) that tells you that you are being
>> redirected so you don't think the URL is valid?
>
> I thought that 301 and 302 both do this, but 302 generally is used for
> temporary redirects, and 301 for permanent ones.  So 301 would be the
> one you're thinking of here perhaps?
>

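It's 301 (Moved Permanently), which tells clients that the old URL has
been replaced for good; 302 only signals a temporary redirect.
RFC 2616 has the details:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3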
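
To make the 301 vs. 302 distinction concrete, here is a minimal sketch
of how a link checker might act on the status codes described in that
section of the RFC. The function name and the returned labels are
hypothetical, and real search engine bots may well behave differently;
this only mirrors the RFC's guidance that a 301 should cause stored
links to be updated while a 302 should not.

```python
from http import HTTPStatus


def link_action(status: int) -> str:
    """Decide what a (hypothetical) link checker does with a stored URL."""
    if status == HTTPStatus.MOVED_PERMANENTLY:  # 301
        # Permanent move: future references should use the new URL.
        return "update"
    if status in (HTTPStatus.FOUND, HTTPStatus.TEMPORARY_REDIRECT):  # 302, 307
        # Temporary move: keep using the original URL.
        return "keep"
    if status in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):  # 404, 410
        # Dead link: a candidate for removal from an index.
        return "drop"
    return "unchanged"
```

For example, link_action(301) returns "update", while link_action(302)
returns "keep".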

>> > Given that existing links around the net are pointing to (at most
>> > recent) the F12 versions of docs, there will be no need to keep the 404
>> > redirect in place past October; however, if we want to start allowing
>> > dead links to 404 out rather than poison search results, maybe we should
>> > bring that date forward? The sooner we do this, the sooner search will
>> > start working properly...
>>
>> Yeah, the sooner the better, IMO.
>
> I wonder how long you would need 301's in place before it's safe to
> remove them?  Because I think Infrastructure's not keen on maintaining
> a big list of these.
>

I'm not aware of any specific time frame defined by the RFCs or other
standards; each site sets its own policy for redirect maintenance.
There are three basic options:

1. Watch the access logs until a pre-determined share of requests
(a majority, say) uses the new URLs. Once that threshold is met,
remove the redirect and accept that users who still haven't updated
their links will receive a 404 (Not Found).

2. Set a date-based transition period. After the specified date,
either remove the redirect, or apply the logic from option 1 and push
the cutoff date out accordingly.

3. Keep the redirect indefinitely, or until it becomes too costly to
maintain. Many organizations use this one, for better or worse.
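
Options 1 and 2 could be sketched roughly as follows. This is only an
illustration under a few assumptions: the access log is in the Common
Log Format, a request answered with 301 counts as a hit on an old URL,
a request answered with 200 counts as a hit on a new one, and the
function names, the threshold, and the cutoff date are all
hypothetical values that the site would pick for itself.

```python
import re
from datetime import date

# Pulls the status code that follows the quoted request field of a
# Common Log Format line, e.g. ... "GET /docs/ HTTP/1.1" 301 234
STATUS_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" (\d{3}) ')


def old_url_share(log_lines):
    """Fraction of 200/301 requests that were 301s (hits on old URLs)."""
    old = new = 0
    for line in log_lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # skip lines that do not look like log entries
        status = int(match.group(1))
        if status == 301:
            old += 1
        elif status == 200:
            new += 1
    total = old + new
    return old / total if total else 0.0


def can_remove_redirect(log_lines, today, max_old_share, cutoff=None):
    """Option 1: traffic threshold met. Option 2: cutoff date reached."""
    if old_url_share(log_lines) <= max_old_share:
        return True
    return cutoff is not None and today >= cutoff
```

With one 301 and three 200 entries in the log, old_url_share() gives
0.25, so a threshold of 0.5 would allow removal while a threshold of
0.1 would not unless the cutoff date has passed.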


---Brett.

