[SOLVED] Re: html on Fedora -- looking for "where to go"

Joel Rees joel.rees at gmail.com
Wed Aug 10 23:19:48 UTC 2011


On Thu, Aug 11, 2011 at 1:46 AM, Patrick O'Callaghan
<pocallaghan at gmail.com> wrote:
> On Wed, 2011-08-10 at 09:40 -0500, Matthew J. Roth wrote:
>> Tim wrote:
>> >
>> > I used to use the underscore, as it made sense (to me, and other
>> > programmers) as a substitute for a space.  But there are a few drawbacks:
>> >
>> > 1.  Try explaining to the clueless what an underscore is, and how to
>> > type it.  Try doing that again and again, and you get real sick of it.
>> >
>> > 2.  You have the messy combinations of punctuation such as:
>> >
>> >         Shakespeare_-_The_Taming_of_the_Shrew
>> >
>> > Where it'd really be better to collapse all punctuation down to just one
>> > punctuation symbol.  That's "better" as in "easier and more convenient,"
>> > not more lexically correct.  Remember these are URIs (i.e. codes), not
>> > general language.
>> >
>> > 3.  If you ever want a URI printed on a newspaper or magazine, whoever
>> > types it may not be able to get an underscore into the text, unless
>> > they're familiar with how their publishing system works.  And, even
>> > then, they may fail.  Many of them will convert an underscore into an EM
>> > dash, since an underscore is hardly ever desired in print, yet proper
>> > dashes are wanted all the time.
>>
>> 4. Host Names (or 'labels' in DNS jargon) as traditionally defined by
>> RFC 952 and RFC 1123 may be composed of upper and lower case
>> characters, numeric characters, and the dash character.  RFC 2181
>> significantly liberalized the valid character set including the use of
>> "_" (underscore), but it is still a *good idea* to stick to the
>> traditionally defined characters[¹].
>
> It's become much worse than that with new classes of labels allowing
> non-ASCII character sets. See http://tools.ietf.org/html/rfc5890

Which speaks to the problems of context that I brought up earlier.
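
To make that concrete, here's a rough sketch -- Python 3, my own
illustration rather than anything out of the RFCs -- of the traditional
RFC 952/1123 "letters, digits, hyphen" rule on one side, and the ASCII
"xn--" form that RFC 5890-era non-ASCII labels actually travel in on
the other. The function name and the example domain are just made up
for illustration:

import re

# Traditional "LDH" label: letters, digits, hyphens, no leading or
# trailing hyphen, at most 63 characters.
LDH = re.compile(r'^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$')

def is_traditional_label(label):
    return bool(LDH.match(label))

print(is_traditional_label("fedora-list"))  # True
print(is_traditional_label("my_host"))      # False: underscore is not LDH

# A non-ASCII label only becomes a legal DNS label once IDNA encoding
# turns it into an ASCII "xn--" (Punycode) string; Python's built-in
# idna codec does the older (IDNA 2003) version of that dance.
print("例え.jp".encode("idna"))             # prints the ASCII xn-- wire form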

Ten years ago, Japanese people who used the internet could (more or
less) read English, and Latinized (romaji) spellings of Japanese used
in URLs didn't cause many problems either.

These days, ordinary Japanese people use the internet, and URLs in the
basic Latin character set are just as meaningless to them as telephone
numbers. Less meaningful, perhaps. (Yeah, they get force-fed English in
the primary grades, but that doesn't mean it's even comfortable for
them to "read" -- and comprehend -- new combinations of romaji.)

On the other hand, simply allowing kanji in URLs is going to create as
many problems as it solves. It would be almost easy to fold hiragana
and katakana together, but it isn't even possible to fold kanji and
kana. As a result, the ads you see in trains tend to show the katakana
or hiragana for a company's name typed into a search box, with the
search button being clicked, rather than printing a URL at all.
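
If it helps to see what I mean by "folding", here's a toy sketch (again
Python, my own illustration, not any standard algorithm): katakana and
hiragana sit in parallel Unicode blocks a fixed 0x60 apart, so mapping
one onto the other is nearly mechanical, while a kanji has no single
kana reading you could fold it onto.

def katakana_to_hiragana(text):
    out = []
    for ch in text:
        cp = ord(ch)
        # Katakana letters U+30A1..U+30F6 sit 0x60 above their hiragana twins.
        if 0x30A1 <= cp <= 0x30F6:
            out.append(chr(cp - 0x60))
        else:
            out.append(ch)
    return ''.join(out)

print(katakana_to_hiragana("フェドラ"))  # -> ふぇどら
print(katakana_to_hiragana("東京"))      # kanji just passes through unchanged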

As Paul points out, we should solve our problems in the local context
first, since it's the one we best understand, and the one we probably
need most to work in.

And then we try to figure out how to get things working in a broader
context, and at some point we have to resort to a layer of
translations (a human version of an API, perhaps?). And our minds tend
to handle so much of this so well that it's often a surprise how much
detail you have to add to mechanical rules. And then there are
problems that you just have to leave unsolved (and hope something
works out), like the issues with Japanese in URLs. And all of that is
when there are no bugs.

(Sorry about the rant, but not sorry enough to refrain from posting it.)

Joel Rees

