On Thu, Aug 11, 2011 at 1:46 AM, Patrick O'Callaghan pocallaghan@gmail.com wrote:
On Wed, 2011-08-10 at 09:40 -0500, Matthew J. Roth wrote:
Tim wrote:
I used to use the underscore, as it made sense (to me, and other programmers) as a substitute for a space. But there are a few drawbacks:
- Try explaining to the clueless what an underscore is, and how to
type it. Try doing that again and again, and you get real sick of it.
- You have the messy combinations of punctuation such as:
Shakespeare_-_The_Taming_of_the_Shrew
There it would really be better to collapse all the punctuation down to a single punctuation symbol. That's "better" as in "easier and more convenient," not more lexically correct. Remember these are URIs (i.e. codes), not general language.
- If you ever want a URI printed in a newspaper or magazine, whoever
types it may not be able to get an underscore into the text unless they're familiar with how their publishing system works. And, even then, they may fail. Many of them will convert an underscore into an em dash, since an underscore is hardly ever desired in print, yet proper dashes are wanted all the time.
- Host Names (or 'labels' in DNS jargon) as traditionally defined by
RFC 952 and RFC 1123 may be composed of upper and lower case characters, numeric characters, and the dash character. RFC 2181 significantly liberalized the valid character set including the use of "_" (underscore), but it is still a *good idea* to stick to the traditionally defined characters[¹].
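For illustration only, here is a rough Python sketch of the two mechanical points quoted above: collapsing runs of punctuation into a single hyphen, and checking a name against the traditional letters-digits-hyphen rule. The function names are made up for this example, and the 63-character limit is the usual DNS label length limit rather than something stated in the quoted text.

```python
import re

def collapse_punctuation(name: str) -> str:
    """Replace each run of non-alphanumeric ASCII characters with one hyphen."""
    # Leading and trailing hyphens are stripped so the result stays tidy.
    return re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-")

# Traditional rule (RFC 952 as relaxed by RFC 1123): letters, digits and
# hyphens only, with no hyphen at the start or end of the label.
# The 1 to 63 length range is an assumption taken from the DNS label limit.
_LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_traditional_label(label: str) -> bool:
    """Return True if the label uses only the traditionally allowed characters."""
    return _LDH_LABEL.fullmatch(label) is not None

if __name__ == "__main__":
    print(collapse_punctuation("Shakespeare_-_The_Taming_of_the_Shrew"))
    print(is_traditional_label("my_host"), is_traditional_label("my-host"))
```

Note that the underscore fails the traditional check even though the DNS itself would carry it, which is the point of the "good idea" advice.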
It's become much worse than that with new classes of labels allowing non-ASCII character sets. See http://tools.ietf.org/html/rfc5890
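To make the non-ASCII label issue concrete, here is a small Python sketch of how such a name is turned into the ASCII form that actually goes on the wire. One caveat: Python's built-in "idna" codec implements the older IDNA 2003 rules (RFC 3490), not the RFC 5890 family linked above, so treat this as an approximation of the idea. The helper names and the example domains are made up.

```python
def to_ascii(hostname: str) -> str:
    """Encode each non-ASCII label of a host name into its "xn--" form."""
    # The stdlib "idna" codec follows IDNA 2003 (RFC 3490), not RFC 5890.
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname: str) -> str:
    """Decode "xn--" labels back into their Unicode form."""
    return hostname.encode("ascii").decode("idna")

if __name__ == "__main__":
    for name in ("bücher.example", "日本語.jp"):
        encoded = to_ascii(name)
        print(name, "->", encoded, "->", to_unicode(encoded))
```

The encoded form is all letters, digits, and hyphens, so it still fits the traditional character set, but it is no more readable to a human than a telephone number.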
Which speaks to the problems of context that I brought up earlier.
Ten years ago, Japanese people who used the internet could (more or less) read English, and Latinized (romaji) spellings of Japanese used in URLs didn't cause many problems either.
These days, ordinary Japanese people use the internet, and basic Latin URLs are just as meaningless as telephone numbers to them. Less meaningful, perhaps. (Yeah, they get force-fed English in primary grades, but that doesn't mean it is even comfortable for them to "read" -- and comprehend -- new combinations of romaji.)
On the other hand, simply allowing kanji to be used in URLs is going to create as many problems as it solves. Folding hiragana and katakana together would be almost easy, but folding kanji and kana is not even possible. As a result, the ads you see in trains tend to show the katakana or hiragana for a company's name in a search box, with the search button being clicked.
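As a rough illustration of why the kana half is "almost easy", here is a Python sketch that folds katakana onto hiragana. It relies on the fact that the two blocks are laid out in parallel in Unicode, 0x60 code points apart. The function name is made up, and the choice to normalize half-width katakana first is my own assumption about what a sensible folding would do.

```python
import unicodedata

# Katakana U+30A1..U+30F6 line up with hiragana U+3041..U+3096,
# so subtracting 0x60 maps each katakana to its hiragana counterpart.
_KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def fold_kana(text: str) -> str:
    """Fold katakana (including half-width forms) onto hiragana."""
    # NFKC turns half-width katakana into the ordinary full-width forms.
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(_KATAKANA_TO_HIRAGANA)

if __name__ == "__main__":
    for name in ("トヨタ", "とよた", "豊田"):
        print(name, "->", fold_kana(name))
```

The kanji spelling passes through unchanged. There is no table like this for it, because getting from kanji to kana means knowing how the word is read, and that is a dictionary lookup with more than one possible answer.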
As Paul points out, we should solve our problems in the local context first, since it's the one we best understand, and the one we probably need most to work in.
And then we try to figure out how to get things working in a broader context, and at some point we have to resort to a layer of translations (a human version of an API, perhaps?). Our minds tend to handle so much of this so well that it's often a surprise how much detail you have to add to mechanical rules. And then there are problems that you just have to leave unsolved (and hope something works out), like the issues with Japanese in URLs. And that's when there are no bugs.
(Sorry about the rant, but not sorry enough to refrain from posting it.)
Joel Rees