Dear Mark,
I thank you for this list of remarks. Being engaged in the MDRS
project, a multilingual distributed registry system for our own
information service to user communities and for open use, I share
many of these points. Let me add the comments and suggestions we
have.
At 07:52 23/04/2006,
Mark Davis wrote:
The UTC strongly supports many of the goals of the document,
including especially improving the security of IDNs, and updating
the version of Unicode used in NamePrep and StringPrep (since
the old version of Unicode they require excludes or hampers many
languages).
With my user-QA hat on: there is a need for a list, or a place to
record, the different available libraries and the Unicode version
each supports. The same goes for the langtag libraries. This should
be a normal part of IDNA and RFC 3066bis "after-sales" support.
There are, however, a number
of areas of concern.
As a general issue, we'd urge closer cooperation between the
IAB and the Unicode consortium on the document, so that the character
encoding and software internationalization issues can be reviewed
by experts in the field, and accurately represented in the document.
There are two layers: characters and languages. They are in constant
confusion at the IETF.
There would certainly be an advantage in the IETF having better
knowledge of both layers. For years the IETF has been lobbied by
Unicode members in this area. The result is confusion, as the RFC
3066bis episode and its aftermath show. I understand that the
commercial nature of the Unicode Consortium is a problem for the
IETF. But the result, for the Unicode members participating in the
IETF, is a natural reflex of exclusion: they try to keep a coherent
doctrine. This can create hurtful behaviour and ethical problems.
This is why I would strongly advocate an MoU between the IETF and
Unicode on the character-related aspects. Such an MoU should clearly
identify ISO as the reference, Unicode as a non-exclusive but leading
source of expertise, and the final leadership of users in the usage
areas.
This would make it possible to differentiate the Internet's
architectural character layer, where Unicode truly is a reference,
from the lingual layer, where there is total confusion.
Internationalization (extending the character set with non-ASCII
characters) is by nature inapplicable to multilingualization
(parallel support of mutually exclusive terminologies expressing the
same concepts). The result is cacophony, as in IDNA.
As RFC 3066bis shows, the Unicode/IETF doctrine and tools (libraries
and the CLDR) are unable at this stage to address dialects,
sociolects and idiolects. This is quite worrying, since this may be
an area of urgent standardization effort, and there is certainly
strong demand prompted by the Internet's possibilities. This is a
significant MDRS concern, and we know we will actively use digital
space names (not only on the Internet) to support them.
The chief area
of concern is section 4.3.
4.3. Combining
Characters and Character Components
One thing that
increases IDNA complexity and the need for
normalization is that combining characters are permitted. Without
them, complexity might be reduced enough to permit more easy
transitions to new versions. The community should consider whether
combining characters should be prohibited entirely from IDNs.
A consequence of this, of course, is that each new language or
script would require that all of its characters have Unicode
assignments to specific, precomposed, code points, a model that the
Unicode Consortium has rejected for Roman-based scripts. For
non-Roman scripts, it seems to be the Unicode trend to define such
code points.
At some level, telling the users and proponents of scripts that, at
present, require composing characters to work the issues out with
the Unicode Consortium in a way that severely constrains the need
for those characters seems only appropriate. The IAB and the IETF
should examine whether it is appropriate to press the Unicode
Consortium to revise these policies or otherwise to recommend
actions that would reduce the need for normalization and the related
complexities.
The descriptions
and recommendations in this section are simply not feasible.
They do not recognize the fundamental importance of combining
marks as an integral component of a great many scripts, nor do
they recognize the fundamental need for compatibility that is
required of the Unicode Standard. Asking for combining characters
to be removed is akin to asking English vowels to be removed,
and all possible syllables to be encoded instead. There are,
as well, a number of purely factual errors. For example, "it
seems to be the Unicode trend to define such code points"
is simply incorrect. This section serves no purpose but to betray
a basic lack of understanding of scripts; it needs to be removed
entirely.
My worry here is the reference to Unicode as if it had the capacity
to do this. It simply denotes a lack of understanding of the issue
being addressed. The world does not go by Unicode but by ISO. Even
if Unicode and ISO are considered hand in glove, such a text shows
the need for the MoU I advocate, at least to underline that ISO is
the common reference, if the IETF wants to be international. And
even if the IETF wanted to limit itself to an internationalized US
Internet, such a request would be a request for balkanization.
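To make the combining-marks point concrete, here is a minimal sketch
using Python's standard unicodedata module (an illustration of mine,
not part of any of the documents under review):

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"
# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9" and len(composed) == 1

# Devanagari "ki" (U+0915 KA + U+093F VOWEL SIGN I) has no precomposed
# form at all: NFC leaves the combining vowel sign in place. Banning
# combining characters would ban such syllables outright.
ki = "\u0915\u093f"
assert unicodedata.normalize("NFC", ki) == ki
```

This is why prohibiting combining characters amounts to prohibiting
entire scripts, not merely simplifying normalization.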
A second area
of major concern is Section 2.2.3.
2.2.3. Normalization
and Character Mappings
Unicode contains several different models for representing
characters. The Chinese (Han)-derived characters of the "CJK"
languages are "unified", i.e., characters with common derivation and
similar appearances are assigned to the same code point. European
characters derived from a Greek-Roman base are separated into
separate code blocks for "Latin", Greek and Cyrillic even when
individual characters are identical in both form and semantics.
Separate code points based on font differences alone are generally
prohibited, but a large number of characters for "mathematical" use
have been assigned separate code points even though they differ from
base ASCII characters only by font attributes such as "script",
"bold", or "italic". Some characters that often appear together are
treated as typographical digraphs with specific code points assigned
to the combination, others require that the two-character sequences
be used, and still others are available in both forms. Some
Roman-based letters that were developed as decorated variations on
the basic Latin letter collection (e.g., by addition of diacritical
marks) are assigned code points as individual characters, others
must be built up as two (or more) character sequences using
"composing characters".
This section betrays a lack of understanding of the fundamental
differences between Han characters and the Latin, Greek, and
Cyrillic scripts.
Many of these differences result from the desire to maintain
backward compatibility while the standard evolved historically, and
are hence understandable. However, the DNS requires precise
knowledge of which codes and code sequences represent the same
character and which ones do not. Limiting the potential difficulties
with confusable characters (see Section 2.2.6) requires even more
knowledge of which characters might look alike in some fonts but not
in others. These variations make it difficult or impossible to apply
a single set of rules to all of Unicode. Instead, more or less
complex mapping tables, defined on a character by character basis,
are required to "normalize" different representations of the same
character to a single form so that matching is possible.
The Unicode consortium *does* supply a precise mechanism for
determining when two strings represent the same underlying abstract
characters. This mechanism does apply a single set of rules to all
of Unicode, based on data in the Unicode Character Database.
This paragraph also conflates the confusables issue with character
equivalence. These are separate issues: there are a great many
instances where characters are confusable yet not at all equivalent
(such as zero and the letter O).
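The separation Mark describes can be checked mechanically; a small
sketch with Python's standard unicodedata module (illustrative only):

```python
import unicodedata

# Canonically equivalent: precomposed "Å" (U+00C5) versus "A" plus
# COMBINING RING ABOVE (U+030A). NFC maps both to the same form.
assert unicodedata.normalize("NFC", "\u00c5") == unicodedata.normalize("NFC", "A\u030a")

# Confusable but NOT equivalent: digit zero versus letter O.
# No normalization form identifies them.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "0") != unicodedata.normalize(form, "O")
```

Equivalence is a property of the encoding; confusability is a
property of glyph appearance, and needs separate machinery, not
normalization.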
... The fact that most or all scripts included in Unicode have been
initially incorporated by copying an existing standard more or less
intact has impact on the optimization of these algorithms and on
forward compatibility. Even if the language is known and
language-specific rules can be defined, dependencies on the language
do not disappear. Any canonicalization operations that depend on
more than short sequences of text is not possible to do without
context. DNS lookups and many other operations do not have a way to
capture and utilize the language or other information that would be
needed to provide that context.
First, it is
neither "most" nor "all". Very few scripts,
proportionately, have been incorporated by copying an existing
standard. Second, "Any canonicalization operations that
depend on more than short sequences of text is not possible to
do without context...." is difficult to make sense of. One
would have to explain the sense of "canonicalization"
that is being discussed. It could be as trivial as "language-based
canonicalization is impossible without language information",
which is true, but above the document argues against using language-based
equivalences on a global basis (and for very good reason!)
This is clearly the result of a layer-violation confusion by network
architects between character issues and language issues. The
solution is not to change ISO but to prevent the problem from
existing on the network side.
===
Other areas of
concern:
(more properly
"Roman", see below)
The common modern practice in the naming of the script is to
use the term "Latin", not "Roman". Whether
or not one thinks that should not have been the case, insisting
on older terms is pointless, and not germane to the purpose of
the document.
+1. They were used bidirectionally in Latium before Rome was even
born: they are Etruscan.
When writing
or typing the label (or word), a script must be selected
and a charset must be picked for use with that script.
This is confusing
charset, keyboard and script. Saying "a script must be selected"
is *neither* true from the user's perspective, nor does it at
all match the implementation pipeline from keypress to storage
of a label. What may have been confusing for the authors is that
sometimes keyboards that are listed for selection are sorted
by script; that does not, however, mean that a "script is
selected".
The proper phrase, if more substantial changes are not made to the
wording, would be "a keyboard must be selected". (Even that is quite
odd, since it implies that this is done each time a user types a
label.)
This is what happens with IDNs: only so-called "IDN.IDN" names can
be typed with a single keyboard.
If that charset, or the local charset being used by the relevant
operating system or application software, is not Unicode, a further
conversion must be performed to produce Unicode. How often this is
an issue depends on estimates of how widely Unicode is deployed as
the native character set for hardware, operating systems, and
applications. Those estimates differ widely, with some Unicode
advocates claiming that it is used in the vast majority of systems
and applications today. Others are more skeptical, pointing out
that:
o ISO 8859 versions [ISO.8859.1992] and even national variations of
ISO 646 [ISO.646.1991] are still widely used in parts of Europe;
o code-table switching methods, typically based on the techniques of
ISO 2022 [ISO.2022.1986], are still in general use in many parts of
the world, especially in Japan with Shift-JIS and its variations;
o computing, systems, and communications in China tend to use one or
more of the national "GB" standards rather than native Unicode;
o and so on.
Not all charsets define their characters in the same way and not all
pre-existing coding systems were incorporated into Unicode without
changes. Sometimes local distinctions were made that Unicode does
not make, or vice versa. Consequently, conversion from other systems
to Unicode may potentially lose information.
Most of this section is unnecessary, and its thrust is misleading.
The only issue is that "local distinctions" are lost when converting
to Unicode; that does not happen when converting from any of the
examples listed. This passage implies that there are significant
problems in mapping to Unicode when doing IDN, and there simply
aren't.
This can only be settled by a complete registry documenting all the
charsets actually in existence and use. We cannot go by "most"; we
must go by "every".
... Worse, one needs to be reasonably familiar with a script and how
it is used to understand how much characters can reasonably vary as
the result of artistic fonts and typography. For example, there are
a few fonts for Latin characters that are sufficiently highly
ornamented that an observer might easily confuse some of the
characters with characters in the Thai script.
The confusion
of Latin with Thai is a red herring. It would take an exceedingly
contrived scenario for it to present a problem. There are plenty
of realistic scenarios involving confusables across, say, Latin
and Cyrillic.
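A realistic Latin/Cyrillic case can be sketched in Python (the
per-character-name script check below is my own simplification,
purely for illustration):

```python
import unicodedata

latin = "paypal"            # all Latin letters
spoof = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A

# Visually near-identical in most fonts, yet distinct code points,
# and no normalization form equates them.
assert latin != spoof
assert unicodedata.normalize("NFKC", latin) != unicodedata.normalize("NFKC", spoof)

def scripts(label):
    """Crude script tally keyed on Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in label}

assert scripts(latin) == {"LATIN"}
assert scripts(spoof) == {"LATIN", "CYRILLIC"}  # mixed scripts: suspicious
```

Registries typically address exactly this with mixed-script
restrictions, which is a policy mechanism, not a normalization one.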
... IDNA prohibits these mixed-directional (or bidirectional)
strings in IDN labels, but the prohibition causes other problems
such as the rejection of some otherwise linguistically and
culturally sensible strings. As Unicode and conventions for handling
so-called bidirectional ("BIDI") strings evolve, the prohibition in
IDNA should be reviewed and reevaluated.
Deviating from
the practices already built into IRI would be a mistake. As the
document recognizes above, it cannot be a goal to represent all
possible "linguistically and culturally sensible strings"
in IDNs. The restrictions on BIDI are ones that have achieved
broad consensus as the minimal ones to help avoid some fairly
serious security issues.
This is a character/language layer violation. Mark presents a sound
character-layer solution, but it does not address all of the
language issue. If there is a real problem at the language layer,
that issue should be specified separately. This is the only way to
have an operational service, and later perhaps to improve it. The
DNS does not support upper case; IDNA restricts BIDI. Maybe the
solution found for mailnames will help in finding a solution here.
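As a reminder of what the character-layer mechanism actually does,
Python's built-in "idna" codec (IDNA2003: Nameprep followed by
Punycode) shows both the ASCII-compatible encoding and the case
folding the DNS imposes; a sketch:

```python
# Python's built-in "idna" codec implements IDNA2003: Nameprep
# (which, among other mappings, lower-cases) followed by Punycode.
label = "bücher"
assert label.encode("idna") == b"xn--bcher-kva"

# Nameprep case-folds, so mixed case reaches the DNS in one form only,
# matching the DNS's case-insensitive handling of ASCII labels.
assert "Bücher".encode("idna") == label.encode("idna")

# Decoding restores the Unicode label.
assert b"xn--bcher-kva".decode("idna") == label
```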
4.1.2. Elimination
of word-separation punctuation
... We might even consider banning use of the hyphen itself in
non-ASCII strings or, less restrictively, in strings that contained
non-Roman characters.
This section
is not well motivated. The authors need to justify why such characters
represent a problem (and one of such a serious nature that hyphens
should be disallowed).
Removing the hyphen would remove the possibility of including
langtags in a domain name to support multilingual versions of a
site. Better to scrap RFC 3066bis, then.
-----
* Section 2.2.3:
"characters that are essentially identical will not match"
What is meant by "essentially identical"? Does this
mean identical in appearance, identical in internal representation,
identical in semantics, canonically equivalent (same NFC forms),
or compatible equivalent (same NFKC forms)? The intent needs
to be clarified, otherwise the statement is subject to misinterpretation.
+1
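For the record, the candidate readings diverge on concrete inputs; a
Python sketch of the canonical/compatibility split (illustrative
only):

```python
import unicodedata

lig = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

# Canonical normalization (NFC) preserves the ligature...
assert unicodedata.normalize("NFC", lig) == lig
# ...while compatibility normalization (NFKC) folds it to "fi".
assert unicodedata.normalize("NFKC", lig) == "fi"

# So the ligature and "fi" are "essentially identical" under NFKC
# but not under NFC; the document must say which reading it means.
assert unicodedata.normalize("NFC", lig) != "fi"
```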
* Section 2.2.3:
"This Unicode normalization process [does not account for]
equivalences that are language or script dependent"
What is meant by "script-dependent equivalences"? Can you provide an
example?
What are language equivalences for the DNS? USA.com and
United-States-of-America.com are language-equivalent domain names.
Does registering one forbid registering the other?
* Section 2.2.3:
"U+00F8 [...] and U+00F6 [...] are considered to match in
Swedish"
"Match" needs some clarification. In accordance with
Swedish standards, when collating with Swedish locale, all major
implementations match these characters at the first and second
level, but not at a lower level. Thus they are not exact matches:
this might be better phrased in terms of equivalence.
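At the encoding layer the two characters are unrelated, as the
Unicode Character Database shows (locale-sensitive collation levels,
as in ICU, are a separate mechanism not shown here); a Python check:

```python
import unicodedata

o_stroke = "\u00f8"     # ø LATIN SMALL LETTER O WITH STROKE
o_diaeresis = "\u00f6"  # ö LATIN SMALL LETTER O WITH DIAERESIS

# ø has no canonical decomposition at all...
assert unicodedata.decomposition(o_stroke) == ""
# ...while ö decomposes to "o" + COMBINING DIAERESIS.
assert unicodedata.normalize("NFD", o_diaeresis) == "o\u0308"

# Hence no normalization form equates them; any Swedish "match" is a
# locale-specific collation rule, not an encoding-level equivalence.
assert unicodedata.normalize("NFKD", o_stroke) != unicodedata.normalize("NFKD", o_diaeresis)
```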
* Section 2.2.3:
"Even if the language is known and language-specific rules
can be defined, dependencies on the language do not disappear"
It is unclear what this means. Could you give an example?
+1
* Section 2.2.1:
"Those characters are not treated as equivalent according
to the Unicode consortium while...".
This is somewhat ad hominem. It should rather be "...according
to the Unicode Standard while..."
ISO 10646 is the only reference that should be used; the word
"Unicode" should not appear. This is like quoting WGs in an RFC
(unless there is an MoU).
* Section 2.2.1:
"..confusion in Germany, where the U+00F8 character is never
used in the language".
That is not true: there are entries containing that character in the
Duden dictionary.
* Section 2.2.4:
"This is because [...] some glyphs [...] have been assigned
different code points in Unicode".
This is incorrect: glyphs are not assigned to code points; characters
are.
* Section 2.2.6:
"Is the answer the same for words two [sic] different languages
that translate into each other?".
This is completely orthogonal to IDNs (cf "Is 'cat' the
same as 'gato' or the same as 'katze'?").
+1
* Section 2.2.7:
"the IESG statement [...] that a registry should have a
policy about the scripts, languages, code points and text directions".
This appears not to be an accurate paraphrase of
(http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt). That
document rather says a registry "MIGHT want to prevent particular
characters", "MIGHT want to automatically generate a list of (...)
strings and suggest that they also be registered", and lastly that
"it is suggested that a registry act conservatively". There is no
"SHOULD" wording and, for instance, text direction is not mentioned.
Such a policy is of little real interest anyway, as IDNA does not
impose it on further domain-name levels.
* Section 2.2.8:
"This maybe [...] because many other applications are internally
sensitive only to the appearance of characters and not to their
representation".
This is reversed. The vast majority of applications are internally
sensitive only to the representation, not to the appearance. An
exception would be OCR, for example.
+1
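The representation/appearance distinction is easy to demonstrate:
two strings that render identically can differ in representation,
and ordinary comparison sees only the representation; a Python
sketch:

```python
import unicodedata

composed = "caf\u00e9"     # "café" with precomposed é (4 code points)
decomposed = "cafe\u0301"  # "café" with e + combining acute (5 code points)

# Identical in appearance, different in representation:
assert composed != decomposed
assert (len(composed), len(decomposed)) == (4, 5)

# Representation-sensitive software must normalize before comparing.
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```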
* Section 2.2.8:
"A change in a code point assignment (...) may be extremely
disruptive".
This suggests that the consortium capriciously changes code points.
After the merger with ISO 10646 there was only one point at which
the Unicode consortium changed code points: in Unicode 2.0.0 (July
1996), the characters in the Korean Hangul block were moved to be
part of a new, larger block with all 11,172 Hangul syllables. As a
result of the disruption this caused, the Unicode Consortium and
ISO/IEC SC2 resolved never to change code points again, and no
changes have been made since.
* Section 3.1.1:
"...such as code points assigned to font variations...".
Which characters does this refer to? Just the characters that are
resolved by NFKC normalization, or others as well?
* Section 4.5:
"the whois protocol itself (...) is ASCII-only".
This appears to be inaccurate. The Whois protocol (http://www.ietf.org/rfc/rfc3912.txt?number=3912)
has no mechanisms to indicate which character encoding is being
used, but the protocol is 8-bit clean and it is indeed used so
by many (for instance, DENIC has a UTF-8 implementation up and
running).
+1
To be noted: multilingual domain names will be much closer to the
people, and therefore to their local laws. Privacy regulations would
be better respected by banning the current Whois service.
In addition to these remarks, I think that the solution to the
problems discussed is a character/implementer/language debate to
define a globally supported, acceptable restriction of ISO 10646 for
digital naming. Its purpose would not be to support every possible
name (which IDNA does not do anyway) but to provide a secure
(anti-phishing) threehexadecimal network-name coding system.
jfc |