Comments
Submitted On 08-OCT-2002
verdy_p
More important than the support of additional character
blocks is the support of new properties defined in technical
documents of Unicode 3.2, as they give accurate information that
is NEEDED now for all internationalized applications that
target the existing supported languages. Notably: Chinese,
Japanese, Korean, which don't use the space character and
for which we need a way in the GUI to display text
accurately. The most notable quirks for now occur when
handling multiline display. There's nothing in Java that handles
line wraps or word wraps correctly (everything is based on the
assumption that a space character is enough to delimit "word"
tokens).
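To make the problem concrete, here is a minimal sketch of what goes wrong when "words" are found by splitting on spaces. The class and method names (SpaceSplitDemo, countSpaceTokens) are made up for this illustration and are not part of any JDK API.

```java
// Illustration only: SpaceSplitDemo and countSpaceTokens are hypothetical names.
class SpaceSplitDemo {
    /** Counts the tokens obtained by splitting the text on a single space. */
    static int countSpaceTokens(String text) {
        return text.split(" ").length;
    }

    public static void main(String[] args) {
        String english = "the quick brown fox";
        // A Japanese phrase of several words, written without any space character.
        String japanese = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";

        System.out.println(countSpaceTokens(english));  // prints 4
        System.out.println(countSpaceTokens(japanese)); // prints 1
    }
}
```

The whole Japanese phrase comes back as one unbreakable token, so a wrapping algorithm built on this assumption has no place to break the line.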
We really need new classes to parse and handle correctly the
internationalized text. This is a first step that could be
implemented at least with limited support for CJK languages in
JFC/Swing. A more formal and complete interface could be
defined and discussed later in the core java.lang.Character
classes family, to get access to newly defined character
properties (composition rules of diacritics for
Indic/Thai/Hebrew/Arabic/Vietnamese languages, composition
and input methods for Chinese/Japanese with
Bopomofo/Hiragana/Katakana/Latin and external dictionaries,
canonicalization and equivalences, global default Unicode
collation and ordering, East-Asian width and correspondence
between half-width and full-width CJK characters, character
directionality and BiDi text handling between a logical and
physical encoding form for Hebrew and Arabic, line wrap
opportunities using at least pairs of character classes, and
their consequence on the BiDi algorithm...).
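As a rough idea of what access to two of these properties can look like, here is a sketch that assumes a JDK providing java.text.Normalizer (Java 6 or later) and Character.getDirectionality. The class and helper names (UnicodePropertiesDemo, canonicallyEquivalent, isRightToLeft) are hypothetical; only canonical equivalence and directionality are covered, not the other properties listed above.

```java
import java.text.Normalizer;

// Illustration only: the class and helper names are hypothetical.
class UnicodePropertiesDemo {
    /** Compares two strings after normalizing both to NFC. */
    static boolean canonicallyEquivalent(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    /** True for characters whose BiDi class is R (Hebrew) or AL (Arabic). */
    static boolean isRightToLeft(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT versus the precomposed U+00E9.
        System.out.println(canonicallyEquivalent("e\u0301", "\u00E9")); // prints true
        System.out.println(isRightToLeft('\u05D0')); // HEBREW LETTER ALEF, prints true
        System.out.println(isRightToLeft('A'));      // prints false
    }
}
```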
Technical papers in Unicode 3.2 often contain sample
template code in C or Java to demonstrate the general
principles of such text handling needed in applications, and
that could then be used more generally within existing JFC/Swing
lightweight components or in AWT text components.
Additionally some TrueType/OpenType fonts made for Unicode
are now including such properties (notably in their kerning and
composition tables, such as those provided by the freely
downloadable Microsoft Typeset Library). But general support
of Unicode in GUI applications requires an extended OpenType
API to create logical composite fonts, made of a set of fonts
that each contain a subset of the Unicode character set.
So this RFE targets distinct development paths in the future
design of Tiger:
- core changes in java.lang.Character to handle Unicode
properties
- core changes in java.lang.String to handle character
composition and string composition
- GUI changes in AWT to handle new OpenType font
properties
- GUI changes in JFC/Swing to use the new core
String/Character properties and methods, and the new font
properties, in order to better handle JTextArea and JEditorPane
components with international text.
Submitted On 26-NOV-2002
verdy_p
Why not base the Character class on int instead of char?
Couldn't there be two new constructors such as (int) for
UCS4/UTF-32 and (char, char) or (short, short) for UTF-16
surrogate pairs?
Adding a new wchar native type would be tricky, as it would
imply a change in the JVM, to support new bytecodes. But
extending the Character class (and the String class as well)
should be possible.
This would create two sets of primitives for characters: one
based on char UTF-16 code values, and one based on UCS4
codepoints. They would have distinct constraints (because of
the particular behavior of surrogate pairs).
If modifying the Character class is not possible, maybe we
would need a new WCharacter class with similar semantics to
the Character class, except that it would use an int
implementation instead of a char, and that it could provide
additional methods to handle the conversion from a
WCharacter instance to one or two Character instances.
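A minimal sketch of such a WCharacter class could look like the following. It assumes the static helpers Character.toCodePoint, Character.toChars, Character.isSurrogatePair and Character.isValidCodePoint are available (they are in J2SE 5.0 and later); the method names intValue and toCharacters are placeholders chosen for this sketch, not a proposed API.

```java
// Hypothetical sketch: WCharacter is the class proposed in this comment,
// it does not exist in the JDK.
final class WCharacter {
    private final int codePoint; // a full code point, 0..0x10FFFF

    /** Builds a WCharacter from a UCS4/UTF-32 code point. */
    WCharacter(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        this.codePoint = codePoint;
    }

    /** Builds a WCharacter from a UTF-16 surrogate pair. */
    WCharacter(char high, char low) {
        this(combine(high, low));
    }

    private static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    int intValue() {
        return codePoint;
    }

    /** Converts to one Character (BMP) or two Characters (surrogate pair). */
    Character[] toCharacters() {
        char[] units = Character.toChars(codePoint);
        Character[] result = new Character[units.length];
        for (int i = 0; i < units.length; i++) {
            result[i] = Character.valueOf(units[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        WCharacter w = new WCharacter('\uD801', '\uDC00');
        System.out.println(Integer.toHexString(w.intValue())); // prints 10400
        System.out.println(w.toCharacters().length);           // prints 2
    }
}
```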
String internal storage does not need to be changed: as strings
are immutable, they don't need to store their internal
characters as ints. However there may be a derived WString
class which contains additional methods to work with
WCharacters instead of Characters.
Submitted On 21-DEC-2002
hjain
Unicode 3.1 needs to be supported; it adds various
code points outside of the Unicode BMP (Basic
Multilingual Plane).
Java, so far, has gotten away with assuming all
characters will have 16-bit representations and
there will be no codings assigned outside of Plane 0.
That has been theoretically false for years but
in practice, it's been true - up until now.
This will now simply not work with Unicode 3.1 and it
will be necessary to add methods to query for
surrogates and get appropriate values (maybe as int)
for higher planes. Java can still use UTF-16 of
course but that does mean that sometimes 2 Java
chars will be needed to encode *one* Unicode
code point.
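For illustration, here is how that "two chars for one code point" situation looks with the surrogate and code point methods that later Java releases (J2SE 5.0 and up) provide. The class name SurrogateDemo and the helper fromCodePoint are made up for this sketch.

```java
// Illustration only: SurrogateDemo and fromCodePoint are hypothetical names.
class SurrogateDemo {
    /** Builds a String holding exactly one code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair.
        String s = fromCodePoint(0x10400);

        System.out.println(s.length());                             // prints 2
        System.out.println(s.codePointCount(0, s.length()));        // prints 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // prints true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // prints true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // prints 10400
    }
}
```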
The java.lang.Character class needs to be updated and
so do various stream and buffer classes (which currently
simply ignore and/or discard surrogate codings).
Unicode 3.1 is here *today*. Java has to add support
- and the sooner the better.
Submitted On 22-FEB-2003
verdy_p
What would prevent the Character class from containing an int
instead of a char?
This would keep the API clean, by just adding a new
constructor Character(int) allowing the application to pass
any 21-bit codepoint.
Another interesting constructor would be
Character(char, char) to pass a pair of surrogates.
The Character.methods() that return a char would have to be
extended with Character.methodsEx() returning an int.
This would simplify the programming if this was done in sync
with the String class, which could add a new method
String.getLength() to count actual characters (codepoints)
and not UTF-16 code units. We could have also a new parser
for strings, that would return Character instances and not
chars, and a method like String.characterAt(index) returning a
Character at the char index, and String.characterLen(index)
returning 2 for surrogates.
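The String methods suggested above can be sketched as static helpers on top of the code point methods found in J2SE 5.0 and later. The helper names follow the ones used in this comment (getLength, characterAt, characterLen) plus a hypothetical parse; as an assumption of this sketch they return int code points rather than Character instances, since the existing Character class only holds a char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers: none of these methods exist on java.lang.String.
class CodePointAccess {
    /** Number of code points, not UTF-16 code units. */
    static int getLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Code point starting at the given char index. */
    static int characterAt(String s, int charIndex) {
        return s.codePointAt(charIndex);
    }

    /** 2 if a surrogate pair starts at the char index, otherwise 1. */
    static int characterLen(String s, int charIndex) {
        return Character.charCount(s.codePointAt(charIndex));
    }

    /** Walks the string code point by code point. */
    static List<Integer> parse(String s) {
        List<Integer> codePoints = new ArrayList<>();
        for (int i = 0; i < s.length(); i += characterLen(s, i)) {
            codePoints.add(characterAt(s, i));
        }
        return codePoints;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x10400)) + "b";
        System.out.println(s.length());         // prints 4
        System.out.println(getLength(s));       // prints 3
        System.out.println(characterLen(s, 1)); // prints 2
        System.out.println(parse(s));           // prints [97, 66560, 98]
    }
}
```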
Internally Strings could now be stored with 24- or 32-bit
code units, and the char interface would be emulated. Also,
because Strings are immutable, other internal encodings are
possible, such as the Standard Compression Scheme for
Unicode (SCSU). After all, the past UTF-16 encoding is just a
convention, and it is not really visible to an application
which uses String.method() to access its contents. The only
issue is to preserve the legacy UTF-16 encoding for String
*constants* in the compiled binary class, but this can be
handled by the String constructor.
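To show what "emulating the char interface" over 32-bit storage might mean, here is a small sketch. Utf32Text is a hypothetical class, not a proposal for the real String internals; it assumes Java 8 or later for CharSequence.codePoints(), and its charAt does a linear scan for simplicity where a real implementation would need an index.

```java
// Hypothetical sketch: Utf32Text is not a JDK class.
final class Utf32Text implements CharSequence {
    private final int[] codePoints; // one int per Unicode code point

    Utf32Text(String s) {
        this.codePoints = s.codePoints().toArray();
    }

    /** Number of code points actually stored. */
    int codePointLength() {
        return codePoints.length;
    }

    /** Emulated UTF-16 length: supplementary code points count as two chars. */
    @Override
    public int length() {
        int units = 0;
        for (int cp : codePoints) {
            units += Character.charCount(cp);
        }
        return units;
    }

    /** Emulated UTF-16 access by char index (linear scan, kept simple). */
    @Override
    public char charAt(int index) {
        int start = 0;
        for (int cp : codePoints) {
            int width = Character.charCount(cp);
            if (index >= start && index < start + width) {
                return Character.toChars(cp)[index - start];
            }
            start += width;
        }
        throw new IndexOutOfBoundsException("index: " + index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }

    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x10400)) + "b";
        Utf32Text t = new Utf32Text(s);
        System.out.println(t.codePointLength());        // prints 3
        System.out.println(t.length());                 // prints 4
        System.out.println(t.charAt(1) == s.charAt(1)); // prints true
    }
}
```

The caller still sees UTF-16 chars through length() and charAt(), while the storage itself holds whole code points.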