Bug Database
Bug ID: 4533872
Votes: 2
Synopsis: Unicode supplementary character support (JSR-204)
Category: java:classes_lang
Reported Against: 1.2.2, tiger, merlin, 1.2beta3, hopper-beta
Release Fixed: 1.5(tiger-beta)
State: 10-Fix Delivered, request for enhancement
Priority: 2-High
Related Bugs: 4117567, 4304585, 4558928, 4607359, 4711607, 4897520, 4897829, 4900727, 4900747, 4904543, 4915683, 4914724, 4915107
Submit Date: 01-DEC-2001
Description
Support Unicode "supplementary characters" (as defined in http://www.unicode.org/glossary/) throughout the J2SE platform. This affects APIs and implementations for character properties, I/O, font rendering, and other areas. Supplementary characters cannot be handled by the existing APIs accepting or returning single 16-bit char values, so a new approach is necessary.

Once support for supplementary characters is defined, the Java platform can be upgraded to support the latest Unicode version (RFE 4640853).
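
To illustrate why single 16-bit char values are not enough, here is a minimal sketch that uses the int-based java.lang.Character methods available since J2SE 5.0. The class name and the choice of U+20000 as a sample code point are assumptions made for this example only.

```java
final class CharVersusCodePoint {
    // U+20000 is a CJK ideograph outside the Basic Multilingual Plane
    // (sample value chosen for illustration).
    static final int CODE_POINT = 0x20000;

    /** Returns true if any single UTF-16 code unit of the sample is classified as a letter. */
    static boolean anyCharIsLetter() {
        for (char c : Character.toChars(CODE_POINT)) {
            if (Character.isLetter(c)) {
                return true;
            }
        }
        return false;
    }

    /** Classifies the whole code point through the int-based overload. */
    static boolean codePointIsLetter() {
        return Character.isLetter(CODE_POINT);
    }

    public static void main(String[] args) {
        System.out.println("chars needed: " + Character.toChars(CODE_POINT).length);
        System.out.println("char-based isLetter: " + anyCharIsLetter());
        System.out.println("int-based isLetter:  " + codePointIsLetter());
    }
}
```

Each half of the surrogate pair is classified as a surrogate rather than a letter, so the property is only visible when the code point is passed as an int.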
Work Around
N/A
Evaluation
Tiger feature candidate.

This support should probably be at least Unicode 3.2 or whatever the pending Unicode version is at the anticipated release of Tiger.
 xxxxx@xxxxx  2001-12-13


This is an umbrella RFE which includes the following sub-RFEs:

4900727 - Core set API for supplementary character support
4900739 - Supplementary character support in BreakIterator
4859391 - Collation needs to support Unicode 4.0, including supplementary characters
4900747 - Supplementary character support in regular expression package
4888843 - Update Bidi for supplementary chars, Unicode 4.0
4900935 - Update String case conversion to Unicode 4.0
 xxxxx@xxxxx  2003-08-08


4904543 - Supplementary character support in StringTokenizer
 xxxxx@xxxxx  2003-08-11


4915683 - JSR-204: Additional API for supplementary character support
 xxxxx@xxxxx  2003-09-02
Comments
  
  Include a link with my name & email   

Submitted On 08-OCT-2002
verdy_p
More important than the support of additional character 
blocks is the support of new properties defined in technical 
documents of Unicode 3.2, as they give accurate information that 
is NEEDED now for all internationalized applications that 
target the existing supported languages. Notably: Chinese, 
Japanese, and Korean, which don't use the space character and 
for which we need a way in the GUI to display text 
accurately. The most notable quirks for now occur when 
handling multiline display. There's nothing in Java that handles 
line wraps or word wraps correctly (everything is based on the 
assumption that a space character is enough to delimit "word" 
tokens).
We really need new classes to parse and handle 
internationalized text correctly. This is a first step that could 
be implemented at least with limited support for CJK languages in 
JFC/Swing. A more formal and complete interface could be 
defined and discussed later in the core java.lang.Character 
family of classes, to give access to newly defined character 
properties (composition rules of diacritics for 
Indic/Thai/Hebrew/Arabic/Vietnamese languages, composition 
and input methods for Chinese/Japanese with 
Bopomofo/Hiragana/Katakana/Latin and external dictionaries, 
canonicalization and equivalences, global default Unicode 
collation and ordering, East Asian width and correspondence 
between half-width and full-width CJK characters, character 
directionality and BiDi text handling between a logical and 
physical encoding form for Hebrew and Arabic, line wrap 
opportunities using at least pairs of character classes, and 
their consequences for the BiDi algorithm...).
Technical papers in Unicode 3.2 often contain sample 
template code in C or Java to demonstrate the general 
principles of such text handling needed in applications, and 
that could then be used more generally within existing JFC/Swing 
lightweight components or in AWT text components.
Additionally, some TrueType/OpenType fonts made for Unicode 
now include such properties (notably in their kerning and 
composition tables, such as those provided by the freely 
downloadable Microsoft Typeset Library). But general support 
of Unicode in GUI applications requires an extended OpenType 
API to create logical composite fonts, made of a set of fonts 
that each contain a subset of the Unicode character set.
So this RFE targets distinct development paths in the future 
design of Tiger:
- core changes in java.lang.Character to handle Unicode 
properties
- core changes in java.lang.String to handle character 
composition and string composition
- GUI changes in AWT to handle new OpenType font 
properties
- GUI changes in JFC/Swing to use the new core 
String/Character properties and methods, and the new font 
properties, in order to better handle JTextArea and JEditorPane 
components with international text.
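
As a rough sketch of how line-wrap opportunities can be queried, the example below uses java.text.BreakIterator.getLineInstance. The class and method names are hypothetical, and the quality of the boundaries for CJK text depends on the break rules shipped with the JDK in use, so no expected output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LineBreakSketch {
    /**
     * Returns every offset (in UTF-16 code units) at which the JDK's
     * line-break rules for the given locale allow a line to be wrapped.
     */
    static List<Integer> lineBreakOffsets(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            offsets.add(b);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Japanese sample text without spaces (illustrative only).
        String text = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";
        System.out.println(lineBreakOffsets(text, Locale.JAPANESE));
    }
}
```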


Submitted On 26-NOV-2002
verdy_p
Why not base the Character class on int instead of char?
Couldn't there be two new constructors, such as (int) for 
UCS4/UTF-32 and (char, char) or (short, short) for UTF-16 
surrogate pairs?
Adding a new wchar native type would be tricky, as it would 
imply changes in the JVM to support new bytecodes. But 
extending the Character class (and the String class as well) 
should be possible.
This would create two sets of primitives for characters: one 
based on char UTF-16 code values, and one based on UCS4 
code points. They would have distinct constraints (because of 
the particular behavior of surrogate pairs).
If modifying the Character class is not possible, maybe we 
would need a new WCharacter class with semantics similar to 
the Character class, except that it would use an int 
implementation instead of a char, and that it could provide 
additional methods to handle the conversion from a 
WCharacter instance to one or two Character instances.

String internal storage does not need to be changed: as Strings 
are immutable, they don't need to store their internal 
characters as ints. However, there may be a derived WString 
class which contains additional methods to work with 
WCharacters instead of Characters.
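
The WCharacter idea above can be sketched as a small int-backed wrapper. This is a hypothetical class written for illustration, not an existing or proposed platform API; it relies only on static java.lang.Character helpers available since J2SE 5.0.

```java
final class WCharacterSketch {
    private final int codePoint;

    /** Wraps any valid Unicode code point (U+0000 to U+10FFFF). */
    WCharacterSketch(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        this.codePoint = codePoint;
    }

    /** Wraps the code point encoded by a UTF-16 surrogate pair. */
    WCharacterSketch(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        this.codePoint = Character.toCodePoint(high, low);
    }

    int intValue() {
        return codePoint;
    }

    /** Converts back to one or two UTF-16 code units. */
    char[] toChars() {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        WCharacterSketch clef = new WCharacterSketch('\uD834', '\uDD1E');
        System.out.println("U+" + Integer.toHexString(clef.intValue()).toUpperCase());
        System.out.println("code units: " + clef.toChars().length);
    }
}
```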


Submitted On 21-DEC-2002
hjain
Unicode 3.1 needs to be supported: it adds various
code points outside of the Unicode BMP (Basic
Multilingual Plane).

Java, so far, has gotten away with assuming all
characters will have 16-bit representations and
there will be no codings assigned outside of Plane 0.
That has been theoretically false for years but
in practice, it's been true - up until now.

This will now simply not work with Unicode 3.1 and it
will be necessary to add methods to query for
surrogates and get appropriate values (maybe as int)
for higher planes. Java can still use UTF-16 of
course, but that does mean that sometimes 2 Java
chars will be needed to encode *one* Unicode
code point.

The java.lang.Character class needs to be updated and
so do various stream and buffer classes (which currently
simply ignore and/or discard surrogate codings).

Unicode 3.1 is here *today*. Java has to add support
- and the sooner the better.
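
A minimal sketch of the kind of surrogate queries described above, using the java.lang.Character methods available since J2SE 5.0. The class and method names are hypothetical, and passing unpaired surrogates through unchanged is a choice made for this example.

```java
import java.util.Arrays;

final class SurrogateScan {
    /**
     * Decodes UTF-16 code units into code points. A high surrogate followed
     * by a low surrogate becomes one int; everything else, including an
     * unpaired surrogate, is passed through unchanged.
     */
    static int[] decode(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (Character.isHighSurrogate(c)
                    && i + 1 < units.length
                    && Character.isLowSurrogate(units[i + 1])) {
                out[n++] = Character.toCodePoint(c, units[i + 1]);
                i++; // skip the low surrogate that was just consumed
            } else {
                out[n++] = c;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] cps = decode("A\uD840\uDC00B".toCharArray());
        for (int cp : cps) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```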




Submitted On 22-FEB-2003
verdy_p
What would prevent the Character class from containing an int 
instead of a char?
This would keep the API clean, by just adding a new 
constructor Character(int) allowing the application to pass 
any 21-bit code point.
Another interesting constructor would be 
Character(char,char) to pass a pair of surrogates.
The Character.methods() that return a char would have to be 
extended with Character.methodsEx() returning an int.
This would simplify programming if it were done in sync 
with the String class, which could add a new method 
String.getLength() to count actual characters (code points) 
and not UTF-16 code units. We could also have a new parser 
for strings that would return Character instances and not 
chars, and a method like String.characterAt(index) returning a 
Character at the char index, and String.characterLen(index) 
returning 2 for surrogates.
Internally, Strings could now be stored with 24- or 32-bit 
code units, and the char interface would be emulated. Also, 
because Strings are immutable, other internal encodings are 
possible, such as the Standard Compression Scheme for 
Unicode (SCSU). After all, the past UTF-16 encoding is just a 
convention, but it is not really accessible to the application, 
which uses String.method() to access its contents. The only 
issue is to preserve the legacy UTF-16 encoding for String 
*constants* in the compiled binary class, but this can be 
handled by the String constructor.
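
The String operations proposed above can be approximated with methods that exist on java.lang.String and java.lang.Character since J2SE 5.0. The helper names below mirror the comment's proposals and are hypothetical; they return int code points rather than Character instances.

```java
final class CodePointStringOps {
    /** Counts code points, not UTF-16 code units (cf. the proposed String.getLength()). */
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Code point starting at the given char index (cf. String.characterAt(index)). */
    static int characterAt(String s, int charIndex) {
        return s.codePointAt(charIndex);
    }

    /** 2 if a surrogate pair starts at the char index, else 1 (cf. String.characterLen(index)). */
    static int characterLen(String s, int charIndex) {
        return Character.charCount(s.codePointAt(charIndex));
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + length(s));
        // Walk the string one code point at a time.
        for (int i = 0; i < s.length(); i += characterLen(s, i)) {
            System.out.println("U+" + Integer.toHexString(characterAt(s, i)).toUpperCase());
        }
    }
}
```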


