Bug Database
Bug ID: 6798514
Votes 0
Synopsis Charset UTF-8 accepts CESU-8 codings
Category java:char_encodings
Reported Against
Release Fixed
State 11-Closed, Will Not Fix, bug
Priority: 4-Low
Related Bugs
Submit Date 28-JAN-2009
Description
FULL PRODUCT VERSION :
C:\Programme\Java\jdk1.6.0_03\bin>java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Windows XP SR-2

A DESCRIPTION OF THE PROBLEM :
RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."

The current UTF-8 implementation does not protect against the invalid sequences "ED A0 80" through "ED BF BF". It decodes them to surrogate code units instead, as CESU-8 does.
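
To illustrate why this byte range is the problematic one, here is a minimal sketch of the standard three-byte UTF-8 bit arithmetic. The class and method names (SurrogateBytes, decodeThreeByte, encodesSurrogate) are hypothetical and are not part of the JDK; the sketch only shows that "ED A0 80" and "ED BF BF" map to the first and last surrogate code points.

```java
// Hypothetical helper, not JDK code: applies the three-byte UTF-8 bit layout
// 1110xxxx 10xxxxxx 10xxxxxx without any validity checks.
final class SurrogateBytes {

    // Combine the payload bits of a three-byte sequence into one code point.
    static int decodeThreeByte(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // True if the three bytes would produce a value in the surrogate range
    // U+D800..U+DFFF, which RFC 3629 does not allow in UTF-8.
    static boolean encodesSurrogate(int b1, int b2, int b3) {
        int codePoint = decodeThreeByte(b1, b2, b3);
        return codePoint >= 0xD800 && codePoint <= 0xDFFF;
    }

    public static void main(String[] args) {
        System.out.printf("ED A0 80 -> U+%04X%n", decodeThreeByte(0xED, 0xA0, 0x80));
        System.out.printf("ED BF BF -> U+%04X%n", decodeThreeByte(0xED, 0xBF, 0xBF));
        System.out.println("ED 9F BF is surrogate: " + encodesSurrogate(0xED, 0x9F, 0xBF));
    }
}
```

A strict decoder would reject any three-byte sequence for which a check like encodesSurrogate returns true.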

This may be by design, but it should at least be documented prominently, and any surrogate pairs that are created should be valid.


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1.) Decode the following byte sequence with the UTF-8 decoder: "ED A0 80 ED BF BF"
2.) Decode the following byte sequence with the UTF-8 decoder: "ED BF BF ED A0 80"
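
The two steps can be sketched as a small program. This is an illustrative reconstruction, not the reporter's original test case: the class name Utf8SurrogateCheck and the method decodeStrict are hypothetical, and only the java.nio.charset calls are real JDK API. The decoder is configured with CodingErrorAction.REPORT so that malformed input surfaces as an exception instead of being replaced. The result depends on the JDK in use; the report describes 1.6.0_03, and later releases may report these inputs as malformed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical reproduction harness for the two byte sequences in the report.
final class Utf8SurrogateCheck {

    // Decode with a strict UTF-8 decoder. Returns the decoded UTF-16 code units
    // as space-separated hex, or "malformed" if the decoder rejects the input.
    static String decodeStrict(byte[] input) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            CharBuffer out = decoder.decode(ByteBuffer.wrap(input));
            StringBuilder hex = new StringBuilder();
            while (out.hasRemaining()) {
                if (hex.length() > 0) {
                    hex.append(' ');
                }
                hex.append(String.format("%04X", (int) out.get()));
            }
            return hex.toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass and lands here as well.
            return "malformed";
        }
    }

    public static void main(String[] args) {
        byte[] highThenLow = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                              (byte) 0xED, (byte) 0xBF, (byte) 0xBF};
        byte[] lowThenHigh = {(byte) 0xED, (byte) 0xBF, (byte) 0xBF,
                              (byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println("1.) " + decodeStrict(highThenLow));
        System.out.println("2.) " + decodeStrict(lowThenHigh));
    }
}
```

Under the expected behavior both lines would print "malformed"; under the behavior described in this report they would print surrogate code units instead.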


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
1.) CoderResult.isMalformed()
2.) CoderResult.isMalformed()

ACTUAL -
1.) valid surrogate pair: U+D800 + U+DFFF
2.) invalid surrogate pair: U+DFFF + U+D800


REPRODUCIBILITY :
This bug can be reproduced always.
Posted Date : 2009-01-28 10:22:12.0
Work Around
N/A
Evaluation
The latest Unicode recommendation regarding this issue is at
http://www.unicode.org/versions/corrigendum1.html

which states:

"To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses."

The "non-shortest forms" of supplementary characters are still "allowed" to be decoded (though they are never generated when encoding). The UTF-8 charset implementation was updated recently (#4486841) to follow the recommendation.

The decision for now is that we are not going to update the implementation to prohibit the non-shortest forms for supplementary characters. We will reconsider this position should the Standard change or a new security concern arise in the future.
Posted Date : 2009-03-04 18:23:28.0