Bug Database
Bug ID: 4296969
Votes 4
Synopsis Incorrect behaviours of several character converters
Category java:char_encodings
Reported Against 1.3 , 1.2.2 , kestrel-beta
Release Fixed
State 11-Closed, Will Not Fix, bug
Priority: 4-Low
Related Bugs 4140796 , 4333733 , 4361835 , 4429358 , 4429369 , 4429377
Submit Date 05-DEC-1999
Description





12/5/99   xxxxx@xxxxx   -- kestrel RA produces errors for several of the codepages.  Submitting this to supplement existing encoding bugs open for kestrel.

/*
J:\borsotti\jtest>java -version
java version "1.2"
Classic VM (build JDK-1.2-V, native threads)

There are several problems with the character converters.
They can be summarized as follows:

  - converters which are listed in the jdk documentation,
    but do not exist,
  - converters which do not map all Unicode characters, or
    do not decode back (to) what they encoded,
  - converters which crash

This java program tests each converter in turn and reports
the errors found:
*/

import java.io.*;
import java.util.*;
public class EncErr {

    /**
     * This is the list of encodings reported in
     *
     * http://java.sun.com/products/jdk/1.2/docs/guide/internat/encoding.doc.html
     */

    private static String[] encodings = new String[] {
         "ASCII",            // ASCII
         "ISO8859_1",        // ISO 8859-1
         "ISO8859_2",        // ISO 8859-2
         "ISO8859_3",        // ISO 8859-3
         "ISO8859_4",        // ISO 8859-4
         "ISO8859_5",        // ISO 8859-5
         "ISO8859_6",        // ISO 8859-6
         "ISO8859_7",        // ISO 8859-7
         "ISO8859_8",        // ISO 8859-8
         "ISO8859_9",        // ISO 8859-9
         "Big5",             // Big5, Traditional Chinese
         "Cp037",            // USA, Canada(Bilingual, French), Netherlands, Portugal, Brazil, Australia
         "Cp1006",           //  customer  AIX Pakistan (Urdu)
         "Cp1025",           //  customer  Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovinia, Macedonia(FYR)
         "Cp1026",           //  customer  Latin-5, Turkey
         "Cp1046",           //  customer  Open Edition US EBCDIC
         "Cp1097",           //  customer  Iran(Farsi)/Persian
         "Cp1098",           //  customer  Iran(Farsi)/Persian (PC)
         "Cp1112",           //  customer  Latvia, Lithuania
         "Cp1122",           //  customer  Estonia
         "Cp1123",           //  customer  Ukraine
         "Cp1124",           //  customer  AIX Ukraine
         "Cp1250",           // Windows Eastern European
         "Cp1251",           // Windows Cyrillic
         "Cp1252",           // Windows Latin-1
         "Cp1253",           // Windows Greek
         "Cp1254",           // Windows Turkish
         "Cp1255",           // Windows Hebrew
         "Cp1256",           // Windows Arabic
         "Cp1257",           // Windows Baltic
         "Cp1258",           // Windows Vietnamese
         "Cp1381",           //  customer  OS/2, DOS People's Republic of China (PRC)
         "Cp1383",           //  customer  AIX People's Republic of China (PRC)
         "Cp273",            //  customer  Austria, Germany
         "Cp277",            //  customer  Denmark, Norway
         "Cp278",            //  customer  Finland, Sweden
         "Cp280",            //  customer  Italy
         "Cp284",            //  customer  Catalan/Spain, Spanish Latin America
         "Cp285",            //  customer  United Kingdom, Ireland
         "Cp297",            //  customer  France
         "Cp33722",          //  customer -eucJP - Japanese (superset of 5050)
         "Cp420",            //  customer  Arabic
         "Cp424",            //  customer  Hebrew
         "Cp437",            // MS-DOS United States, Australia, New Zealand, South Africa
         "Cp500",            // EBCDIC 500V1
         "Cp737",            // PC Greek
         "Cp775",            // PC Baltic
         "Cp838",            //  customer  Thailand extended SBCS
         "Cp850",            // MS-DOS Latin-1
         "Cp852",            // MS-DOS Latin-2
         "Cp855",            //  customer  Cyrillic
         "Cp857",            //  customer  Turkish
         "Cp860",            // MS-DOS Portuguese
         "Cp861",            // MS-DOS Icelandic
         "Cp862",            // PC Hebrew
         "Cp863",            // MS-DOS Canadian French
         "Cp864",            // PC Arabic
         "Cp865",            // MS-DOS Nordic
         "Cp866",            // MS-DOS Russian
         "Cp868",            // MS-DOS Pakistan
         "Cp869",            //  customer  Modern Greek
         "Cp870",            //  customer  Multilingual Latin-2
         "Cp871",            //  customer  Iceland
         "Cp874",            //  customer  Thai
         "Cp875",            //  customer  Greek
         "Cp918",            //  customer  Pakistan(Urdu)
         "Cp921",            //  customer  Latvia, Lithuania (AIX, DOS)
         "Cp922",            //  customer  Estonia (AIX, DOS)
         "Cp930",            // Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
         "Cp933",            // Korean Mixed with 1880 UDC, superset of 5029
         "Cp935",            // Simplified Chinese Host mixed with 1880 UDC, superset of 5031
         "Cp937",            // Traditional Chinese Host mixed with 6204 UDC, superset of 5033
         "Cp939",            // Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
         "Cp942",            // Japanese (OS/2) superset of 932
         "Cp948",            // OS/2 Chinese (Taiwan) superset of 938
         "Cp949",            // PC Korean
         "Cp950",            // PC Chinese (Hong Kong, Taiwan)
         "Cp964",            // AIX Chinese (Taiwan)
         "Cp970",            // AIX Korean
         "EUC_CN",           // GB2312, EUC encoding, Simplified Chinese
         "EUC_JP",           // JIS0201, 0208, 0212, EUC Encoding, Japanese
         "EUC_KR",           // KS C 5601, EUC Encoding, Korean
         "EUC_TW",           // CNS11643 (Plane 1-3), T. Chinese, EUC encoding
         "GBK",              // GBK, Simplified Chinese
         "ISO2022CN",        // ISO 2022 CN, Chinese
         "ISO2022CN_CNS",    // CNS 11643 in ISO-2022-CN form, T. Chinese
         "ISO2022CN_GB",     // GB 2312 in ISO-2022-CN form, S. Chinese
         "ISO2022JP",        // JIS0201, 0208, 0212, ISO2022 Encoding, Japanese
         "ISO2022KR",        // ISO 2022 KR, Korean
         "JIS0201",          // JIS 0201, Japanese
         "JIS0208",          // JIS 0208, Japanese
         "JIS0212",          // JIS 0212, Japanese
         "KOI8_R",           // KOI8-R, Russian
         "MS874",            // Windows Thai
         "MacArabic",        // Macintosh Arabic
         "MacCentralEurope", // Macintosh Latin-2
         "MacCroatian",      // Macintosh Croatian
         "MacCyrillic",      // Macintosh Cyrillic
         "MacDingbat",       // Macintosh Dingbat
         "MacGreek",         // Macintosh Greek
         "MacHebrew",        // Macintosh Hebrew
         "MacIceland",       // Macintosh Iceland
         "MacRoman",         // Macintosh Roman
         "MacRomania",       // Macintosh Romania
         "MacSymbol",        // Macintosh Symbol
         "MacThai",          // Macintosh Thai
         "MacTurkish",       // Macintosh Turkish
         "MacUkraine",       // Macintosh Ukraine
         "SJIS",             // Shift-JIS, Japanese
         "UTF8",             // UTF-8
         };

    /**
     * Test an encoding. The following tests are done:
     * <ol>
     * <li>the existence of the encoder
     * <li>the existence of the decoder
     * <li>each character which is defined in Unicode is encoded,
     *   and then the result is decoded. The number of characters which
     *   are not encoded, or are encoded into an empty sequence of octets,
     *   or are encoded into a sequence which, once decoded, produces
     *   a character different from both the original one and '?',
     *   is counted.
     * <li>several long strings are encoded and then decoded, and checked
     *   to be equal (apart from characters mapped into '?') to the original.
     * </ol>
     * The third and fourth steps are done only if the previous are successful.
     * In the last step, only characters which are encoded correctly are
     * used.
     *
     * @param      enc name of the encoding
     */

    private static void test(String enc){
        System.err.println("------ test ------- " + enc);

        // test existence of encoder

        boolean both = true;
        try {
            byte[] bb = new byte[] {0};
            String str = new String(bb,enc);
        } catch (UnsupportedEncodingException th){
            System.err.println("encoder " + enc + " not available");
            both = false;
        }

        // test existence of decoder

        try {
            byte[] bb = "abc".getBytes(enc);
        } catch (UnsupportedEncodingException th){
            System.err.println("decoder " + enc + " not available");
            both = false;
        }
        if (!both) return;

        // test mapping

        // remember which character is valid for the round-trip test

        boolean[] valid = new boolean[Character.MAX_VALUE+1];
        try {
            int nrEmpty = 0;
            int nrUnmapped = 0;
            int nrNoBack = 0;
            int nrDiffBack = 0;
            for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++){
                if (!Character.isDefined((char)c)) continue;
                valid[c] = true;
                String s = String.valueOf((char)c);
                byte[] bb = null;
                try {
                    bb = s.getBytes(enc);
                    if (bb.length == 0){
                        nrEmpty++;
                        valid[c] = false;
                        continue;
                    }
                } catch (InternalError tr){
                    nrUnmapped++;
                    valid[c] = false;
                    continue;
                }
                try {
                    String str = new String(bb,enc);
                    if (str.length() != 1){
                        nrNoBack++;
                        valid[c] = false;
                        continue;
                    }
                    if ((str.charAt(0) != (char)c) &&
                       (str.charAt(0) != '?')){
                        nrDiffBack++;
                        valid[c] = false;
                        continue;
                    }
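                } catch (InternalError tr){
                    // a failed decode also disqualifies the character
                    // from the round-trip test below
                    nrNoBack++;
                    valid[c] = false;
                }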
            }
            if (nrUnmapped > 0){
                System.err.println(enc + " has " + nrUnmapped + " unmapped characters");
            }
            if (nrEmpty > 0){
                System.err.println(enc + " has " + nrEmpty + " empty mapped characters");
            }
            if (nrNoBack > 0){
                System.err.println(enc + " does not convert back " + nrNoBack + " characters");
            }
            if (nrDiffBack > 0){
                System.err.println(enc + " converts back " + nrDiffBack + " characters into a different one");
            }
            if (nrDiffBack > Character.MAX_VALUE / 2) return;
        } catch (Throwable th){
            System.err.println("encoding " + enc + " mapping error " + th);
            th.printStackTrace(System.err);
        }

        // test round-trip

        trip: for (int k = 0; k < 100; k++){
            byte[] bb = null;
            char[] ca = new char[10000];
            Random r = new Random();
            for (int i = 0; i < ca.length; i++){
                do {
                    ca[i] = (char)r.nextInt(Character.MAX_VALUE);
                } while (!valid[ca[i]]);
            }
            String old = String.valueOf(ca);
            try {
                bb = old.getBytes(enc);
                if (bb == null){
                    System.err.println(enc + " empty encoding");
                    return;
                }
            } catch (InternalError th){
                System.err.println(enc + " round-trip decoding error");
                break trip;
            } catch (UnsupportedEncodingException th){
            }
            try {
                String str = new String(bb,enc);
                if (!old.equals(str)){
                    if (old.length() != str.length()){
                        System.err.println("encoding " + enc +
                            " round-trip " + old.length() +
                            " back to " + str.length());
                        break trip;
                    }
                    for (int i = 0; i < ca.length && i < str.length(); i++){
                        if ((old.charAt(i) != str.charAt(i)) &&
                            (str.charAt(i) != '?')){
                            System.err.println(enc + " round-trip compare error");
                            break trip;
                        }
                    }
                }
            } catch (InternalError th){
                System.err.println(enc + " round-trip encoding error ");
                break trip;
            } catch (UnsupportedEncodingException th){
            }
        }
    }

    /**
     * Tests all encodings. On all encodings the tests defined above
     * are performed. Moreover, some specific tests are done on ISO2022CN
     * and ISO2022KR.
     */

    public static void main(String[] args){

        for (int i = 0; i < encodings.length; i++){
            test(encodings[i]);
        }

        try {
            byte[] bb = new byte[] {(byte)0x1b, (byte)')',  (byte)'x'};
            String str = new String(bb,"ISO2022CN");
        } catch (Throwable th){
            System.err.println("ISO2022CN error " + th);
        }

        try {
            byte[] bb = new byte[] {(byte)0x1b, (byte)')',  (byte)'x'};
            String str = new String(bb,"ISO2022KR");
        } catch (Throwable th){
            System.err.println("ISO2022KR error " + th);
        }

    }
}

/*
When run, it reports a considerable number of errors.

Feel free to use it, and include in your test suite if you
like.
*/
(Review ID: 98558) 
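
The first two checks in the test above (does a converter exist in each direction?) can also be sketched with the java.nio.charset API. That API was added in J2SE 1.4, so it was not available in the releases this report was filed against. The class and method names below are illustrative and are not part of the original test.

```java
import java.nio.charset.Charset;

final class ConverterAvailability {

    /**
     * Hypothetical helper: reports whether the named encoding is missing,
     * can only decode (bytes to chars), or converts in both directions.
     */
    static String availability(String name) {
        try {
            if (!Charset.isSupported(name)) {
                return "missing";
            }
        } catch (IllegalArgumentException e) {
            // thrown for syntactically illegal charset names
            return "missing";
        }
        Charset cs = Charset.forName(name);
        // every Charset can decode; canEncode() tells whether it can also encode
        return cs.canEncode() ? "both" : "decode-only";
    }

    public static void main(String[] args) {
        String[] names = {"US-ASCII", "UTF-8", "ISO-2022-CN", "NoSuchEncoding"};
        for (String n : names) {
            System.out.println(n + ": " + availability(n));
        }
    }
}
```

Which names are reported as available depends on the runtime; only a small set of standard charsets is guaranteed on every Java platform.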
======================================================================




java version "1.2.2"
HotSpot VM (1.0.1, mixed mode, build g)

When using String to convert from native encodings to Unicode and back
again, different encodings behave erratically when dealing with characters for
which there is no direct match.  Specifically, some encodings indicate a
mismatch by mapping the character to '\u003F', to '\u001A', or even to no
character ('').  What is worse, multiple methods are used within a single
encoding.  In some cases, conversions throw undocumented exceptions.  The worst
behavior is when a conversion from Unicode to bytes and back again generates
neither an 'unknown' mapping nor an exception, but maps to an entirely different
character.

My general technique for identifying these bugs was to step through all the
Unicode characters for every encoding, and document the results.  For each
character, I'd convert from Unicode to byte[], and then from byte[] back to
Unicode.

'EX1' indicates an error converting from unicode to byte[].
'EX2' indicates an error converting from byte[] to unicode.
'UN1' indicates a mapping to the '\u003F' character.
'UN2' indicates a mapping to the '\u001A' character.
'UN3' indicates a mapping to no character ('').
'MIS' indicates a mapping from one character to an entirely different character
(other than UN1 or UN2).

(For the MacDingbat encoding, every mismatch mapping was to '\u271F'.)
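
The categories above can be expressed as a small per-character classifier. This is a sketch of the described technique, not the submitter's actual program (which is not included in the report); the class, enum and method names are assumptions, and it uses the java.nio.charset.Charset overloads of String, which postdate the JDK versions tested here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripClassifier {

    /** OK means the character survived the round trip unchanged. */
    enum Result { OK, EX1, EX2, UN1, UN2, UN3, MIS }

    static Result classify(char c, Charset cs) {
        String in = String.valueOf(c);
        byte[] bytes;
        try {
            bytes = in.getBytes(cs);           // Unicode -> byte[]
        } catch (RuntimeException | Error e) {
            return Result.EX1;
        }
        String back;
        try {
            back = new String(bytes, cs);      // byte[] -> Unicode
        } catch (RuntimeException | Error e) {
            return Result.EX2;
        }
        if (back.isEmpty()) return Result.UN3;                  // mapped to no character
        if (back.equals(in)) return Result.OK;
        if (back.equals("?")) return Result.UN1;                // '\u003F'
        if (back.length() == 1 && back.charAt(0) == (char) 0x1A) {
            return Result.UN2;                                  // '\u001A'
        }
        return Result.MIS;                                      // some other character
    }

    public static void main(String[] args) {
        int[] counts = new int[Result.values().length];
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            counts[classify((char) c, StandardCharsets.ISO_8859_1).ordinal()]++;
        }
        for (Result r : Result.values()) {
            System.out.println(r + " = " + counts[r.ordinal()]);
        }
    }
}
```

Counts produced on a current runtime will differ from the 1.2.2 figures listed here, since converter behavior has changed between releases.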

For 8859_1,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_2,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_3,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_4,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_5,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_6,	  EX1 = 0, EX2 = 0, UN1 = 64300, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_7,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_8,	  EX1 = 0, EX2 = 0, UN1 = 64293, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_9,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Big5,	  EX1 = 1024, EX2 = 0, UN1 = 50680, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteBig5]

For CNS11643,	  EX1 = 0, EX2 = 0, UN1 = 47696, UN2 = 0, UN3 = 1, MIS = 0.
For Cp037,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1006,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1025,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1026,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1046,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1097,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1098,	  EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1112,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1122,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1123,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1124,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1250,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1251,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1252,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1253,	  EX1 = 0, EX2 = 0, UN1 = 64272, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1254,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1255,	  EX1 = 0, EX2 = 0, UN1 = 64284, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1256,	  EX1 = 0, EX2 = 0, UN1 = 64263, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1257,	  EX1 = 0, EX2 = 0, UN1 = 64267, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1258,	  EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1381,	  EX1 = 0, EX2 = 0, UN1 = 55022, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1383,	  EX1 = 0, EX2 = 0, UN1 = 55517, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp273,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp277,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp278,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp280,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp284,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp285,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp297,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp33722,	  EX1 = 0, EX2 = 0, UN1 = 55140, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp420,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64263, UN3 = 1024, MIS = 0.
For Cp424,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64293, UN3 = 1024, MIS = 0.
For Cp437,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp500,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp737,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp775,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp838,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64260, UN3 = 1024, MIS = 0.
For Cp850,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp852,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp855,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp857,	  EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp860,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp861,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp862,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp863,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp864,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp865,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp866,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp868,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp869,	  EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp870,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp871,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp874,	  EX1 = 0, EX2 = 0, UN1 = 64291, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp875,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64261, UN3 = 1024, MIS = 0.
For Cp918,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp921,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp922,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp930,	  EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp930]

For Cp933,	  EX1 = 10888, EX2 = 0, UN1 = 53406, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp933]

For Cp935,	  EX1 = 9356, EX2 = 0, UN1 = 54990, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp935]

For Cp937,	  EX1 = 20075, EX2 = 0, UN1 = 44273, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp937]

For Cp939,	  EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp939]

For Cp942,	  EX1 = 0, EX2 = 0, UN1 = 55170, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp948,	  EX1 = 0, EX2 = 0, UN1 = 44305, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp949,	  EX1 = 0, EX2 = 0, UN1 = 54144, UN2 = 0, UN3 = 1024, MIS = 130.
For Cp950,	  EX1 = 0, EX2 = 0, UN1 = 44308, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp964,	  EX1 = 0, EX2 = 0, UN1 = 44278, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp970,	  EX1 = 0, EX2 = 0, UN1 = 55819, UN2 = 0, UN3 = 1024, MIS = 122.
For EUCJIS,	  EX1 = 1024, EX2 = 0, UN1 = 51372, UN2 = 0, UN3 = 0, MIS = 2.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_JP]

For GB2312,	  EX1 = 1024, EX2 = 0, UN1 = 56938, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_CN]

For GBK,	  EX1 = 1024, EX2 = 0, UN1 = 40443, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteGBK]

For ISO2022CN_CNS,	  EX1 = 7650, EX2 = 57885, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.ArrayIndexOutOfBoundsException]
	Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_CNS]

For ISO2022CN_GB,	  EX1 = 0, EX2 = 65535, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
	Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_GB]

For ISO2022KR,	  EX1 = 0, EX2 = 8224, UN1 = 0, UN2 = 0, UN3 = 57186, MIS = 0.
	Exc2: [java.lang.NullPointerException]

For JIS,	  EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 3, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteISO2022JP]

For JIS0208,	  EX1 = 1024, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 57632, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteJIS0208]

For KOI8_R,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For KSC5601,	  EX1 = 1024, EX2 = 0, UN1 = 56159, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_KR]

For MS874,	  EX1 = 0, EX2 = 0, UN1 = 64287, UN2 = 0, UN3 = 1024, MIS = 0.
For MacArabic,	  EX1 = 0, EX2 = 0, UN1 = 64281, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCentralEurope,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCroatian,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCyrillic,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacDingbat,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 1024, MIS = 64290.
For MacGreek,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For MacHebrew,	  EX1 = 0, EX2 = 0, UN1 = 64297, UN2 = 0, UN3 = 1024, MIS = 0.
For MacIceland,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacRoman,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacRomania,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacSymbol,	  EX1 = 0, EX2 = 0, UN1 = 64311, UN2 = 0, UN3 = 1024, MIS = 0.
For MacThai,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For MacTurkish,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For MacUkraine,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For SJIS,	  EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 0, MIS = 2.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteSJIS]

For UTF8,	  EX1 = 0, EX2 = 0, UN1 = 1024, UN2 = 0, UN3 = 0, MIS = 0.

I came across this bug while trying to convert between different encodings.  I
was trying to get some idea of the data loss, but because so many different
methods are used to indicate 'no mapping', this was made very difficult.  Much
of this would be addressed by bug 4241124.  I also read several bugs indicating
that not all encodings are 'reversible', which addresses many of the 'EX2'
errors.  However, what I cannot understand is how I can map a character from
Unicode to byte[] and back to Unicode, and get an entirely different character!
This must be an error in the underlying conversion tables.

I think that at the very least these inconsistencies between encodings should be
documented somewhere.  I had been under the impression that 'no mapping' would
be indicated by '?' in the native form, and by the SUBSTITUTE character in
Unicode.  I was not aware that some characters would be omitted in the
conversion, that different methods would be used to indicate 'no mapping'
within the same encoding, that all sorts of errors could be generated, or that
conversions were not reversible.
(Review ID: 100000)
======================================================================




java version "1.3.0rc1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0rc1-T)
Java HotSpot(TM) Client VM (build 1.3.0rc1-S, mixed mode)

1. Open the command prompt in Korean Windows 2000
(the default codepage is 949).
2. Run the code below as follows:
C:> java -Duser.language=en -Duser.region=US -classpath . ShowLocale
import java.util.Locale;
public class ShowLocale {
        public static void main(String[] args) {
                System.out.println("default locale is " + Locale.getDefault());
        }
}
3. The result is
default locale is ko_KR
whereas I would expect en_US.
4. If I change the codepage in the console prompt with
C:> chcp 1252
then everything works as expected:
C:> java -Duser.language=blah -Duser.region=YADDA -classpath . ShowLocale
The result is
default locale is blah_YADDA

This problem also occurs in JDK 1.2.2.
(Review ID: 102774)
======================================================================
Posted Date : 2005-07-22 03:26:06.0
Work Around




None. When a converter does not work, there is no way
to make it work.
======================================================================




I will be converting Strings character by character so that I can identify which
characters are 'troublesome', since it is not possible to isolate 'mapping'
problems when converting Strings with multiple characters.
(Review ID: 100000)
======================================================================
Evaluation
The test case in the bug description tests primarily round-trip conversion from Unicode to another encoding and back to Unicode. While it is desirable that such a round-trip conversion results in the original character(s), this can not generally be guaranteed. The anomalies reported by the test in some cases indicate real bugs, but in some other cases just reflect the idiosyncrasies of the various encodings we support. This evaluation looks at the reported anomalies case-by-case. The basis is the J2SDK 1.3.0 FCS-P build.

The converters that are not available are ISO2022CN (decoder), ISO2022CN_CNS (encoder), ISO2022CN_GB (encoder). This has already been reported as bug 4140796.

For all encodings, the test produces a complaint like "encoding ASCII round-trip 10000 back to 9997", indicating that round-trip conversion of a 10000-character Unicode string results in a slightly shorter string. This occurs primarily due to surrogate characters. The char-to-byte conversion in String uses the method CharToByteConverter.convertAny, which skips over any malformed input such as unpaired surrogate characters. If the test is modified to eliminate surrogate characters, replacing the String generation code
            Random r = new Random();
            for (int i = 0; i < ca.length; i++){
                do {
                    ca[i] = (char)r.nextInt(Character.MAX_VALUE);
                } while (!valid[ca[i]]);
            }
with
            char[] ca = new char[10000];
            Random r = new Random();
            for (int i = 0; i < ca.length; i++){
                do {
                    ca[i] = (char)r.nextInt(Character.MAX_VALUE);
                } while (!valid[ca[i]] || (ca[i] >= 0xD800 && ca[i] < 0xE000));
            }
then this error is no longer reported for most converters, the exceptions being Cp933, Cp949, and Cp970.
TO DO: The documentation of String.getBytes should be updated to document the handling of malformed input.
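
The length change caused by skipping malformed input can be reproduced in isolation. The sketch below does not use the sun.io CharToByteConverter classes discussed here; it assumes the later java.nio.charset API and asks a CharsetEncoder to ignore malformed input, which mimics the skipping behavior described for convertAny. The class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MalformedInputSkipping {

    /**
     * Encodes s while silently dropping malformed input (such as unpaired
     * surrogates), then decodes the bytes back into a String.
     */
    static String roundTripIgnoringMalformed(String s, Charset cs) {
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)        // skip, as described above
                .onUnmappableCharacter(CodingErrorAction.REPLACE); // unmapped chars become '?'
        try {
            ByteBuffer bytes = enc.encode(CharBuffer.wrap(s));
            return cs.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // not expected with IGNORE/REPLACE; surfaced unchecked just in case
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 'a', an unpaired high surrogate, 'b'
        String in = new String(new char[] {'a', (char) 0xD800, 'b'});
        String out = roundTripIgnoringMalformed(in, StandardCharsets.US_ASCII);
        System.out.println("round-trip " + in.length() + " back to " + out.length());
    }
}
```

With the unpaired surrogate dropped during encoding, the decoded string is one character shorter than the input, which is the same kind of discrepancy the test reports.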

The Cp933, Cp949, and Cp970 char-to-byte converters have code to combine sequences of Jamo in the Hangul Jamo block. In all cases I saw where the test reported "round-trip 10000 back to 9997" or similar, I found sequences of such characters in the input strings. A shortened result of the roundtrip in these cases is to be expected.

Other complaints:

Cp037 converts back 47145 characters into a different one
Cp1025 converts back 47145 characters into a different one
Cp1026 converts back 47145 characters into a different one
Cp1097 converts back 47144 characters into a different one
Cp1112 converts back 47145 characters into a different one
Cp1122 converts back 47145 characters into a different one
Cp1123 converts back 47145 characters into a different one
Cp273 converts back 47145 characters into a different one
Cp277 converts back 47145 characters into a different one
Cp278 converts back 47145 characters into a different one
Cp280 converts back 47145 characters into a different one
Cp284 converts back 47145 characters into a different one
Cp285 converts back 47145 characters into a different one
Cp297 converts back 47145 characters into a different one
Cp420 converts back 47153 characters into a different one
Cp424 converts back 47183 characters into a different one
Cp500 converts back 47145 characters into a different one
Cp838 converts back 47150 characters into a different one
Cp870 converts back 47145 characters into a different one
Cp871 converts back 47145 characters into a different one
Cp875 converts back 47151 characters into a different one
Cp918 converts back 47145 characters into a different one
For some of these EBCDIC converters, \u0085 converts back to \u000A; in all other cases unmapped characters convert back to \u001A.
The char-to-byte converters for these EBCDIC encodings do not set the subBytes to the EBCDIC question mark (0x6F), they retain the default value 0x3F. This byte value in EBCDIC represents the SUB control character, so the byte-to-char converter maps it back to \u001A.
In the case of \u0085, some char-to-byte converters use this to express the EBCDIC NL control character, while the corresponding byte-to-char converters treat NL as a synonym of LF.
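
The byte values involved can be checked directly by decoding them. The following is a minimal sketch that assumes the runtime ships an EBCDIC charset under the name "IBM037" (the java.nio name corresponding to Cp037); availability of extended charsets is not guaranteed, so the example checks for it first. The class and method names are illustrative.

```java
import java.nio.charset.Charset;

final class EbcdicSubstitution {

    /** Decodes a single byte with the named charset and returns the resulting char. */
    static char decodeByte(String charsetName, int b) {
        String s = new String(new byte[] {(byte) b}, Charset.forName(charsetName));
        return s.charAt(0);
    }

    public static void main(String[] args) {
        String name = "IBM037"; // assumed name; extended charsets may be absent
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        // 0x3F is the ASCII question mark, but the SUB control character in EBCDIC
        System.out.printf("0x3F decodes to U+%04X%n", (int) decodeByte(name, 0x3F));
        // 0x6F is the EBCDIC question mark
        System.out.printf("0x6F decodes to U+%04X%n", (int) decodeByte(name, 0x6F));
    }
}
```

This only exercises the byte-to-char direction; which substitution byte the char-to-byte side emits depends on the runtime and may differ from the 1.3.0 behavior evaluated here.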

Cp1381 converts back 3 characters into a different one
\u00B7 converts back to \u30FB
\u2014 converts back to \u2015
\u7AC2 converts back to \u30FB

Cp1383 converts back 7 characters into a different one
\u001A converts back to \u00A3
\u00B7 converts back to \u30FB
\u2014 converts back to \u2015
\u50FF converts back to \u00A3
\u8EA2 converts back to \u30FB
\uF83D converts back to \uFFE5
\uF83E converts back to \u4EDD

Cp33722 converts back 52 characters into a different one
\u2015 converts back to \u2014
\u2225 converts back to \u2016
\u4FE0 converts back to \u4FA0
\u525D converts back to \u5265
\u551E converts back to \u8749
\u555E converts back to \u5516
\u5699 converts back to \u565B
\u56CA converts back to \u56A2
\u5861 converts back to \u586B
\u5C5B converts back to \u5C4F
\u5C62 converts back to \u5C61
\u6414 converts back to \u63BB
\u6451 converts back to \u63B4
\u6522 converts back to \u6505
\u6805 converts back to \u67F5
\u688E converts back to \u688D
\u6D00 converts back to \u6D9C
\u6F1E converts back to \u9A28
\u6F51 converts back to \u6E8C
\u7006 converts back to \u6D9C
\u70FF converts back to \u4FA0
\u7130 converts back to \u7114
\u7626 converts back to \u75E9
\u79B1 converts back to \u7977
\u7C1E converts back to \u7BAA
\u7E48 converts back to \u7E66
\u7E61 converts back to \u7E4D
\u7E6B converts back to \u7E4B
\u8141 converts back to \u80FC
\u8346 converts back to \u834A
\u840A converts back to \u83B1
\u8523 converts back to \u848B
\u8741 converts back to \u5516
\u87EC converts back to \u8749
\u881F converts back to \u874B
\u8EC0 converts back to \u8EAF
\u8F91 converts back to \u2116
\u91AC converts back to \u91A4
\u91B1 converts back to \u9197
\u92CA converts back to \u565B
\u9830 converts back to \u982C
\u9839 converts back to \u983D
\u985A converts back to \u985B
\u9A52 converts back to \u9A28
\u9DD7 converts back to \u9D0E
\u9E7C converts back to \u9E78
\u9EB4 converts back to \u9EB9
\u9EB5 converts back to \u9EBA
\uF86F converts back to \u2116
\uFF0D converts back to \u2212
\uFF5E converts back to \u301C
\uFFE4 converts back to \uFFFD

Cp930 does not convert back 2 characters
\u000E maps back to empty string
\u000F maps back to empty string

Cp930 converts back 52 characters into a different one
\u0085 converts back to \u000A
\u00A6 converts back to \uFFE4
\u2014 converts back to \u2015
\u2016 converts back to \u2225
\u2212 converts back to \uFF0D
\u301C converts back to \uFF5E
\u4FE0 converts back to \u4FA0
\u525D converts back to \u5265
\u555E converts back to \u5516
\u5699 converts back to \u565B
\u56CA converts back to \u56A2
\u5861 converts back to \u586B
\u5C5B converts back to \u5C4F
\u5C62 converts back to \u5C61
\u6414 converts back to \u63BB
\u6451 converts back to \u63B4
\u6522 converts back to \u6505
\u688E converts back to \u688D
\u6BE1 converts back to \u5516
\u6D00 converts back to \u6D9C
\u6F51 converts back to \u6E8C
\u7006 converts back to \u6D9C
\u70FF converts back to \u4FA0
\u7130 converts back to \u7114
\u7626 converts back to \u75E9
\u79B1 converts back to \u7977
\u7C1E converts back to \u7BAA
\u7E48 converts back to \u7E66
\u7E61 converts back to \u7E4D
\u7E6B converts back to \u7E4B
\u8141 converts back to \u80FC
\u840A converts back to \u83B1
\u841D converts back to \u8749
\u841F converts back to \u874B
\u8523 converts back to \u848B
\u87EC converts back to \u8749
\u881F converts back to \u874B
\u8EC0 converts back to \u8EAF
\u8F91 converts back to \u2116
\u91AC converts back to \u91A4
\u91B1 converts back to \u9197
\u92CA converts back to \u565B
\u9830 converts back to \u982C
\u9839 converts back to \u983D
\u985A converts back to \u985B
\u9A52 converts back to \u9A28
\u9B7E converts back to \u9A28
\u9DD7 converts back to \u9D0E
\u9E7C converts back to \u9E78
\u9EB4 converts back to \u9EB9
\u9EB5 converts back to \u9EBA
\uF86F converts back to \u2116
Except for \u0085, these are all cases where two Unicode characters map to the same Cp930 byte sequence, as specified in either the official IBM mapping table or in the request for additional characters for Microsoft compatibility in RFE 4199599.

Cp933 has 10887 unmapped characters
InternalError thrown in char-to-byte conversion
Cp933 does not convert back 2 characters
\u000E maps back to empty string
\u000F maps back to empty string

Cp935 does not convert back 2 characters
\u000E maps back to empty string
\u000F maps back to empty string
Cp935 converts back 1 characters into a different one
\u0085 converts back to \u000A

Cp937 does not convert back 2 characters
\u000E maps back to empty string
\u000F maps back to empty string
Cp937 converts back 1 characters into a different one
\u0085 converts back to \u000A

Cp939 does not convert back 2 characters
\u000E maps back to empty string
\u000F maps back to empty string

Cp939 converts back 52 characters into a different one
\u0085 converts back to \u000A
\u00A6 converts back to \uFFE4
\u2014 converts back to \u2015
\u2016 converts back to \u2225
\u2212 converts back to \uFF0D
\u301C converts back to \uFF5E
\u4FE0 converts back to \u4FA0
\u525D converts back to \u5265
\u555E converts back to \u5516
\u5699 converts back to \u565B
\u56CA converts back to \u56A2
\u5861 converts back to \u586B
\u5C5B converts back to \u5C4F
\u5C62 converts back to \u5C61
\u6414 converts back to \u63BB
\u6451 converts back to \u63B4
\u6522 converts back to \u6505
\u688E converts back to \u688D
\u6BE1 converts back to \u5516
\u6D00 converts back to \u6D9C
\u6F51 converts back to \u6E8C
\u7006 converts back to \u6D9C
\u70FF converts back to \u4FA0
\u7130 converts back to \u7114
\u7626 converts back to \u75E9
\u79B1 converts back to \u7977
\u7C1E converts back to \u7BAA
\u7E48 converts back to \u7E66
\u7E61 converts back to \u7E4D
\u7E6B converts back to \u7E4B
\u8141 converts back to \u80FC
\u840A converts back to \u83B1
\u841D converts back to \u8749
\u841F converts back to \u874B
\u8523 converts back to \u848B
\u87EC converts back to \u8749
\u881F converts back to \u874B
\u8EC0 converts back to \u8EAF
\u8F91 converts back to \u2116
\u91AC converts back to \u91A4
\u91B1 converts back to \u9197
\u92CA converts back to \u565B
\u9830 converts back to \u982C
\u9839 converts back to \u983D
\u985A converts back to \u985B
\u9A52 converts back to \u9A28
\u9B7E converts back to \u9A28
\u9DD7 converts back to \u9D0E
\u9E7C converts back to \u9E78
\u9EB4 converts back to \u9EB9
\u9EB5 converts back to \u9EBA
\uF86F converts back to \u2116
Except for \u0085, these are all cases where two Unicode characters map to the same Cp939 byte sequence, as specified in either the official IBM mapping table or in the request for additional characters for Microsoft compatibility in RFE 4199599.
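
Many-to-one mappings of this kind can be listed mechanically. The following is a minimal sketch under stated assumptions: it uses the java.nio.charset API (newer than the JDKs discussed here), the class and method names are hypothetical, and the mappings in a current runtime may differ from those reported above. It groups every mappable BMP character by its encoded byte sequence and reports sequences shared by more than one character.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the original test program.
class CollisionFinder {
    /** Lists groups of mappable BMP characters that encode to the same byte sequence. */
    static List<String> findCollisions(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder enc = cs.newEncoder();
        Map<String, StringBuilder> byBytes = new HashMap<>();
        for (int i = 0; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c) || !enc.canEncode(c)) {
                continue; // skip lone surrogates and unmappable characters
            }
            StringBuilder key = new StringBuilder();
            for (byte b : String.valueOf(c).getBytes(cs)) {
                key.append(String.format("%02X", b));
            }
            byBytes.computeIfAbsent(key.toString(), k -> new StringBuilder())
                   .append(String.format(" U+%04X", i));
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, StringBuilder> e : byBytes.entrySet()) {
            if (e.getValue().length() > 7) { // more than one " U+XXXX" entry
                result.add(e.getKey() + " <-" + e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "Cp939";
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        for (String line : findCollisions(name)) {
            System.out.println(line);
        }
    }
}
```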

Cp942 converts back 45 characters into a different one
\u4FE0 converts back to \u4FA0
\u525D converts back to \u5265
\u551E converts back to \u8749
\u555E converts back to \u5516
\u5699 converts back to \u565B
\u56CA converts back to \u56A2
\u5861 converts back to \u586B
\u5C5B converts back to \u5C4F
\u5C62 converts back to \u5C61
\u6414 converts back to \u63BB
\u6451 converts back to \u63B4
\u6522 converts back to \u6505
\u688E converts back to \u688D
\u6D00 converts back to \u6D9C
\u6F1E converts back to \u9A28
\u6F51 converts back to \u6E8C
\u7006 converts back to \u6D9C
\u70FF converts back to \u4FA0
\u7130 converts back to \u7114
\u7626 converts back to \u75E9
\u79B1 converts back to \u7977
\u7C1E converts back to \u7BAA
\u7E48 converts back to \u7E66
\u7E61 converts back to \u7E4D
\u7E6B converts back to \u7E4B
\u8141 converts back to \u80FC
\u840A converts back to \u83B1
\u8523 converts back to \u848B
\u8741 converts back to \u5516
\u87EC converts back to \u8749
\u881F converts back to \u874B
\u8EC0 converts back to \u8EAF
\u8F91 converts back to \u2116
\u91AC converts back to \u91A4
\u91B1 converts back to \u9197
\u92CA converts back to \u565B
\u9830 converts back to \u982C
\u9839 converts back to \u983D
\u985A converts back to \u985B
\u9A52 converts back to \u9A28
\u9DD7 converts back to \u9D0E
\u9E7C converts back to \u9E78
\u9EB4 converts back to \u9EB9
\u9EB5 converts back to \u9EBA
\uF86F converts back to \u2116

Cp949 converts back 129 characters into a different one
\u1100 converts back to \uAC00
\u1101 converts back to \uAE4C
\u1102 converts back to \uB098
\u1103 converts back to \uB2E4
\u1104 converts back to \uB530
\u1105 converts back to \uB77C
\u1106 converts back to \uB9C8
\u1107 converts back to \uBC14
\u1108 converts back to \uBE60
\u1109 converts back to \uC0AC
\u110A converts back to \uC2F8
\u110B converts back to \uC544
\u110C converts back to \uC790
\u110D converts back to \uC9DC
\u110E converts back to \uCC28
\u110F converts back to \uCE74
\u1110 converts back to \uD0C0
\u1111 converts back to \uD30C
\u1112 converts back to \uD558
\u1117 converts back to \uE0D4
\u1118 converts back to \uE320
\u1135 converts back to \u25BC
\u113A converts back to \u3138
\u113B converts back to \u3384
\u1150 converts back to \u63C0
\u1151 converts back to \u660C
\u1154 converts back to \u6CF0
\u1158 converts back to \u7620
\u1159 converts back to \u786C
\u1161 converts back to \uAC00
\u1162 converts back to \uAC1C
\u1163 converts back to \uAC38
\u1164 converts back to \uAC54
\u1165 converts back to \uAC70
\u1166 converts back to \uAC8C
\u1167 converts back to \uACA8
\u1168 converts back to \uACC4
\u1169 converts back to \uACE0
\u116A converts back to \uACFC
\u116B converts back to \uAD18
\u116C converts back to \uAD34
\u116D converts back to \uAD50
\u116E converts back to \uAD6C
\u116F converts back to \uAD88
\u1170 converts back to \uADA4
\u1171 converts back to \uADC0
\u1172 converts back to \uADDC
\u1173 converts back to \uADF8
\u1174 converts back to \uAE14
\u1175 converts back to \uAE30
\u1176 converts back to \uAE4C
\u1177 converts back to \uAE68
\u1178 converts back to \uAE84
\u1179 converts back to \uAEA0
\u117A converts back to \uAEBC
\u117B converts back to \uAED8
\u117C converts back to \uAEF4
\u117D converts back to \uAF10
\u117E converts back to \uAF2C
\u117F converts back to \uAF48
\u1180 converts back to \uAF64
\u1181 converts back to \uAF80
\u1182 converts back to \uAF9C
\u1183 converts back to \uAFB8
\u1184 converts back to \uAFD4
\u1185 converts back to \uAFF0
\u1186 converts back to \uB00C
\u1187 converts back to \uB028
\u1188 converts back to \uB044
\u1189 converts back to \uB060
\u118A converts back to \uB07C
\u118B converts back to \uB098
\u118C converts back to \uB0B4
\u118D converts back to \uB0D0
\u118E converts back to \uB0EC
\u118F converts back to \uB108
\u1190 converts back to \uB124
\u1191 converts back to \uB140
\u1192 converts back to \uB15C
\u1193 converts back to \uB178
\u1194 converts back to \uB194
\u1195 converts back to \uB1B0
\u1196 converts back to \uB1CC
\u1197 converts back to \uB1E8
\u1198 converts back to \uB204
\u1199 converts back to \uB220
\u119A converts back to \uB23C
\u119B converts back to \uB258
\u119C converts back to \uB274
\u119D converts back to \uB290
\u119E converts back to \uB2AC
\u119F converts back to \uB2C8
\u11A0 converts back to \uB2E4
\u11A1 converts back to \uB300
\u11A2 converts back to \uB31C
\u11A8 converts back to \uAC01
\u11A9 converts back to \uAC02
\u11AB converts back to \uAC04
\u11AE converts back to \uAC07
\u11AF converts back to \uAC08
\u11B0 converts back to \uAC09
\u11B1 converts back to \uAC0A
\u11B2 converts back to \uAC0B
\u11B7 converts back to \uAC10
\u11B8 converts back to \uAC11
\u11B9 converts back to \uAC12
\u11BA converts back to \uAC13
\u11BB converts back to \uAC14
\u11BC converts back to \uAC15
\u11BD converts back to \uAC16
\u11BE converts back to \uAC17
\u11C0 converts back to \uAC19
\u11C1 converts back to \uAC1A
\u11C2 converts back to \uAC1B
\u11C3 converts back to \uAC1C
\u11C4 converts back to \uAC1D
\u11C7 converts back to \uAC20
\u11CB converts back to \uAC24
\u11D3 converts back to \uAC2C
\u11D4 converts back to \uAC2D
\u11D6 converts back to \uAC2F
\u11D7 converts back to \uAC30
\u11D8 converts back to \uAC31
\u11DF converts back to \uAC38
\u11E0 converts back to \uAC39
\u11E3 converts back to \uAC3C
\u11E7 converts back to \uAC40
\u11F2 converts back to \uAC4B
\u11F4 converts back to \uAC4D

Cp970 converts back 2480 characters into a different one
A large number map back to \u25C9, a small number to other characters.

EUC_JP converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E.
See JIS0201.

EUC_TW does not convert back 1 characters
\u0000 maps back to empty string.

ISO2022JP does not convert back 3 characters
\u000E, \u000F, \u001B map back to empty strings.

ISO2022KR has 37512 empty mapped characters
Includes half of high surrogate characters.

ISO2022KR does not convert back 1539 characters
\u000E, \u000F, \u001B map back to empty strings.
Plus half of high and all of low surrogate characters.

JIS0201 converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E.

JIS0208 does not convert back 40521 characters
They map back to empty string.

JIS0212 does not convert back 41333 characters
They map back to empty string.

MacDingbat converts back 47179 characters into a different one
MacDingbat doesn't have a question mark, so unmapped characters are arbitrarily mapped to 0x3F, which translates back to \u271F.

SJIS converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E.
See JIS0201.

ISO2022CN error java.lang.ArrayIndexOutOfBoundsException
Is thrown for an invalid escape sequence.

ISO2022KR error java.lang.ArrayIndexOutOfBoundsException
Is thrown for an invalid escape sequence.
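
A test harness can separate a cleanly reported coding error from a converter crash of this kind. The sketch below is only an illustration: it uses the java.nio.charset decoder API (not available in the JDKs discussed here), the class and method names are hypothetical, and the sample escape sequence in main is an arbitrary byte string chosen for the demo, not the one from the original test.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; not part of the original test program.
class StrictDecode {
    /** Classifies how the named charset's decoder reacts to the given input bytes. */
    static String classify(String charsetName, byte[] input) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return "OK";
        } catch (CharacterCodingException e) {
            return "REJECTED"; // the decoder reported the bad input cleanly
        } catch (RuntimeException e) {
            return "CRASH: " + e; // for example an ArrayIndexOutOfBoundsException
        }
    }

    public static void main(String[] args) {
        byte[] sample = { 0x1B, 0x24, 0x7F }; // ESC, '$', then an arbitrary byte (demo input)
        for (String name : new String[] { "ISO-2022-CN", "ISO-2022-KR" }) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + classify(name, sample));
            }
        }
    }
}
```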

  xxxxx@xxxxx   1999-12-08



JIS0201 and JIS0208 are standards that define fewer than 7000 characters.
The test is probably in error when it reports failures to convert back
over 40000 characters.
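
One way to avoid counting characters that an encoding never claimed to support is to restrict the round-trip test to characters the encoder reports as mappable, and to skip lone surrogates. This is a minimal sketch of that idea, assuming the java.nio.charset API (newer than the JDKs discussed here); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper; not part of the original test program.
class RoundTripCount {
    /** Returns { number of mappable BMP characters, number of round-trip mismatches }. */
    static int[] count(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder enc = cs.newEncoder();
        int mappable = 0;
        int mismatches = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c) || !enc.canEncode(c)) {
                continue; // not a character this encoding claims to map
            }
            mappable++;
            String s = String.valueOf(c);
            String back = new String(s.getBytes(cs), cs);
            if (!back.equals(s)) {
                mismatches++;
            }
        }
        return new int[] { mappable, mismatches };
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "ISO-8859-1";
        int[] r = count(name);
        System.out.println(name + ": " + r[0] + " mappable, " + r[1] + " mismatches");
    }
}
```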

Using an independent test of JIS0201, JIS0212,
and SJIS, 8 errors were detected:

CHECKING BYTE ARRAY TO STRING
PASS? CODE    IN    CHECK   OUT     COMMENT
FAIL  JIS0201  5C    00A5  005C  # YEN SIGN
FAIL  JIS0201  7E    203E  007E  # OVERLINE
FAIL  JIS0212  2237  007E  FF5E  # TILDE
FAIL  SJIS    5C    00A5  005C  # YEN SIGN
FAIL  SJIS    7E    203E  007E  # OVERLINE
FAIL  SJIS    815F  005C  FF3C  # REVERSE SOLIDUS
CHECKING STRING TO BYTE ARRAY
PASS? CODE    IN    CHECK   OUT     COMMENT
FAIL  JIS0212  007E  2237  7E    # TILDE
FAIL  SJIS    005C  815F  5C    # REVERSE SOLIDUS

Hex under "IN" is input to a conversion.  Hex in the "CHECK" column
is the expected output.  Hex under the "OUT" column is the actual output
using a 1.3 or 1.4 JDK.
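
A byte-to-char check of the kind tabulated above could be sketched as follows. This is not the independent test that was actually used; the class and method names are hypothetical, the report-line layout only approximates the table, and the java.nio.charset API it relies on is newer than JDK 1.3.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not the independent test referred to in this report.
class MappingCheck {
    /** Builds one report line for a byte-to-char check: status, charset, IN, CHECK, OUT. */
    static String checkDecode(String charsetName, byte[] in, int expected) {
        String out = new String(in, Charset.forName(charsetName));
        StringBuilder inHex = new StringBuilder();
        for (byte b : in) {
            inHex.append(String.format("%02X", b));
        }
        StringBuilder outHex = new StringBuilder();
        for (char c : out.toCharArray()) {
            outHex.append(String.format("%04X", (int) c));
        }
        boolean pass = out.length() == 1 && out.charAt(0) == expected;
        return String.format("%s  %s  %s  %04X  %s",
                pass ? "PASS" : "FAIL", charsetName, inHex, expected, outHex);
    }

    public static void main(String[] args) {
        System.out.println(checkDecode("US-ASCII", new byte[] { 0x5C }, 0x005C));
        if (Charset.isSupported("Shift_JIS")) {
            // Expected value taken from the YEN SIGN row of the table above.
            System.out.println(checkDecode("Shift_JIS", new byte[] { 0x5C }, 0x00A5));
        }
    }
}
```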

The test used a pair of code pages from ftp://www.unicode.org/ .  The
page for JIS0201 is for "JIS X 0201 (1976) to Unicode 1.1".  The
code page for SHIFTJIS says:
#   This table contains the data the Unicode Consortium has on how
#       Shift-JIS (a combination of JIS 0201 and JIS 0208) maps into Unicode
and is dated 8 March 1994.

  xxxxx@xxxxx   2000-08-07

JIS0208 required a separate test.  The test uses a download from
ftp://www.unicode.org/ and is for "JIS X 0208 (1990)".    This code page
lists conversions for 6879 characters.  There is only one character that
is incorrectly converted.

CHECKING BYTE ARRAY TO STRING
PASS? CODE    IN    CHECK   OUT     COMMENT
FAIL  JIS0208  2140  005C  FF3C  # REVERSE SOLIDUS
CHECKING STRING TO BYTE ARRAY
PASS? CODE    IN    CHECK   OUT     COMMENT
FAIL  JIS0208  005C  2140  5C    # REVERSE SOLIDUS

  xxxxx@xxxxx   2000-08-11

This bug is extremely broad in its analysis of the various character converters.
As already noted in the comments above, a number of the reported issues are not
bugs: they result either from fallback and compatibility mappings that were
requested for some of the converters, or from the original test case supplying
unpaired surrogates or unmappable characters.  In the interest of more efficient
bug management and tracking, I have isolated the real issues that need to be
addressed in the various converters and created bugs to track them.  Bugs
already exist for some of these issues.

Here is a concise summary of the issues pertinent to this broad bug report
and associated reference to the new bugIDs.

1. Cp930, Cp933, Cp935, Cp937, Cp939. These EBCDIC based encodings provide
   mappings for U+000E and U+000F back to native \x0E and \x0F respectively.
   These mappings should be removed and these should be unmappable chars.
   (see BugID: 4429358)
2. ISO2022CN and ISO2022KR converters throw a runtime exception
   (ArrayIndexOutOfBoundsException on jdk1.3).
   See BugID: 4429369
3. Some of the IBM converters contain char->byte mappings which now appear
   to be obsolete. These should be removed.
   See BugID: 4429377
4. Cp933 throws InternalError if it attempts to encode a solitary character.
   This is due to inadequate (4 byte) reservation of space for the SI/SO
   leading/trailing bytes in the encoded output.
   See BugID: 4333733
5. SJIS/EUC-JP/JIS0201/JIS0208 roundtrip issues.
   These have been previously addressed within bug: 4361835.
6. Handling of 'NEL' (U+0085) character conversion within EBCDIC encodings.
   Since 4159519 was fixed to handle new line/line feed handling on
   EBCDIC platforms we have a potential roundtrip code mapping conflict
   issue with the mapping of the control character U+0085. This needs
   to be resolved. BugID: TBD.
7. ISO2022CN (decoder), ISO2022CN_CNS (encoder), ISO2022CN_GB (encoder)
   not supported. Already captured in  4140796.

  Please use the bugIDs above for tracking the remaining identified
  constituent issues. This bug is being closed out in favor of the
  newly created and existing bugs.

  xxxxx@xxxxx   3/23/2001.
Posted Date : 2005-07-22 03:26:27.0
Comments
  

Submitted On 20-NOV-2000
jondrow
We have an application system that requires the round-trip conversion to result in the original characters.  User info written in the native language in
properties files is converted into UTF8 before the application reads it. The info is then converted back to the native language encoding before it is sent to
the user as email. We'd like users to receive emails in the native language encoding rather than in UTF8 format, and we do not want to dynamically read
info from properties files written in the native encoding.  It would be a show stopper for our application if the round-trip conversion cannot result
in the original characters.


