Java Solaris Communities Sun Store Join SDN My Profile Why Join?
 
Bug Database
Bug Detail
Quick Lists
Top 25 Bugs
Top 25 RFE's
Recently Closed Bugs
Printable Page Printable Page


Bug Database
Bug ID: 6795537
Votes 1
Synopsis UTF_8$Decoder returns wrong results
Category java:char_encodings
Reported Against
Release Fixed
State 3-Accepted, bug
Priority: 4-Low
Related Bugs
Submit Date 20-JAN-2009
Description
FULL PRODUCT VERSION :
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b43)
Java HotSpot(TM) Client VM (build 14.0-b10, mixed mode, sharing)

ADDITIONAL OS VERSION INFORMATION :
Windows XP SR-2

A DESCRIPTION OF THE PROBLEM :
UTF_8$Decoder returns wrong results, see results ...

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
run source

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1)
new byte[]{(byte)0xE1, (byte)0x40 ---> CoderResult.malformedForLength(1)
new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)

ACTUAL -
new byte[]{(byte)0xC0} ---> CoderResult..UNDERFLOW
new byte[]{(byte)0xE1, (byte)0x40 ---> CoderResult.UNDERFLOW
new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(2)


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
package java.nio.charset;

import java.lang.reflect.*;
import java.nio.*;

/**
 *
 * @author Ulf Zibis <Ulf.Zibis at  xxxxx .de>
 */
public class TestCharsetDecoder extends CharsetDecoder {

    public CharsetDecoder decoder;
    public Class<CharsetDecoder> decoderClass;
    public Method decodeArrayLoopMethod;
    public Method decodeBufferLoopMethod;

    public TestCharsetDecoder(CharsetDecoder decoder) throws Exception {
        super(decoder.charset(), decoder.averageCharsPerByte(), decoder.maxCharsPerByte());
        this.decoder = decoder;
        decoderClass = (Class<CharsetDecoder>)Class.forName(decoder.charset().getClass().getName()+"$Decoder");
        decodeArrayLoopMethod = decoderClass.getDeclaredMethod(
                "decodeArrayLoop", ByteBuffer.class, CharBuffer.class);
        decodeArrayLoopMethod.setAccessible(true);
        decodeBufferLoopMethod = decoderClass.getDeclaredMethod(
                "decodeBufferLoop", ByteBuffer.class, CharBuffer.class);
        decodeBufferLoopMethod.setAccessible(true);
    }

    public CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
        return decoder.decodeLoop(in, out);
    }

    public CoderResult decodeArrayLoop(ByteBuffer in, CharBuffer out) throws Exception {
        return (CoderResult)decodeArrayLoopMethod.invoke(decoder, in, out);
    }

    public CoderResult decodeBufferLoop(ByteBuffer in, CharBuffer out) throws Exception {
        return (CoderResult)decodeBufferLoopMethod.invoke(decoder, in, out);
    }
}

package sun.nio.cs;

import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.util.*;
import org.junit.*;
import static org.junit.Assert.*;
import static org.junit.Assume.*;
import org.junit.runner.*;
import org.junit.runners.*;
import org.junit.runners.Parameterized.*;

/**
 *
 * @author Ulf.Zibis @  xxxxx .de
 */
@RunWith(Parameterized.class)
public class UTF_8Test {

    // test parameters:
    private static final CoderResult MAL_1 = CoderResult.malformedForLength(1);
    private static final CoderResult MAL_2 = CoderResult.malformedForLength(2);
    private static final CoderResult MAL_3 = CoderResult.malformedForLength(3);
    private static final CoderResult UFLOW = CoderResult.UNDERFLOW;
    private static final Object[][] PARAMETERS = new Object[][]{
    // samples:
        /*   0 */ { new byte[]{(byte)0xC0, (byte)0x40}, 0, MAL_1 }, // !UTF, 'A'
        /*   1 */ { new byte[]{(byte)0xC1, (byte)0x41}, 0, MAL_1 }, // !UTF, 'B'
        /*   2 */ { new byte[]{(byte)0xC2, (byte)0x42}, 0, MAL_2 }, // UTF_21, 'C'
        /*   3 */ { new byte[]{(byte)0xC3, (byte)0xA0}, 2, UFLOW }, // UTF_21, UTF_22
        /*   4 */ { new byte[]{(byte)0xC0}, 0, MAL_1 }, // !UTF
        /*   5 */ { new byte[]{(byte)0xC1}, 0, MAL_1 }, // !UTF
        /*   6 */ { new byte[]{(byte)0xC2}, 0, UFLOW }, // UTF_21
        /*   7 */ { new byte[]{(byte)0xE0, (byte)0x80, (byte)0x80}, 0, MAL_2 }, // UTF_31, !UTF_32, UTF_33
        /*   8 */ { new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42}, 0, MAL_3 }, // UTF_31, UTF_32, 'C'
        /*   9 */ { new byte[]{(byte)0xE0, (byte)0xA0, (byte)0x80}, 3, UFLOW }, // UTF_31, UTF_32, UTF_33
        /*  10 */ { new byte[]{(byte)0xE1, (byte)0x80, (byte)0x81}, 3, UFLOW }, // UTF_31, UTF_32, UTF_33
        /*  11 */ { new byte[]{(byte)0xE1, (byte)0x40}, 0, MAL_2 }, // UTF_31, 'A'
        /*  12 */ { new byte[]{(byte)0xE0, (byte)0xA0}, 0, UFLOW }, // UTF_31, UTF_32
        /*  13 */ { new byte[]{(byte)0xE1}, 0, UFLOW }, // UTF_31
        /*  14 */ { new byte[]{(byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80}, 0, MAL_1 }, // 5 * !UTF
        /*  15 */ { new byte[]{(byte)0xC0, (byte)0xC0, (byte)0xC0, (byte)0xC0, (byte)0xC0}, 0, MAL_1 }, // 5 * !UTF
        /*  16 */ { new byte[]{(byte)0xC0, (byte)0xC0, (byte)0x42}, 0, MAL_1 }, // !UTF, !UTF, 'C'
//        /*   0 */ { new char[] {'\u0041', '\u0042', '\uFFFE'} },
//        /*   1 */ { new char[] {'\u0041', '\u0042', '\uFFFF'} },
    };
    private static final int PROCESS_SINGLE = -1;
    private static int parametersInTest = 0;
    // parameters:
    private static Charset cs;
    private static ByteBuffer inBytes;
    private static int expectedPos;
    private static CoderResult expected;
//    private static CharBuffer inChars;
    // temp:
    private TestCharsetDecoder decoder;
//    private CharsetEncoder encoder;
    boolean ok = false;
    // results:
    private CharBuffer outChars;
//    private ByteBuffer outBytes;

    public UTF_8Test(byte[] inBytes, int position, CoderResult result) throws IOException {
        byte[] previous = null;
        if (this.inBytes != null)
            previous = this.inBytes.array();

        cs = Charset.forName("UTF-8");
        this.inBytes = ByteBuffer.wrap(inBytes);
//        this.inChars = CharBuffer.wrap(inChars);
        expectedPos = position;
        expected = result;
        if (previous != inBytes) {
//            System.out.println("\n>>> PARAMETERS["+(parametersInTest++)+"]: "+Arrays.toString(inBytes));
            System.out.print("\n>>> PARAMETERS["+(parametersInTest++)+"]: [");
            for (int i=0; i<inBytes.length; i++)
                System.out.print(Integer.toBinaryString(inBytes[i]&0xFF)+(i<inBytes.length-1 ? ", " :""));
            System.out.println("]");
//            System.out.printf(" [%h, %h]%n", new Integer(3), new Integer(-3));
//            System.out.printf(" [%X, %X, %X, %X, %X]%n", inBytes);
//            System.out.printf(" [%X, %X, %X, %X, %X]%n", inBytes[0], inBytes[1]);
        }
    }

    @Parameters
    public static Collection data() {
        List parameters = new ArrayList(Arrays.asList(PROCESS_SINGLE < 0 ?
            PARAMETERS : new Object[]{PARAMETERS[PROCESS_SINGLE]}));
        return parameters;
    }

    @Before
    public void setUp() throws Exception {
        decoder = new TestCharsetDecoder(cs.newDecoder());
        outChars = CharBuffer.allocate((int)(inBytes.capacity()*decoder.maxCharsPerByte()));
        inBytes.rewind();
//        assumeTrue(cs.canEncode());
//        encoder = cs.newEncoder();
//        outBytes = ByteBuffer.allocate((int)(inChars.capacity()*encoder.maxBytesPerChar()));
    }

    @After
    public void tearDown() throws Exception {
        System.out.println(ok ? " > OK" : "");
    }

    @Ignore
    @Test
    public void testDecodeArrayLoop() throws Exception {
        System.out.print("NOW TEST > ["+decoder.decoderClass.getName()+"] decodeArrayLoop");
        CoderResult result = null;
        result = decoder.decodeArrayLoop(inBytes, outChars);
        int position = inBytes.position();
        assertEquals("position", expectedPos, position);
        assertEquals("decodeArrayLoop()", expected, result);
        if (result.isOverflow())
            result.throwException();
        ok = true;
    }

    @Ignore
    @Test
    public void testDecodeBufferLoop() throws Exception {
        System.out.print("NOW TEST > ["+decoder.decoderClass.getName()+"] decodeBufferLoop");
        CoderResult result = null;
        result = decoder.decodeBufferLoop(inBytes, outChars);
        int position = inBytes.position();
        assertEquals("position", expectedPos, position);
        assertEquals("decodeBufferLoop()", expected, result);
        if (result.isOverflow())
            result.throwException();
        ok = true;
    }

    @Ignore
    @Test
    public void testDecodeLoop() throws CharacterCodingException {
        System.out.print("NOW TEST > ["+decoder.decoderClass.getName()+"] decodeLoop");
        CoderResult result = null;
        result = decoder.decodeLoop(inBytes, outChars);
        int position = inBytes.position();
        assertEquals("position", expectedPos, position);
        assertEquals("decodeLoop()", expected, result);
        if (result.isOverflow())
            result.throwException();
        ok = true;
    }

    @Test
    public void testDecoder() throws CharacterCodingException {
        System.out.print("NOW TEST > ["+decoder.decoderClass.getName()+"] decode");
        CoderResult result = null;
        result = decoder.decode(inBytes, outChars, false);
        int position = inBytes.position();
        assertEquals("position", expectedPos, position);
//        assertEquals("decode()", expected, result); // for pessimistic estimation
        if (expected.isMalformed())
            assertEquals("decode()", MAL_1, result);
        else
            assertEquals("decode()", expected, result);
        if (result.isOverflow())
            result.throwException();
//        inBytes.rewind();
//        outChars.clear();
        result = decoder.decode(inBytes, outChars, true);
        position = inBytes.position();
        assertEquals("position", expectedPos, position);
        if (expected.isUnderflow() && inBytes.hasRemaining())
            assertEquals("decode()", CoderResult.malformedForLength(inBytes.remaining()), result);
        else if (expected.isMalformed())
            assertEquals("decode()", MAL_1, result);
        else
            assertEquals("decode()", expected, result);
        if (result.isOverflow())
            result.throwException();
        ok = true;
    }
}

---------- END SOURCE ----------
Posted Date : 2009-01-20 08:34:38.0
Work Around
N/A
Evaluation
(1) & (2) are the cases that the implementation checks the sl/sp first then
the validation. It would be ended of a "malformed" CR if there were bytes after the leading byte (with the assumption that we are going to have a possible 2 bytes form, then the validation fails...) . It looks like the current implementation brings us better performance and the fact that we have been doing this for years (read compatibility) I would prefer to keep it as-is.
new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1) vs
new byte[]{(byte)0xC0} ---> CoderResult..UNDERFLOW

(3)
new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)
vs
new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --->

we had similar discussion before regarding the "legnth" of malformed.
Posted Date : 2009-01-20 18:11:14.0
Comments
  
  Include a link with my name & email   

Submitted On 21-JAN-2009
UlfZibis
For (1) & (2) I have simple patch:
I additionally assume, it would be little faster, because
    else if (b1 < -0x3e)
should be little faster than
    else if ((b1 >> 5) == -2)
and
    if (isMalformed2(b1, b2))
could be replaced by
    if (isNotContinuation(b2))

        private CoderResult decodeArrayLoop(ByteBuffer src, CharBuffer dst) {

            final byte[] sa = src.array();
            int sp = src.arrayOffset() + src.position();
            final int sr = src.remaining();
            final int sl = sp + sr;

            final char[] da = dst.array();
            int dp = dst.arrayOffset() + dst.position();
            final int dr = dst.remaining();
            final int dl = dp + dr;

            final int slASCII = sp + sr < dr ? sr : dr;

            // ASCII only loop
            while (sp < slASCII && sa[sp] >= 0)
                da[dp++] = (char)sa[sp++];

            while (sp < sl) {
                final int b1 = sa[sp];
                if (b1 >= 0) {
                    ...
                } else if (b1 < -0x3e) {
                    return malformed(src, sp, dst, dp, 1);
                } else if (b1 < -0x20) {
                    ...
                    if (isNotContinuation(b2))
                        return malformed(src, sp, dst, dp, 2);
                    ...
                } else if (b1 < -0x10) {
                    ...
                } else if (b1 < -0x08) {
                    ...
                } else
                    return malformed(src, sp, dst, dp, 1);
            }
            return xflow(src, sp, sl, dst, dp, 0);
        }

I'm looking further for more enhancements:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/UTF_8.java?view=log


Submitted On 21-JAN-2009
UlfZibis
> we had similar discussion before regarding the "legnth" of malformed.
Yes, I remember well. But at that time you argued contrary. Martin said, the first value could be damaged by cosmic ray, so 2nd value could be ok, which would advise to return length 1.
Here in UTF-8 case you return length > 1, same as I argued from the beginning.


Submitted On 21-JAN-2009
UlfZibis
> the fact that we have been doing this for years (read compatibility)
I'm afraid, this is not true. Run my test on JDK 6 (I used u3). You will see the difference on sample 7.


Submitted On 31-JAN-2009
UlfZibis
I've updated my UTF_8Test, see:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/test/sun/nio/cs/?diff_format=s&rev=609


Submitted On 02-FEB-2009
UlfZibis
I have discovered some more wrong results, and repaired most of them :
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/test/sun/nio/cs/?diff_format=s&rev=617

Run UTF_8Test

I also could enhance speed by ~20% on my machine.
Try UTF_8Benchmark


Submitted On 03-FEB-2009
UlfZibis
Results on my machine for 4-byte decoding (note gain against current UTF_8_70$Decoder):
time for sun.nio.cs.UTF_8_60$Decoder: 2701 ms
time for sun.nio.cs.UTF_8_70$Decoder: 1984 ms
time for sun.nio.cs.UTF_8_new$Decoder: 1720 ms  // gain: 264 ms
time for sun.nio.cs.UTF_8_last$Decoder: 2110 ms


Submitted On 04-FEB-2009
UlfZibis
added mixed 1-byte/2-byte UTF-8 code unit pattern, ratio: 18:1 (realistic for most european languages):
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/test/sun/nio/cs/UTF_8Benchmark.java?diff_format=s&rev=620&view=markup


Submitted On 06-FEB-2009
UlfZibis
I have fixed all wrong results, except for ESU-8 code units.
(See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6798514)

https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/test/sun/nio/cs/?diff_format=s&rev=623


Submitted On 24-FEB-2009
UlfZibis
regarding my 1st comment from 21-JAN-2009:
    else if (b1 < (byte)0xc2)
should be more clear than
    else if (b1 < -0x3e)



PLEASE NOTE: JDK6 is formerly known as Project Mustang