Set the default encoding on the command line to force the use of the old sun.io
EUC-JP-LINUX converter, e.g.,
% java -Dfile.encoding='^AEUC-JP-LINUX' Foo
where ^A represents the ASCII character control-A, i.e., \u0001. This causes
the old sun.io converter for EUC-JP-LINUX to be used whenever the default
encoding is required, thereby preventing the recursive provider lookups which
cause the reported problem.
Note that the system property "file.encoding" is implementation-private. The
redefinition of this property is not, in general, guaranteed to work, and will
likely fail to work in J2SE 1.5 or later releases.
-- ###@###.### 2003/10/5
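
To check which default encoding a JVM actually picked up after applying the workaround above, a small probe can help. This is a minimal sketch, not part of the original workaround; the class and method names are made up for illustration. It prints the value of the "file.encoding" property next to the encoding reported by a writer created without an explicit charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;

// Hypothetical helper, for illustration only.
class DefaultEncodingProbe {
    // Raw value of the implementation-private property; the runtime is not
    // obliged to honor a redefinition of it.
    static String propertyValue() {
        return System.getProperty("file.encoding");
    }

    // Encoding name reported by a writer that was created without naming a
    // charset, i.e. the encoding the runtime chose as its default.
    static String effectiveWriterEncoding() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding property:  " + propertyValue());
        System.out.println("default writer encoding: " + effectiveWriterEncoding());
    }
}
```

If the two values disagree, the property redefinition was most likely not honored by that release.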
The analysis given by the submitter is on the right track, but the change
required is more than a simple matter of making the Charset.isSupported and
.forName methods atomic. That would not actually solve the reported problem.
The suggested fix would mask the problem, but would fail in a future release
when the old sun.io converters are removed.
The root cause of this bug is the fact that a platform's default charset cannot
be loaded via the charset-provider mechanism. The default charset is used to
translate filenames from Java UTF-16 strings into platform-specific strings.
The provider mechanism itself needs to translate filenames in order to discover
providers, hence a provider cannot provide the charset which is needed to
discover and load itself. This is why the lookup code in the Charset class
disallows recursive provider lookups.
In 1.4.1 and later releases the EUC-JP-LINUX charset is provided by the
sun.nio.cs.ext.ExtendedCharsets provider. In contexts in which EUC-JP-LINUX is
the default charset (e.g., LC_ALL=ja_JP on Linux) it would seem that this
charset should appear to be unsupported, but in fact it works much of the time.
The reason for this is the existence in the 1.4.x releases of a dual charset
lookup mechanism which falls back to the old sun.io converters when a charset
is not supported by the java.nio.charset APIs.
To see how this works, consider the following example. The evaluation of the
expression "a".getBytes("EUC-JP-LINUX") first causes the code in the internal
java.lang.StringCoding class to invoke the Charset.isSupported() method to see
if that charset is supported. EUC-JP-LINUX is not a standard charset, so the
lookup code in java.nio.charset.Charset tries to look it up via the provider
mechanism. This lookup eventually results in a recursive invocation of the
String.getBytes method on the same thread, this time to encode the filename of
the charsets.jar file into EUC-JP-LINUX (since it's the default charset), which
in turn results in a recursive provider lookup. This fails, since such lookups
are disallowed, hence the String.getBytes method falls back to the old sun.io
EUC-JP-LINUX converter. The initial provider lookup then succeeds, since it
uses the old converter to encode the filename.
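
The sequence just described can be replayed with a small simulation. This is a sketch only: the class, its methods, and the trace strings are hypothetical stand-ins for StringCoding, Charset.isSupported, and the sun.io converter, and none of the real JDK code is reproduced here. It assumes a per-thread flag is what blocks recursive provider lookups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the 1.4.x dual lookup; not the actual JDK code.
class DualLookupSketch {
    static final List<String> trace = new ArrayList<>();

    // Set while the current thread is inside a provider lookup.
    private static final ThreadLocal<Boolean> inProviderLookup =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Stand-in for String.getBytes(name): prefer the java.nio.charset
    // implementation, otherwise fall back to the old converter.
    static String encode(String text, String charsetName) {
        if (isSupported(charsetName)) {
            trace.add("nio:" + text);
            return "nio";
        }
        trace.add("legacy:" + text);
        return "legacy";
    }

    // Stand-in for Charset.isSupported on a non-standard charset: a provider
    // lookup that is refused when it is already in progress on this thread.
    static boolean isSupported(String charsetName) {
        if (inProviderLookup.get()) {
            trace.add("recursive-lookup-refused");
            return false;
        }
        inProviderLookup.set(Boolean.TRUE);
        try {
            // Finding the provider means encoding a file name in the default
            // charset, which is the very charset being looked up.
            encode("charsets.jar", charsetName);
            return true;
        } finally {
            inProviderLookup.set(Boolean.FALSE);
        }
    }

    public static void main(String[] args) {
        encode("a", "EUC-JP-LINUX");
        System.out.println(trace);
    }
}
```

The trace shows the inner lookup being refused, the file name being encoded by the legacy path, and only then the original string being encoded by the java.nio.charset path.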
On a multiprocessor this scheme can break down if the timing is just right. As
observed by the submitter, the Charset class contains a global cache of the
most recently-returned charset. At the end of the scenario described above
this cache will hold a reference to the EUC-JP-LINUX charset. If one thread
causes the EUC-JP-LINUX charset to be removed from the cache in between another
thread's invocations of the Charset.isSupported and .forName methods during the
recursive provider lookup, then an UnsupportedCharsetException will be thrown.
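
The interleaving can be replayed deterministically, without real threads, by issuing the calls in the problematic order. This is a sketch under stated assumptions: the one-element cache, the "recursive" flag, and all names below are hypothetical simplifications of the Charset lookup code, and "ISO-2022-JP" merely stands for any other charset that a second thread happens to look up.

```java
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical model of a one-element charset cache; not the actual JDK code.
class SingleEntryCacheSketch {
    // Name of the most recently returned charset.
    private static String cached;

    static void primeCache(String name) {
        cached = name;
    }

    // Returns the name if it resolves, else null. 'recursive' marks a call
    // made from inside a provider lookup, where further provider lookups
    // are disallowed and only a cache hit can succeed.
    static String lookup(String name, boolean recursive) {
        if (name.equals(cached)) return name;
        if (recursive) return null;
        cached = name; // a successful lookup replaces the cache entry
        return name;
    }

    static boolean isSupported(String name, boolean recursive) {
        return lookup(name, recursive) != null;
    }

    static String forName(String name, boolean recursive) {
        String cs = lookup(name, recursive);
        if (cs == null) throw new UnsupportedCharsetException(name);
        return cs;
    }

    public static void main(String[] args) {
        primeCache("EUC-JP-LINUX");
        // Thread A, inside the recursive lookup: the check succeeds via the cache.
        System.out.println(isSupported("EUC-JP-LINUX", true));
        // Thread B: an unrelated lookup evicts the cached entry.
        lookup("ISO-2022-JP", false);
        // Thread A again: the cache no longer helps, so this call throws.
        forName("EUC-JP-LINUX", true);
    }
}
```

The check and the subsequent use are separate calls, so any eviction between them turns a successful isSupported into a failing forName.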
The solution suggested by the submitter will solve the problem, but at the cost
of a synchronization operation and in a way that will fail when the sun.io
converters are removed in a future release. A better solution is to recognize
this fundamental limitation of the charset-provider mechanism and "hardwire"
the ExtendedCharsets provider into the java.nio.charset.Charset lookup logic.
The diffs for this change are in the suggested-fix section of this bug report.
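
The actual diffs are not reproduced here. The sketch below only illustrates the idea of a hardwired lookup order, with hypothetical names and with plain functions standing in for the standard charsets, the ExtendedCharsets provider, and providers discovered through provider-descriptor files.

```java
import java.util.function.Function;

// Hypothetical illustration of a hardwired lookup order; not the real fix.
class HardwiredLookupSketch {
    // Returns a description of the charset, or null if it is unsupported.
    static String lookup(String name,
                         Function<String, String> standard,
                         Function<String, String> hardwiredExtended,
                         Function<String, String> discoveredProviders,
                         boolean providerLookupInProgress) {
        String cs = standard.apply(name);
        if (cs != null) return cs;

        // Consulted directly: no provider-descriptor file is read and no
        // file name is encoded, so this step is safe even while another
        // provider lookup is in progress on the same thread.
        cs = hardwiredExtended.apply(name);
        if (cs != null) return cs;

        // Discovery through descriptor files stays subject to the
        // no-recursion rule.
        if (providerLookupInProgress) return null;
        return discoveredProviders.apply(name);
    }
}
```

With this ordering the default charset resolves without any fallback to the sun.io converters, while third-party providers are still found through the ordinary mechanism.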
An alternative solution would be to rework the sun.misc.Service code so that it
does not load provider-descriptor files via URLs. It does this only because
that's the only way to load multiple resource files of the same name. Since
charset providers are, by definition, already on the class path, there's really
no need to do another permission check on each provider-descriptor file as is
currently done by the clumsy JarURLConnection code. This solution would,
however, most likely require a more complex and risky set of changes to the
Service, JarURLConnection, and (possibly) java.lang.ClassLoader classes, hence
it is not proposed here.
-- ###@###.### 2003/10/5
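
For reference, the remark above about loading multiple resource files of the same name corresponds to the URL-based enumeration offered by ClassLoader.getResources. The sketch below is not the sun.misc.Service code; the class and method names are hypothetical. It only lists every charset provider-descriptor file visible to a class loader, each of which comes back as a URL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, for illustration only.
class ProviderFileListing {
    // Descriptor file name documented for charset providers.
    static final String RESOURCE =
            "META-INF/services/java.nio.charset.spi.CharsetProvider";

    // Collects the URL of every resource with the given name on the
    // loader's search path; several jars may each contribute one.
    static List<URL> findAll(ClassLoader loader, String resourceName) {
        List<URL> urls = new ArrayList<>();
        try {
            Enumeration<URL> e = loader.getResources(resourceName);
            while (e.hasMoreElements()) {
                urls.add(e.nextElement());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (URL u : findAll(ClassLoader.getSystemClassLoader(), RESOURCE)) {
            System.out.println(u);
        }
    }
}
```

Opening any jar-based URL returned this way goes through the JarURLConnection path that the evaluation refers to.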