Bug ID:
|
4939847
|
|
Votes
|
0
|
|
Synopsis
|
java.net.URI should have loose/tolerant/compatibility option (or allow reuse)
|
|
Category
|
java:classes_net
|
|
Reported Against
|
1.4.2
|
|
Release Fixed
|
|
|
State
|
11-Closed, Not a Defect, request for enhancement
|
|
Priority:
|
4-Low
|
|
Related Bugs
|
|
|
Submit Date
|
17-OCT-2003
|
|
Description
|
A DESCRIPTION OF THE REQUEST :
There's the way URIs are officially specified -- and then the way they're used on the net. The class java.net.URI is great, but rejects a lot of URIs as illegal which are in common use and can be successfully handled/visited by popular web browsers.
For example, on one of Sun's own websites...
http://sunsolve.sun.com/pub-cgi/show.pl?target=home
... you can find the link...
http://supportforum.sun.com/cgi-bin/WebX.cgi? xxxxx@xxxxx ^0@/os.sunlinux.general
... which java.net.URI complains is illegal -- due to the '^' character. It's right -- but IE/Mozilla/etc. can successfully follow this link. In fact, they often even keep the unescaped illegal character in the HTTP request URI-line they send.
Any general effort to parse or visit URIs (for analysis, archival, whatever) will run into hundreds of such URIs -- and if java.net.URI chokes on them, the use of this class must eventually be discarded in favor of a looser custom solution.
In creating such a solution, it would be nice to be able to subclass java.net.URI, or otherwise reuse its existing parsing and character-classification routines (method match(), etc.), up to the point where they have to be loosened... but because the class is final, and the relevant methods private, getting partial usefulness out of java.net.URI when it is otherwise too strict is also hard.
JUSTIFICATION :
Such technically-illegal-but-usually-functional URIs appear all over... any moderately sized crawl of the web will encounter hundreds of such URIs. Other dominant net software, such as web browsers, already tolerate such URIs. So to match commonly expected behavior, and support real-world net applications, an option for loosening the URI class's behavior would prove helpful to many internet developers.
There is already precedent for such behavior, in java.net.URI and popular third-party packages. (The comments of java.net.URI make note of a number of necessary deviations from RFC 2396. Also, the customer HTTPClient library can be set to accept cookies which are technically illegal -- because such cookies are quite common.)
The multiple-argument/auto-escaping URI constructors are insufficient, as they would require outside-URI pre-parsing, and still add escaping sugar which common web browsers don't.
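As a rough illustration of the point about the multiple-argument constructors, the sketch below feeds the msn.com example from this report through the four-argument constructor. The class and method names are made up for the example, and the scheme, host and path were split by hand, which is exactly the pre-parsing step the caller would have to do.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class MultiArgConstructorDemo {

    /** Builds the msn.com example from its parts; the split was done by hand. */
    public static URI build() {
        try {
            // Arguments are scheme, host, path, fragment.
            return new URI("http", "sc.msn.com", "/2{/RC_U@A8BP]48B,MYL9{Z]-.jpg", null);
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI u = build();
        // The string form carries percent-escapes that were not in the original link.
        System.out.println("as string: " + u);
        System.out.println("raw path:  " + u.getRawPath());
        // getPath() decodes the escapes again, giving back the text that was passed in.
        System.out.println("path:      " + u.getPath());
    }
}
```

The instance is usable, but its string form is no longer the byte-for-byte link that a browser would send.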
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Possibilities:
* java.net.URI gets two subclasses, one which is strict and one which is configurably loose (in the path, query, or fragment portions); applications which are willing to be loose always expect the common superclass
* java.net.URI constructors include a strict/loose flag; instances remember if they are strict or loose
In any case, it should be possible to create technically-illegal URIs when such are required by Java applications, even though strictness is a customer default.
ACTUAL -
java.net.URI throws URISyntaxExceptions whenever commonly-used and usually-functional but against-the-spec characters appear in strings passed to it to construct URIs.
---------- BEGIN SOURCE ----------
// this is an actual URI from the msn.com homepage which is
// well-handled by IE/Mozilla/etc -- and sent in HTTP requests
// without escaping -- but impossible to shoehorn into a
// java.net.URI
java.net.URI u = new java.net.URI("http://sc.msn.com/2{/RC_U@A8BP]48B,MYL9{Z]-.jpg");
---------- END SOURCE ----------
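The behavior described above can be reproduced with a small check along these lines. It is only a sketch: the class name and helper are invented for illustration, and the second sample is a hypothetical URL with a caret in the query, standing in for the support-forum link whose exact text is not preserved in this report.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictUriCheck {

    /** Returns null when java.net.URI accepts the string, otherwise the parser's complaint. */
    public static String rejectionReason(String candidate) {
        try {
            new URI(candidate);
            return null;
        } catch (URISyntaxException e) {
            // getReason() and getIndex() report what the parser objected to and where.
            return e.getReason() + " at index " + e.getIndex();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            // From the report: braces and a closing bracket in the path.
            "http://sc.msn.com/2{/RC_U@A8BP]48B,MYL9{Z]-.jpg",
            // Hypothetical: a caret in the query string.
            "http://example.com/cgi-bin/WebX.cgi?14^0@/os.general",
            // From the report: a fully legal URI, for comparison.
            "http://sunsolve.sun.com/pub-cgi/show.pl?target=home"
        };
        for (String s : samples) {
            String reason = rejectionReason(s);
            System.out.println((reason == null ? "accepted: " : "rejected (" + reason + "): ") + s);
        }
    }
}
```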
CUSTOMER SUBMITTED WORKAROUND :
We may be writing our own URI class soon. Boy, it would be nice to be able to reuse some of the java.net.URI parsing and character-class stuff when we do this.
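Short of writing a new class, one narrow workaround is to let java.net.URI point at each character it rejects, percent-escape that character, and retry. The sketch below assumes that approach; LenientUris and parseLeniently are hypothetical names, not part of any JDK API. Note that this is not the loose mode requested above: the resulting URI carries escapes that browsers would not add, so it only helps where an escaped form is acceptable.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class LenientUris {

    /** Hypothetical helper: escapes each character java.net.URI rejects, then retries. */
    public static URI parseLeniently(String input) {
        String s = input;
        // Each round escapes at most one character; the bound keeps hopeless input from looping.
        for (int round = 0; round <= input.length(); round++) {
            try {
                return new URI(s);
            } catch (URISyntaxException e) {
                int i = e.getIndex();
                if (i < 0 || i >= s.length()) {
                    // The parser did not point at a character we can escape.
                    throw new IllegalArgumentException(e);
                }
                int cp = s.codePointAt(i);
                StringBuilder escaped = new StringBuilder();
                // Percent-escape the UTF-8 bytes of the offending character.
                for (byte b : new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)) {
                    escaped.append(String.format("%%%02X", b & 0xFF));
                }
                s = s.substring(0, i) + escaped + s.substring(i + Character.charCount(cp));
            }
        }
        throw new IllegalArgumentException("could not repair: " + input);
    }

    public static void main(String[] args) {
        System.out.println(parseLeniently("http://sc.msn.com/2{/RC_U@A8BP]48B,MYL9{Z]-.jpg"));
    }
}
```

The helper does not distinguish between kinds of syntax error, so input that is broken in other ways (a bad scheme, for example) simply ends in an IllegalArgumentException.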
(Incident Review ID: 190762)
======================================================================
|
|
Work Around
|
N/A
|
|
Evaluation
|
The URI class has been designed to follow the specs to the letter.
The URL class is a lot more "flexible" when it comes to the specifications, and for the examples given here it would handle them.
Also, it is clearly documented that strings passed to URI constructors should be encoded first. So if we use the following code:
java.net.URI u = new java.net.URI("http://sc.msn.com/" + java.net.URLEncoder.encode("2{/RC_U@A8BP]48B,MYL9{Z]-.jpg", "ISO-8859-1"));
it works just fine.
So I don't think there is a need for a 3rd way to handle URI/URLs which would clutter the API.
jean- xxxxx@xxxxx 2003-10-21
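The remark that the URL class handles the reported example can be checked with a sketch like the following. The class and method names are illustrative only. It also shows that the tolerance stops at the boundary between the two classes: converting the URL to a URI runs into the same strict parser.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlVersusUri {

    /** Parses with java.net.URL, which does not validate path characters. */
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs but still available
    public static URL parseLoosely(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns true only if the URL also passes java.net.URI's strict parsing. */
    public static boolean convertsToUri(URL url) {
        try {
            url.toURI();
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        URL url = parseLoosely("http://sc.msn.com/2{/RC_U@A8BP]48B,MYL9{Z]-.jpg");
        System.out.println("URL path: " + url.getPath());
        System.out.println("converts to URI: " + convertsToUri(url));
    }
}
```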
|
|
Comments
|
Submitted On 14-JUN-2004
gojomo
(1) The URI class has NOT been designed to follow specs to the letter -- as noted in my report, the comments and javadoc in java.net.URI note numerous deviations from specs as necessary to match common practice.
(2) Example code given by 'jean' DOES NOT WORK. It encodes syntactically-significant characters as well (':', '//', '/') and results in an unusable URI instance.
(3) The point of the submission has been missed: real-world web clients and servers generate and use lots of 'illegal' URIs. Mozilla/IE and major web servers aren't sticklers for encoding, neither insisting on it nor patching it on the fly. (The same certainly goes for other non-HTTP URI-using domains, for example P2P apps.) For Java to play in this world, it would be nice if programmers had the option of handling URIs according to looser practice, rather than official spec. And if you don't want to explicitly enable this, you could at least stop making it hard to add as a subclass by removing some of the private & final modifiers on java.net.URI.
(See also similar issue #5049974.)
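To make point (2) concrete, here is a sketch of what java.net.URLEncoder does to the msn.com example, both when the whole string is encoded and when only the part after the host is. The class and method names are invented for the example. URLEncoder performs form encoding, so it escapes ':' and '/' along with the genuinely illegal characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

public class EncoderPitfall {

    /** Form-encodes a string the way the evaluation suggests. */
    public static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String path = "2{/RC_U@A8BP]48B,MYL9{Z]-.jpg";

        // Encoding the whole string escapes "://" too, so it parses as a bare relative path.
        URI whole = URI.create(formEncode("http://sc.msn.com/" + path));
        System.out.println("scheme: " + whole.getScheme() + ", host: " + whole.getHost());

        // Encoding only the path parses, but the inner '/' becomes %2F,
        // so the two path segments collapse into one.
        URI pathOnly = URI.create("http://sc.msn.com/" + formEncode(path));
        System.out.println("raw path: " + pathOnly.getRawPath());
    }
}
```

In the second case the URI is syntactically valid, yet it is not the same resource identifier as the original link, since a server may treat %2F differently from a literal slash.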
Submitted On 21-JUN-2004
jcc
Point taken about the code sample.
I've updated the example so that it is now correct. But the reasoning stands.