Java Solaris Communities Sun Store Join SDN My Profile Why Join?
 
Bug Database
Bug Detail
Quick Lists
Top 25 Bugs
Top 25 RFE's
Recently Closed Bugs
Printable Page Printable Page


Bug Database
Bug ID: 5028425
Votes 0
Synopsis (coll) Collection.retainAll(Collection) implementations are not optimized
Category java:classes_util
Reported Against 1.4.2
Release Fixed
State 6-Fix Understood, request for enhancement
Priority: 4-Low
Related Bugs
Submit Date 07-APR-2004
Description




FULL PRODUCT VERSION :
java version "1.4.2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-b28)
Java HotSpot(TM) Server VM (build 1.4.2-b28, mixed mode)

java version "1.5.0-beta2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-beta2-b45)
Java HotSpot(TM) Client VM (build 1.5.0-beta2-b45, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
 customer  Windows 2000 [Version 5.00.2195]

A DESCRIPTION OF THE PROBLEM :
Collection.retainAll(Collection) implementations aren't optimized.
When a sufficiently large collection is used as an argument, the
performance is pathologically slow.  (See attached program results)

There are several factors that could be taken into account to optimize the
performance of the retainAll(Collection) method.  The default implementation
provided by AbstractCollection could, modify its behavior depending on
the size of the specified collection and the cost of its containment test.

The attached code demonstrates a very simple way to avoid the current
pathological worst case (by creating a HashSet).  If the cost of constructing an unnecessary HashSet is to be avoided, an interface could be defined indicating that a Collection has an O(1) contains() method.

The retainAll(Collection) method can be viewed as an operation on two collections.

The concrete Collection implementations (HashSet, TreeSet, ArrayList, etc...)
have at their disposal.
- the sizes of both collections
- the element insertion speed of the implementing collection
- the element deletion speed of the implementing collection
- the containment test speed of the implementing collection
Additionally, the following information is likely to be known
- the element insertion speed of the argument collection
- the element deletion speed of the argument collection
- the containment test speed of the argument collection

So, not only could the implementation provided by AbstractCollection be
greatly improved, but the implementations provided by the concrete
implementations could be improved beyond that.

Similar optimizations should be made for the containsAll(Collection) and
removeAll(Collection) methods, too.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Invoke retainAll(collection) on a collection, using an argument with a relatively slow (>O(1)) containment check.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The execution speed of x.retainAll(y) should be no greater than O(x.size() + y.size()).
ACTUAL -
The execution spped of x.retainAll(y) can be as slow as O(x.size() * y.size()).
Partial execution results follow:

java.util.ArrayList took 32 ms  for 512
java.util.LinkedList took 0 ms  for 512
java.util.HashSet took 0 ms  for 512
java.util.TreeSet took 0 ms  for 512
Test$FastSet took 16 ms  for 512
java.util.ArrayList took 0 ms  for 1024
java.util.LinkedList took 0 ms  for 1024
java.util.HashSet took 0 ms  for 1024
java.util.TreeSet took 0 ms  for 1024
Test$FastSet took 0 ms  for 1024
java.util.ArrayList took 47 ms  for 2048
java.util.LinkedList took 15 ms  for 2048
java.util.HashSet took 16 ms  for 2048
java.util.TreeSet took 31 ms  for 2048
Test$FastSet took 16 ms  for 2048
java.util.ArrayList took 63 ms  for 4096
java.util.LinkedList took 31 ms  for 4096
java.util.HashSet took 47 ms  for 4096
java.util.TreeSet took 47 ms  for 4096
Test$FastSet took 31 ms  for 4096
java.util.ArrayList took 203 ms  for 8192
java.util.LinkedList took 156 ms  for 8192
java.util.HashSet took 156 ms  for 8192
java.util.TreeSet took 141 ms  for 8192
Test$FastSet took 0 ms  for 8192
java.util.ArrayList took 891 ms  for 16384
java.util.LinkedList took 703 ms  for 16384
java.util.HashSet took 719 ms  for 16384
java.util.TreeSet took 719 ms  for 16384
Test$FastSet took 16 ms  for 16384
java.util.ArrayList took 11515 ms  for 32768
java.util.LinkedList took 11047 ms  for 32768
java.util.HashSet took 10031 ms  for 32768
java.util.TreeSet took 9281 ms  for 32768
Test$FastSet took 62 ms  for 32768
java.util.ArrayList took 98015 ms  for 65536
java.util.LinkedList took 92234 ms  for 65536
java.util.HashSet took 91921 ms  for 65536
java.util.TreeSet took 94109 ms  for 65536
Test$FastSet took 78 ms  for 65536
java.util.ArrayList took 492825 ms  for 131072
java.util.LinkedList took 475794 ms  for 131072
java.util.HashSet took 476044 ms  for 131072
java.util.TreeSet took 473700 ms  for 131072
Test$FastSet took 281 ms  for 131072
...

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.*;

public class Test {

private static class FastSet extends HashSet {

    public FastSet(Collection collection) {
        super(collection);
    }

    /**
     * A very easy way to avoid the worst case performance.
     * There is still <b>lots</b> of room for improvement here.
     */
    public boolean retainAll(Collection collection) {
        if (collection.size() > 1000)
	        collection = new HashSet(collection);
	    return super.retainAll(collection);
    }
} // Fast Set

private static void testRetainAll(Collection full, Collection half) {
    int  size  = full.size();
    long start = System.currentTimeMillis();
    full.retainAll(half);
    long end   = System.currentTimeMillis();
    long time  = end - start;
    String name = full.getClass().getName();
    System.out.println(name + " took " + time + " ms " + " for " + size);
}

private static void testRetainAll(int size) {

    // create a list of size integers
    List full = new ArrayList(size);
    for (int i=0; i<size; i++)
        full.add(new Integer(i));

    // create a random list half the size of the full set
    List half = new ArrayList(full);
    java.util.Collections.shuffle(half);
    while (half.size() > full.size() / 2 )
        half.remove(0);

    testRetainAll(new ArrayList(full),half);
    testRetainAll(new LinkedList(full),half);
    testRetainAll(new HashSet(full),half);
    testRetainAll(new TreeSet(full),half);
    testRetainAll(new FastSet(full),half);
}

public static void main(String[] args) {
    int x=512;
    for (int i=0; i<28; i++) {
        testRetainAll(x);
        x *= 2;
    }
}

}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Workarounds:

a) Extend HashSet, TreeSet, etc.. to avoid the problem, and use the replacements, instead.
b) Make sure the collection arguments are transformed, when needed.  In other words, use something like this:
  x.retainAll(new HashSet(y));
c) Write an efficient library method like:
  public static Collection retainAll(Collection a, Collection b);
(Incident Review ID: 227973) 
======================================================================
Work Around
N/A
Evaluation
Some optimizations along the lines suggested might be justified.  Performance work on general-purpose collections is a tricky business.  It's not clear whether to optimize for worst-case or expected case behavior.  The reporter claims that the implementation is likely to know the cost of containment tests on the argument collection, but there is no reliable way to do this (in general).  It would be possible to add a marker interface akin to RandomAccess that allows a collection to say whether it has a "quick" containment test (whatever that means; probably O(log n) or better).

More generally it's not clear where our perforamnce tuning resources are best spent.  There is a relatively easy workaround for this one: if you're going to do removeAll or retainAll and the argument collection is large doesn't have a quick containment test, copy it yourself:

   foo.removeAll(new HashSet(baz));

  xxxxx@xxxxx   2004-04-13
Comments
  
  Include a link with my name & email   

Submitted On 14-APR-2004
CurtCox
Clearly determining where to optimize is a tricky art.  I
defend my claim about what is likely to be known, however. 
Most collections given as arguments will be the ones from
java.util.*.

With that said, adding a marker interface is superior to
hardcoding tests for specific classes, since third parties
such as Jakarta could implement the supplied interface where
appropriate.
http://jakarta.apache.org/commons/collections/

Two marker interfaces, geared towards describing the
performance characteristics of collection implementations,
don't seem out of line.  On the other hand, it could be the
start of a proliferation.  Why not define a Big O metadata
standard that could be used to annotate any
method?  That way implementations of anything (not just
collections) could code defensively to avoid pathological
performance, if desired.
http://en.wikipedia.org/wiki/Big_O_notation

Better still, it would greatly aid in the development of a
tool that could detect potential pathological performance
problems at compile-time, and allow the developer to decide
then whether special case handling is warranted. 

Now, about the worst-case verses expected-case trade off. 
The size test given in the example solution, would have a
slight impact on expected-case behavior.  The impact on
worst-case behavior is very dramatic, however.

I think its generally better to prevent pathological
performance cases where possible, because they are so hard
to anticipate beforehand.


Submitted On 06-OCT-2004
rkippen
http://forum.java.sun.com/thread.jsp?forum=426&thread=560633



PLEASE NOTE: JDK6 is formerly known as Project Mustang