Bug ID: 4481344 Performance of small image copies to screen slower in 1.4 than 1.1 (win32)

Details
Type: Bug
Submit Date: 2001-07-18
Status: Open
Updated Date: 2001-08-31
Project Name: JDK
Resolved Date:
Component: client-libs
OS: windows_nt
Sub-Component: 2d
CPU: x86
Priority: P4
Resolution: Unresolved
Affected Versions: 1.4.0
Targeted Versions:

Related Reports
Relates:

Sub Tasks

Description
This bug is being filed to take over the small-image part of bug 4276423.
That original bug was filed against all image copies to the screen in JDK 1.2.
Since then, we have implemented faster image copies through hardware accelerated
images (on win32) and have seen substantial improvements in these operations
(vs. jdk 1.1, 1.2, and 1.3).

However, there is still room for performance improvement at small
(20x20 and smaller) image sizes.  Here are some performance numbers, in
pixels per second (pps), from testing done for that other bug report:

jdk1.1:
	20x20		37,724,773
	100x100		36,661,107
	300x300		36,808,163

jdk1.2:
	20x20		3,631,375
	100x100		31,629,213
	300x300		43,824,489

jdk1.3:
	20x20		3,728,680
	100x100		33,635,275
	300x300		42,096,774

jdk1.4:
	20x20		5,750,953
	100x100		108,602,065
	300x300		165,263,578

We should determine the bottlenecks for small image performance and try
to eliminate them.
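
For readers who want to reproduce this kind of measurement, here is a minimal
sketch of how a pixels-per-second figure can be computed.  It is not the test
program used for this report (that program is not shown here).  The class and
method names are hypothetical, it is written for a modern JDK
(System.nanoTime() did not exist in 1.4), and it copies into an offscreen
BufferedImage rather than to the screen so that it can run without a display.
The absolute numbers it prints are therefore not comparable to the ones above.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class SmallImageCopyBench {
    /** Pixels copied per second for `copies` blits of a w x h image. */
    static double pixelsPerSecond(int w, int h, long copies, long elapsedNanos) {
        return (double) w * h * copies / (elapsedNanos / 1e9);
    }

    /** Times `copies` drawImage calls of a size x size image into dst. */
    static double measure(BufferedImage dst, int size, int copies) {
        // Assumption: an opaque RGB source image; the original test's image type is unknown.
        BufferedImage src = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            long start = System.nanoTime();
            for (int i = 0; i < copies; i++) {
                g.drawImage(src, 0, 0, null);
            }
            // Guard against a zero elapsed time on very coarse clocks.
            long elapsed = Math.max(1, System.nanoTime() - start);
            return pixelsPerSecond(size, size, copies, elapsed);
        } finally {
            g.dispose();
        }
    }

    public static void main(String[] args) {
        BufferedImage dst = new BufferedImage(300, 300, BufferedImage.TYPE_INT_RGB);
        for (int size : new int[] {20, 100, 300}) {
            measure(dst, size, 1000); // warm-up pass, result discarded
            System.out.printf("%dx%d\t%,.0f pps%n", size, size, measure(dst, size, 10000));
        }
    }
}
```

To test copies to the screen instead, the Graphics2D would have to come from a
visible component, which is where the Windows-specific paths discussed in the
evaluation come into play.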

                                    

Comments
EVALUATION

I will include the Evaluation from bug 4276423, since most of that
information is about the performance of small images:

I recently got the following results on my PIII-dual 866 NT4 system (video
card ATI Rage Pro Turbo), at 32 bits per pixel:

jdk1.1:
	20x20 		16,909,090 pps
	100x100		22,268,000 pps
	300x300		24,488,304 pps

jdk1.2:
	20x20		9,440,362 pps
	100x100		21,333,600 pps
	300x300 	24,570,419 pps

jdk1.3:
	20x20		9,106,579 pps
	100x100		21,231,683 pps
	300x300		24,616,363 pps

jdk1.4 (my most recent build):
	20x20		7,495,593 pps
	100x100		22,236,607 pps
	300x300		23,695,652 pps

And on my PIII-500 (single CPU) win98 system with a Matrox G400 running
32 bits per pixel:

jdk1.1:
	20x20		37,724,773
	100x100		36,661,107
	300x300		36,808,163

jdk1.2:
	20x20		3,631,375
	100x100		31,629,213
	300x300		43,824,489

jdk1.3:
	20x20		3,728,680
	100x100		33,635,275
	300x300		42,096,774

jdk1.4:
	20x20		5,750,953
	100x100		108,602,065
	300x300		165,263,578

From these results, it looks like:
	- There are definitely differences between OS's and video cards,
	especially when we are comparing hardware-accelerated images and non-
	accelerated images.

	- The overhead of the small (20x20) images appears to drag down the
	performance of 1.4 offscreen images to nearly the level of the
	1.2/1.3 software-based images.  In fact, on the older ATI video
	card, the hw-based images were even slower than the software-based
	images.

	- NT performance of all images seems gated at some maximum amount.  This
	might be a restriction on NT, or it could be a constraint of the
	older video card.  More investigation would be necessary to figure
	it out.  But all larger image sizes on all releases seem about the
	same.

	- win98 shows the difference between jdk1.1 hw-based images
	(flying at about 36M pps) versus jdk1.2/1.3 sw-based images
	(limited to only about 3M pps on the smallest image).

	- win98 on this fast video card shows the advantage of DirectDraw
	in the latest jdk1.4 builds; performance of jdk1.1 was gated
	at around 36M pps, but the performance of DirectDraw-based images
	appears much higher, at around 165M pps for the largest image size.

More work is necessary.  We need to make sure that we eliminate any overhead
that might be contributing to the lower scores in jdk1.4 for small image
sizes.  Profiling is necessary...

chet.haase@Eng 2001-04-24

I did a little more debugging/profiling and got the following information:

One of the key pieces of overhead in our Blt processing is due to the
ddraw Clipper object.  When I eliminate the Clipper (i.e., I don't attach
it to the window or set the clipper on the primary), then I more than double 
the performance of the smallest (20x20) image copies.  On my test system
(PIII-866 dual processor, nVidia TNT2), this made the performance go from
11 M pixels per second to over 26 M pixels per second.

Of course, this is a bottleneck that we cannot do much about: drawing
without a Clipper object requires that we do our own clipping to the
window (not too hard) but it also means that we would be subject to
Windows events that could cause rendering artifacts.  For example, if 
our window was obstructed, we would do our Blts over any overlapping
windows, regardless of which window was supposed to be on top (ddraw
draws directly to the screen without regard for Window properties).
And even if our window was on top at the time we issued the Blt call, 
this might not prevent some event (such as the user dragging a window)
from overlapping the window at the time of the actual Blt operation
(there is a delay between our issuing the call and that call actually
being processed by the hardware).  Actually, this situation might be
handled for us through context switching mechanisms of the driver/hardware
(hopefully the hardware would flush the graphics pipe before allowing the
window system to move things around).  But there is still a small hole of
opportunity between our checking for obstruction and actually issuing the call.
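
The "do our own clipping to the window" step mentioned above can be sketched
as a plain rectangle intersection.  This is an illustration only, not the
JDK's internal code: the class and method names are hypothetical, and
coordinates are assumed to be relative to the window's origin.  It handles
only the window bounds, so it does nothing about overlapping windows or the
timing hole described above.

```java
import java.awt.Rectangle;

final class ManualBlitClip {
    /**
     * Clips a blit of a srcW x srcH image placed at (dstX, dstY) against the
     * window bounds. Returns {srcRect, dstRect}, or null if nothing is visible.
     */
    static Rectangle[] clip(int srcW, int srcH, int dstX, int dstY, Rectangle window) {
        // Part of the destination rectangle that actually falls inside the window.
        Rectangle dst = new Rectangle(dstX, dstY, srcW, srcH).intersection(window);
        if (dst.isEmpty()) {
            return null; // fully outside the window: skip the Blt entirely
        }
        // Matching sub-rectangle of the source image, in image coordinates.
        Rectangle src = new Rectangle(dst.x - dstX, dst.y - dstY, dst.width, dst.height);
        return new Rectangle[] {src, dst};
    }

    public static void main(String[] args) {
        Rectangle window = new Rectangle(0, 0, 100, 100);
        Rectangle[] r = clip(20, 20, -5, 10, window);
        System.out.println("src=" + r[0] + " dst=" + r[1]);
    }
}
```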

Anyway, this got our performance up to 26 M pixels per second.  But the jdk1.1
version is still at 44 M, nearly twice the performance of our non-clipped
jdk1.4 version.  I think this difference can be attributed to various
overhead elements in our drawImage() processing.  During a profiling run
(using Compuware's TrueTime product), I found that we are spending
significant amounts of time (on the order of one to five percent) in the
following routines:
	ClipInfo (used to derive the actual src/dst values after clipping
		against sg.getCompBounds())
	Blit.getFromCache() (gets the cache entry for our Blit call)
	DrawImage.blitSurfaceData (spends a couple of percent just dealing
		with setting the CompositeType)
	AcceleratedOffScreenImage.getSourceSurfaceData (gets the accelerated
		surfaceData object for accelerated images)

There are various other methods and simple operations which end up taking
over a percent of the runtime.  Many of these functions are very simple
(like the equals() comparison when retrieving the Blit from the cache), but
when called over 60000 times (in this case), they add up to significant
overhead.

The reason for performance loss due to overhead in this case is that the
primitives in question are so small (20x20) that the more we do between
issuing the call from the application and actually issuing the ddraw call,
the more we suffer from each intermediate step.  For the larger primitives,
the amount of overhead is now insignificant compared to the performance
time of the actual rendering so we see the performance benefits of
ddraw much more clearly.
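
A simple way to picture this argument is a cost model in which every copy pays
a fixed per-call overhead plus a per-pixel cost.  The sketch below is only an
illustration of that reasoning.  The class name, the 50-microsecond overhead,
and the 200M pps peak rate are hypothetical values chosen for the example, not
measurements from this report.

```java
final class BlitOverheadModel {
    /**
     * Effective throughput when each copy costs a fixed overhead plus
     * (pixels / peakPps) seconds of actual rendering time.
     */
    static double effectivePps(int w, int h, double overheadSec, double peakPps) {
        double pixels = (double) w * h;
        return pixels / (overheadSec + pixels / peakPps);
    }

    public static void main(String[] args) {
        double overheadSec = 50e-6; // hypothetical fixed cost per drawImage call
        double peakPps = 200e6;     // hypothetical raw copy rate of the hardware
        for (int size : new int[] {20, 100, 300}) {
            System.out.printf("%dx%d\t%,.0f pps%n",
                    size, size, effectivePps(size, size, overheadSec, peakPps));
        }
    }
}
```

With any fixed overhead, the 20x20 case is dominated by the per-call cost
while the 300x300 case approaches the peak rate, which is the shape of the
jdk1.4 results above.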

(End of Evaluation text from 4276423)

More info from further analysis:

I wrote a native app that tested similar image copies using ddraw and
GDI images.  It varied the size of the images and performed three tests
with each size: ddraw without a Clipper, ddraw with a Clipper, and GDI
compatible bitmap.  These tests were chosen to represent jdk 1.4
(ddraw with a Clipper), jdk 1.4 maximum possible (ddraw without a 
Clipper - not necessarily something we can do, but a nice theoretical
maximum to know about), and jdk 1.1 (they used compatible bitmaps for the
offscreen images in that release).

The numbers I got were interesting.  It turns out that GDI performs 
significantly better than ddraw w/o Clipper up to a size of about 32x32.
It performs better than ddraw with a Clipper (i.e., a comparison of
jdk1.1 and jdk1.4) up to a size of about 47x47.  Beyond these sizes,
GDI performance drops significantly and ddraw achieves much better
results, both with and without the Clipper.

In fact, GDI hits a limit of about 65 Million pixels per second, but
ddraw is able to achieve over 200 million pixels per second with larger
primitives.

This data tells me that our performance bottleneck may not be due to
our overhead in getting to the ddraw call but might, instead, be due to
simple GDI performance advantages for smaller primitives.


chet.haase@Eng 2001-07-18


