 |
ASM 3.0 Developer Guide
by Eric Bruneton
02/02/2006, updated 23/10/2007
- Introduction
- Installation
- Overview
- Building
- Running tests
- Running examples
- Creating a distribution
- Code review
- Code organization
- Main data structures
- Main algorithms
- ClassReader
- ClassWriter, AnnotationWriter, FieldWriter and MethodWriter
- Label
- toByteArray
- Instruction resizing algorithm
- Basic algorithm
- Impact on code attributes
- An example
- Control and data flow analysis algorithm
- Basic data flow analysis algorithm
- Uninitialized types
- Exception handlers
- Dead code
- Subroutines
- Code optimizations
- Optimizing performance
- Optimizing code size
- An example
1 Introduction
This guide is primarily intended for ASM users that would like to contribute
to the ASM code base, although it can be of interest for other users as well.
It explains the organization of the code, the main data structures and the most
complex algorithms. It also explains the strategies that have been used to
optimize ASM performances as well as to minimize its code size, and illustrates
this through a concrete example.
2 Installation
This section explains how to compile the ASM source code, how to test it, and
how to build a distribution, which is the first thing to know before
contributing to this project.
2.1 Overview
The ASM source distribution is organized in several directories, in a way
that is shared with several ObjectWeb projects. These directories are the
following (see also the README.txt files contained in these
directories):
archive contains one Ant file per jar file that must be
included in the ASM distribution.
config contains jar files that are needed to build ASM, but
which are not needed at runtime.
doc contains non Javadoc based documentation for ASM. It is
currently empty.
examples contains examples showing how ASM can be used.
jdoc contains one Ant file per Javadoc documentation that must
be included in the ASM distribution.
output contains the results and artifacts produced by the
building process. It is created automatically.
src contains the ASM source code.
test contains the ASM conformance and performance tests.
The build.xml file defines the targets that are used to build,
test and package the ASM source code. These targets are presented in the
following sections.
2.2 Building
The source code, in the src directory, can be compiled with
ant compile (this requires a Java 1.3 or higher compiler).
The compiled code is stored in output/build/tmp, and is then
optimized to reduce its size. The optimized code is finally stored in
output/build.
Note: the optimization step can be skipped with ant
-Dproduct.noshrink=true compile.
2.3 Running tests
Tests are stored in the test directory. Conformance tests are
stored in the conform sub directory, while performance tests are
stored in the perf sub directory. The lib sub
directory contains jar files that are needed only to run tests. The tests can
be run in several different ways (note that some tests require a Java 1.6 or
higher runtime):
ant test runs all the tests, which takes a long
time. Note that it is not necessary to explicitly compile the code before
running tests: this is done automatically if needed.
ant -Dtest.type=type test runs only one type
of tests, where type can be conform for conformance tests or
perf for performance tests.
ant -Dtest.group=group test runs only one group
of tests, where group is the name of an Ant file without the
.xml extension. For instance ant
-Dtest.group=conform/classwriter test runs the
conform/classwriter.xml script, which itself runs the
ClassWriterTest test suite.
The test target compiles the tests into the
output/test directory. It also generates some .class
files that are used as input for these tests (these test cases are generated in
the output/test/cases directory by the
org.objectweb.asm.test.cases.Generator class). Finally it runs the
requested tests and stores the reports in output/test/reports.
Test coverage can be computed and reported with the ant
coverage and ant coverage.report targets. The
first target instruments the ASM classes and stores the instrumented classes in
output/instr, and then run all the conformance tests on the test
cases in output/test/cases. The second target generates an HTML
report in output/coverage.
New conformance tests can be added by writing a new JUnit test class in
test/conform/org/objectweb/asm/..., and by adding a new Ant script
to run these tests in test/conform. Note that, by convention, test
classes are named like the class they test, with a Test or
UnitTest suffix (XTest work on full compiled
Java classes, while XUnitTest work on class fragments).
Note: by default tests are run with the optimized ASM classes, which
do not contain debug info. In order to have line numbers in stack traces, it is
possible to run the tests with the unoptimized ASM classes, with ant
-Ddebug=true test.
2.4 Running examples
Examples can be run by building a binary distribution (see
next section), and then by running each example from its
directory in this generated distribution. But there is an easier way, that does
not require creating a binary distribution. In fact the example example
can be run with ant -Dexample.name=example example
(where example is the name of a directory in the examples
directory).
2.5 Creating a distribution
A binary distribution, containing the compiled ASM libraries, the Javadoc and
the ASM examples, can be created with ant dist. The result
is generated in the output/dist directory. It is also possible to
generate this distribution in a zip file format, with ant
zip. The result is generated in the output/zip directory
(this also generates a source distribution in a tar.gz file).
Finally ASM can also be distributed as an Eclipse plugin, which can be generated
with ant eclipse.site. Optionally ant
clean can be used before creating a distribution, in order to rebuild
everything from scratch (indeed this target removes everything in the
output directory).
Note: the version number used in the files produced when creating a
distribution is defined in the build.properties file.
3 Code review
This section presents the ASM source code organization, the main data
structures and the most complex algorithms, which is the next important thing to
know, after the installation and building issues, in order to contribute to this
project.
3.1 Code organization
ASM is organized in several packages:
org.objectweb.asm is the core package. It defines the ASM
visitor API and provides the ClassReader and
ClassWriter classes to read and write compiled Java classes. This
package does not depend on the other ones and can be used alone.
org.objectweb.asm.signature provides an API to read and write
generics signatures. It is independent of the core package but complements
it.
org.objectweb.asm.tree provides a DOM like API on top of the SAX
like API provided by the core package. It can be used to implement complex class
transformations, for which the core package would be too complicated to
use.
org.objectweb.asm.tree.analysis provides a static bytecode
analysis framework on top of the tree package. It can be used in addition to the
tree package to implement really complex class transformations that need to know
the state of the stack map frames for each instruction.
org.objectweb.asm.util provides some useful class visitors and
adapters that can be used for debugging purposes. It is generally not needed at
runtime.
org.objectweb.asm.xml provides the ability to convert classes
to and from XML. It can be used to prototype classes transformations quickly, by
using XLST. But it is not recommended for real world class adapters, since the
performances of this package are much lower than the performances of the core
package.
org.objectweb.asm.commons provides some useful class adapters
that are based on the core package. These adapters can be used as is or can be
extended to implement more specific class transformations.
org.objectweb.asm.optimizer is not part of the ASM
distribution. Indeed it is the optimization tool used to reduce the size of ASM
classes, which is itself based on ASM. The renaming of class members is defined
in the shrink.properties file.
From an implementation point of view the core package is the most complex
one. The tree, util and xml packages are
very simple (they just convert classes from one high level representation to
another one, which is much simpler than converting classes from a low level
representation - namely a byte array - to a high level one or vice versa). The
signature package is also quite simple (it consists in a parser and
a pretty printer for a small grammar). In fact the only packages that are not
completely trivial, except the core package, are the commons and
tree.analysis packages. But the algorithms used in
tree.analysis package are similar to the one explained in section
3.5. This explains why this guide is focused on
the core package only.
3.2 Main data structures
The core package is made of 20 classes and interfaces. If we exclude the 5
interfaces and the 3 utility classes (ClassAdapter,
MethodAdapter and Type), this leaves only 12 classes,
which are depicted on the figure below.
The conversion of compiled classes to visit events is done by only one class,
namely the ClassReader class. The inverse process is done by the
other ten classes, which are organized around the ClassWriter
class:
- Classes:
- the
ClassWriter class is the main entry point. It contains
fields that describe the class version, access flags, name, etc. It also
contains references to other objects that represent the constant pool, the
fields, the methods, the annotations and the attributes of the class.
- Constant pool:
- the
Item class is used to represent constant pool items. Only
one class is used in order to save code size, which explains why the
Item class has as many fields as possible constant pool item
types.
- Fields:
- the
FieldWriter class is used to write fields. It contains
fields that describe the field's name, type, signature, value, etc. It also
contains references to other objects that represent the field's annotations and
attributes.
- Methods:
- the
MethodWriter class is used to write methods. It contains
fields that describe the method's name, signature, exceptions, etc. It also
contains references to other objects that represent the method's annotations and
attributes. The method's code is stored in a byte array that is constructed
during the visit of bytecode instructions. The labels used to reference
instructions are stored in a linked list of Label
instructions.
- the
Handler class is used to represent try catch blocks. Each
handler references three Label objects that define the start and
end of the try block, and the start of the catch block.
- the
Label class is used to reference instructions, but also to
represent basic blocks, which are used for the automatic computation of the
maximum stack size and of the stack map frame table of a method.
- the
Frame class is used for the automatic computation of the
stack map frame table of a method.
- the
Edge class is used to represent the control flow graph of a
method, which is a graph of basic blocks, i.e. of Label objects. An
Edge is an edge between two Label objects in this
graph.
- Annotations:
- the
AnnotationWriter class is used to write annotations. This
class is referenced from the ClassWriter, FieldWriter
and MethodWriter classes, since classes, fields and methods can
have annotations.
- Attributes:
- the
Attribute class is used to read and write non standard
class attributes. It must be subclassed for each specific non standard attribute
that must be read and written. This class is referenced from the
ClassWriter, FieldWriter and MethodWriter
classes, since classes, fields and methods can have attributes.
- Ressources:
- the
ByteVector class is used to serialize the class elements
while they are visited. It is used to represent the constant pool, the
annotation values, the method's code, the stack map tables, the line number
tables, etc.
The core package does not use any java.util class. Collections
are represented as linked lists whose links are stored directly in the list
elements themselves. For instance a list of FieldWriter objects is
represented as the FieldWriter objects themselves, linked through
their next field. Likewise for MethodWriter,
AnnotationWriter, Label, etc. The advantage of this
method, compared to using separate objects to store the linked list itself (as
in java.util.LinkedList) is that it saves memory. The drawback is
that a given element cannot belong to several lists at the same time, but this
is not a problem in the ASM case.
The core package does not use hash tables, except one in the
ClassWriter class, which is used to represent a set of
Item objects. It is implemented with an array of Item
objects that can be chained together through their next field to
handle the case of hash code collisions. In other words, like for lists, the
hash table structure is embedded in the hash table elements themselves. The
advantages and drawbacks are the same (saves memory but elements cannot belong
to several hash tables at once).
Like for the previous data structure, the control flow graph (see
section 3.5) data structure is embedded in the graph
nodes themselves, i.e. in the Label objects. Since
Label objects must be stored in several data structures at the same
time, they have several distinct fields that encode these data structures:
- the
successor field is used to encode the list of labels of a
method, in the order they are visited.
- the
next field is used to encode the list of "changed" basic
blocks that is used in visitMaxs (see
section 3.5.1).
- the
successors field is used to store the list of
Edge objects that represent the control flow graph data
structure.
3.3 Main algorithms
3.3.1 ClassReader
The ClassReader algorithm is quite straightforward, but the
length of the accept method can make it difficult to see. For this
reason it is summarized in this section.
- parse the constant pool
- store the start index of each constant pool item in
items
- store the size of the longest string constant in
maxStringLength
- parse the class
- parse the header
- skip fields and methods
- parse the class attributes. Depending on the attribute:
- either analyze it directly (e.g.
Signature)
- or just store its start index (e.g.
Annotations)
- call
visit, visitSource,
visitOuterClass
- parse annotations and call
visitAnnotation for each
annotation
- call
visitAttribute for each non standard attribute parsed
during third step
- parses inner classes info and call
visitInnerClass for
each
- for each field
- parse the header
- parse the field attributes. Depending on the attribute:
- either analyze it directly (e.g.
Signature)
- or just store its start index (e.g.
Annotations)
- call
visitField
- parse annotations and call
visitAnnotation for each
annotation
- call
visitAttribute for each non standard attribute parsed
during second step
- call
visitEnd
- for each method
- parse the header
- parse the method attributes. Depending on the attribute:
- either analyze it directly (e.g.
Signature)
- or just store its start index (e.g.
Annotations)
- call
visitMethod
- if the returned visitor is a
MethodWriter, and if its
ClassWriter's constant pool was copied from this reader (see
section 3.3.2), the method bytes can be copied as is:
then all steps below are skipped.
- parse annotations and call
visitAnnotation for each
annotation
- call
visitAttribute for each non standard attribute parsed
during second step
- find labels and store then in the
labels array
- look for labels in the code
- look for labels in the exception handlers
- look for labels in the line number and local variable tables
- look for labels in the other code attributes
- if there is a stack map table, parse the first frame
- parse the instructions
- if there is a stack map frame for this offset, call
visitFrame,
and parse next frame
- if there is a label for this offset, call
visitLabel
- if there is a line number entry for this offset, call
visitLineNumber
- call
visitXInsn
- call
visitAttribute for each non standard code attribute parsed
during second step
- call
visitMaxs
- call
visitEnd
- call
visitEnd
Some points are interesting to note:
- the visit of line numbers and stack map frames is interleaved with the visit
of instructions. In the case of stack map frames, not only the visit, but also
the parsing of the stack map table and of the method's code is
interleaved. The advantage, compared to a parsing of the stack map table
followed by a parsing of the method's code, is that no complex data structure
is needed to store the parsed frames for the second step.
- constant pool items are parsed every time they are referenced, except UTF8
items, whose values are cached in the
strings array. Not also that
a single array is reused to parse these items. It must be large enough to parse
the longest string, hence the computation of maxStringLength.
3.3.2 ClassWriter, AnnotationWriter, FieldWriter and MethodWriter
Since the visit of the class members can be interleaved (it is possible to
start visiting a field, then start visiting a method, go back to visit
annotations of the field, continue with some instructions of the method, visit
attributes of the field, add new instructions to the method, and so on), it is
not possible to construct the class' byte array in a sequential order, from
beginning to end. Instead it is necessary to use several byte vectors that can
grow simultaneously. This is why there are several writer classes, unlike for
the reader case.
The ClassWriter class is the main entry point. Its most
important role, beside storing the class header elements and referencing the
list of its fields and methods, is to manage the constant pool. For that the
ClassWriter class uses a hash map, which is used as a set, in order
to avoid adding the same item several times in the constant pool (which is
represented by a simple byte vector). The constant pool can be copied from an
existing one by passing a ClassReader argument to the
ClassWriter constructor (see the copyPool method in
ClassReader). This allows unchanged methods to be detected and
copied as is from reader to writer, without visiting their content (see
section 3.3.1).
The AnnotationWriter and FieldWriter classes are
quite simple: they convert visit events to a byte array representation, by using
ClassWriter in order to add constant pool items when necessary. The
most complex writer class is the MethodWriter class, because it
manages advanced features such as automatic maxStack and maxLocals computation,
automatic stack map frames computation, and automatic handling of wide goto and
jsr. Without these features, each visitXInsn method would be
very simple, i.e. it would just add the bytecode representation of an
instruction to the code byte vector.
In order to be able to automatically compute maxStack and maxLocals, and to
compute stack map frames, each visitXInsn method does in
fact more than that: it constructs the control flow graph of the method (see
section 3.5), keeps track of the maximum number of
local variables used, and simulates the execution of the visited instruction to
keep track of the maximum stack size (for computing maxStack) or of the actual
stack element types (for computing stack map frames). In fact the general form
of a visitXInsn method is the following:
- if (currentBlock != null) // == if computeMaxs or computeFrames, and if this
instruction may be reachable
- if (computeFrames)
- simulate the execution of this instruction on stack
- else // computeMaxs
- simulate the execution of this instruction on stack height
- keep track of the local variables used, if any
- keep track of the successors of this instruction
- update currentBlock
- append the instruction to the
code byte vector
Also the visitMaxs method is more complicated due to these
advanced features. It is indeed here that maxStack, maxLocals and stack map
frames are computed, based on the control flow graph constructed in the
visitXInsn methods (see
section 3.5 for more details).
3.3.3 Label
From a user point of view the Label class is used to reference
instructions. Internally it is used to store the position in bytecode of an
instruction, and to compute bytecode offsets between this instruction and any
other instruction (it is also used to represent basic blocks, which are used for
the automatic computation of the maximum stack size and of the stack map frame
table of a method - see section 3.5).
Jump instructions such as IFEQ or GOTO are stored
in bytecode as an opcode followed by an offset to the target instruction (this
offset is the difference between the position in bytecode of the target
instruction and the position in bytecode of the jump instruction). This offset
can be computed easily in the case of a backward jump (i.e. a jump to an
instruction that is before the jump instruction in the bytecode), but it cannot
be computed at all in the case of a forward jump (i.e. a jump to an instruction
that is after the jump instruction) since, in this case, the target instruction
has not been visited yet, and so its position is unknown. The case of forward
jumps is solved in the following way:
- The jump instruction is written with a (temporary) null offset.
- The target
Label object is updated to memorize the fact that
this jump instruction makes a forward reference to this label (the
srcAndRefPositions array in Label is used for
that).
- When this label is visited, i.e. when its position becomes known, all the
forward jump instructions to this label (given by
srcAndRefPositions) are updated, to replace the temporary null
offsets with their real values, which can now be computed.
3.3.4 toByteArray
The toByteArray method in ClassWriter puts together
all the pieces constructed in the various writer classes in order to get the
full byte representation of the class. This is done in two steps:
- the size of the class is computed by summing the size of all the pieces,
which is given by the
getSize method (this method can add items to
the constant pool, which modifies its size; this is why the constant pool size
is added only at the very end).
- a byte vector of this size is allocated, and the pieces are copied into this
vector in the right order. This is done by calling the
put method
on each piece.
3.4 Instruction resizing algorithm
By default the jump instructions such as IFEQ and
GOTO store a bytecode offset in two bytes. This offset can
therefore vary between -32768 and 32767. However the bytecode of a method can be
as large as 65535 bytes: it is therefore possible to have bytecode offsets that
cannot be represented in two bytes. Hopefully there are two special jump
instructions that store their bytecode offset in four bytes, namely
GOTO_W and JSR_W.
In the case of backward jumps the jump offset is known when the jump
instruction is visited. It then easy to use GOTO or
GOTO_W, depending on the value of this offset. But in the case of
forward jumps this is not possible. A first solution is to assume that this
offset will require 4 bytes, and to reserve enough space for that. But this is
not very optimal. A second solution is to assume that this offset will require 2
bytes, and to reserve only 2 bytes for it. The problem is that if the offset
turns out to require 4 bytes, the jump instruction must be resized to insert two
additional bytes. In ASM the second solution was chosen. It requires a method
to resize the forward jump instructions that turn out to require long offsets.
The algorithm used in this resizing method is presented in this section.
3.4.1 Basic algorithm
The first idea is to resize instructions in only one pass, instead of
resizing them on the fly, in visitLabel, each time a forward jump
turns out to require a long offset. Indeed the resizing method requires a full
parsing of the bytecode, which is quite slow. It is therefore important to avoid
parsing the bytecode several times.
But this introduces a new problem: indeed long forward jumps must somehow be
memorized until the end of the method is reached, but they cannot be stored in
bytecode since the space reserved for them (2 bytes), is insufficient. In fact
this is possible, because the maximum method size, and therefore forward
jump offset, is 65535, which can be stored in an unsigned short. Also not
all byte values correspond to valid opcodes, so it is possible to define new
opcodes (that we note GOTO_L, JSR_L,
IFEQ_L, and so on) for long forward jump instructions, whose offset
is stored, by definition, as an unsigned short. Given these elements, if it
turns out, in visitLabel, that a forward offset is greater than
32737, the opcode of the corresponding instruction is changed to the equivalent
non standard opcode with unsigned offset, and the offset is stored as an
unsigned short.
We now have all the elements to resize the forward jump instructions that
require long offsets in one pass, once all the method has been visited. In
practice this consists in replacing the non standard opcodes introduced in the
previous step with standard ones. This resizing is done in several steps:
- first the bytecode is scanned in order to find the instructions that need to
be resized, and to find where and how much extra bytes must be inserted in
bytecode.
- then the bytecode is scanned a second time. This time a new bytecode array
is constructed in parallel, containing the resized instructions. All offsets
(either short or long) are recomputed, based on the information collected in the
first step (given a list of insertion points, it is easy to count the number of
extra bytes that must be inserted between a given source and target instruction,
and therefore to update an offset value).
The complex step is the first one. This is because the instructions that need
to be resized are not limited to the long forward jump instructions that were
detected during the visit of the method's bytecode. Indeed resizing an
instruction may transform a normal jump instruction into a long jump
instruction, which must then be resized too (see the example in
section 3.4.3). The first step is therefore
iterative. If, in an iteration, some new instructions to be resized are found,
a new iteration is performed. If, on the contrary, no new instructions to be
resized are found, the algorithm stops (in fact other iterations are necessary,
due to the padding of TABLESWITCH and LOOKUPSWITCH
instructions - see the comments in source code for more details).
3.4.2 Impact on code attributes
Some code attributes contain bytecode offsets, and must be updated when
instructions are resized. This is the case for example of the
LineNumberTable and LocalVariableTable attributes. These
attributes are updated by parsing them and by updating offsets in place (by
chance all these offsets are stored in four bytes, so no resizing is
necessary).
The stack map table attribute also contains bytecode offsets, and must be
updated when instructions are resized. But this is much more complicated than
with the previous attributes. One reason is that, depending on offsets between
frames, frames are not compressed in the same way (for instance
SAME_FRAME or SAME_FRAME_EXTENDED). Another reason is
that long forward IFX instructions are resized by
transforming them into IFNOTX GOTO_W, which
changes the control flow graph of the method, and therefore the frames that must
be stored in the stack map table.
For these reasons it is not possible to incrementally update the stack map
table, as in the LineNumberTable case for example. Instead the
stack map table is recomputed "from scratch", i.e. from the (modified) control
flow graph (see section 3.5). But this is possible
only if this control flow graph was computed, i.e. if the
COMPUTE_FRAMES option was used. The solution used for the other
cases is to reparse and rewrite and the whole class, this time with the
COMPUTE_FRAMES option (this is done in toByteArray,
if the invalidFrames flag indicating that a method containing a
stack map frame table was resized is set).
3.4.3 An example
Consider the following Java code:
public void m(int i, int j) {
for (; cond(i); --i) {
if (j == 0) {
break;
}
...
}
}
|
It is compiled into:
public m(II)V
GOTO L1
L2
ILOAD 2
IFNE L3
GOTO L4
L3
...
IINC 1 -1
L1
ALOAD 0
ILOAD 1
INVOKEVIRTUAL C.cond(I)Z
IFNE L2
L4
RETURN
|
During the visit of each instruction, offsets are computed step by step as
follows:
| Position |
Instruction |
Comment |
| 0 |
GOTO L1 |
offset unknown when this instruction is visited |
| 3 |
L2 |
|
| 3 |
ILOAD 2 |
|
| 4 |
IFNE L3 |
offset unknown when this instruction is visited |
| 7 |
GOTO L4 |
offset unknown when this instruction is visited |
| 10 |
L3 |
offset for IFNE L3 becomes known: 10 - 4 = 6 |
| 10 |
... |
|
| 32764 |
IINC 1 -1 |
|
| 32767 |
L1 |
offset for GOTO L1 becomes known: 32767 - 0 = 32767 |
| 32767 |
ALOAD 0 |
|
| 32768 |
ILOAD 1 |
|
| 32769 |
INVOKEVIRTUAL |
|
| 32772 |
IFNE L2 |
offset = 3 - 32772 = -32769 |
| |
L4 |
|
| |
RETURN |
|
|
The offset computed for IFNE L2 is -32769, which cannot be
stored in a short. The instruction is therefore transformed into
IFEQ GOTO_W L2 on the fly:
| 0 |
GOTO L1 |
offset = 32767, changed during visit of L1 |
| 3 |
L2 |
|
| 3 |
ILOAD 2 |
|
| 4 |
IFNE L3 |
offset = 6, changed during visit of L3 |
| 7 |
GOTO L4 |
offset still unknown |
| 10 |
L3 |
|
| 10 |
... |
|
| 32764 |
IINC 1 -1 |
|
| 32767 |
L1 |
|
| 32767 |
ALOAD 0 |
|
| 32768 |
ILOAD 1 |
|
| 32769 |
INVOKEVIRTUAL |
|
| 32772 |
IFEQ L4 |
offset = 8 |
| 32775 |
GOTO_W L2 |
offset = 3 - 32775 = -32772 |
| 32780 |
L4 |
offset for GOTO L4 becomes known: 32780 - 7 = 32773 |
| |
RETURN |
|
|
The offset computed for GOTO L4 is 32773, which cannot be stored
as a signed short. As explained in section 3.4.1, the
GOTO instruction is then replaced with a non standard
GOTO_L instruction, whose offset is stored as an unsigned
short:
| 0 |
GOTO L1 |
offset = 32767, changed during visit of L1 |
| 3 |
L2 |
|
| 3 |
ILOAD 2 |
|
| 4 |
IFNE L3 |
offset = 6, changed during visit of L3 |
| 7 |
GOTO_L L4 |
offset = 32773, changed during visit of L4 |
| 10 |
L3 |
|
| 10 |
... |
|
| 32764 |
IINC 1 -1 |
|
| 32767 |
L1 |
|
| 32767 |
ALOAD 0 |
|
| 32768 |
ILOAD 1 |
|
| 32769 |
INVOKEVIRTUAL |
|
| 32772 |
IFEQ L4 |
offset = 8 |
| 32775 |
GOTO_W L2 |
offset = 3 - 32775 = -32772 |
| 32780 |
L4 |
|
| 32780 |
RETURN |
|
|
Since the bytecode contains at least one non standard instruction, the
resizeInstructions method is called in order to replace them with
standard ones. As explained in section 3.4.1, the
first step, iterative, consists in finding where and how much extra bytes must
be inserted into the bytecode. The first iteration of this first step finds that
GOTO_L must be replaced with a GOTO_W and that,
consequently, 2 extra bytes must be inserted before index 10. Since at least one
new instruction to be resized was found, a new iteration is performed. This
time, when looking at the first instruction, GOTO L1, it is found
that the jump offset, taking into account the fact that 2 extra bytes will be
inserted before index 10, will become 32769. The first instruction will therefore
need to be resized too, and 2 extra bytes must be inserted before index 3. The
second iteration ends without detecting other instructions needing to be
resized. However, since a new instruction to be resized was found, a third
iteration is performed. This time no new instruction to be resized is found, and
so the first step ends with the following result: insert 2 bytes before index 3,
and 2 bytes before index 10 (in original bytecode).
The second step parses the bytecode a fourth and last time, and constructs
the resized instructions at the same time, copying instructions as is if they
do not need to be resized (except for offsets, which are all recomputed), or
changing them in the other case. The result is shown below:
| 0 |
GOTO_W L1 |
offset = 32771 |
| 5 |
L2 |
|
| 5 |
ILOAD 2 |
|
| 6 |
IFNE L3 |
offset = 8 |
| 9 |
GOTO_W L4 |
offset = 32775 |
| 14 |
L3 |
|
| 14 |
... |
|
| 32768 |
IINC 1 -1 |
|
| 32771 |
L1 |
|
| 32771 |
ALOAD 0 |
|
| 32772 |
ILOAD 1 |
|
| 32773 |
INVOKEVIRTUAL |
|
| 32776 |
IFEQ L4 |
offset = 8 |
| 32779 |
GOTO_W L2 |
offset = -32774 |
| 32784 |
L4 |
|
| 32784 |
RETURN |
|
|
3.5 Control and data flow analysis algorithm
This section presents the algorithm used to compute the maximum stack size
and the stack map frame table of a method. This algorithm is a control and data
flow analysis algorithm, based on the decomposition of the method into a
control flow graph of basic blocks. A basic block is a sequence of
instructions where only the first instruction is the target of a jump
instruction, and where only the last instruction can jump to other basic
blocks. The control flow graph of a method is the graph whose nodes are the
basic blocks, and whose edges connect the basic blocks that are linked by jump
instructions. This graph is constructed during the visit of each instruction.
As an example, the control flow graph of the method defined in
section 3.5.1 is the one shown in
section 2.
3.5.1 Basic data flow analysis algorithm
Stack map frames are computed in a two steps process:
- During the visit of each instruction, the state of the frame at the end of
the current basic block is updated by simulating the action of the instruction
on the previous state of this so called "output frame".
- In visitMaxs, a fix point algorithm is used to compute the "input frame" of
each basic block, i.e. the stack map frame at the beginning of the basic block,
starting from the input frame of the first basic block (which is computed from
the method descriptor), and by using the previously computed output frames to
compute the input state of the other blocks.
Let's take a simple example in order to explain the details of this
algorithm. Consider the following very simple method:
public static m(Z)LA;
GETSTATIC B.VALUE : LB;
ASTORE 1
GOTO L0
L1
GETSTATIC A.VALUE : LA;
ASTORE 1
L0
ILOAD 0
IFNE L1
ALOAD 1
ARETURN
|
First step
As stated above, during the first step of the algorithm, which takes place
in each visitXInsn method in ClassWriter, the
state of the output frame of each basic block is updated by simulating the
action of the visited instruction. It is important to note that the algorithm
works at the basic block level, and not at the instruction level. This means
that input and output frames are associated to basic blocks, and not to
individual instructions. In practice, they are stored in a Frame
object that is associated to a Label object, which marks the
beginning of basic blocks.
The effect of this first step for the above example method is illustrated on
the table below:
| Label |
Instruction |
Output frame |
Comment |
| |
GETSTATIC B.VALUE : LB; |
O1 = [?] [?B] |
getstatic pushes a value of type B on the stack |
| |
ASTORE 1 |
O1 = [?B] [?] |
astore consumes this value and stores it in local 1 |
| |
GOTO L0 |
O1 = [?B] [?] |
goto does not change the frame |
| |
L1 |
GETSTATIC A.VALUE : LA; |
O2 = [?] [?A] |
each basic block starts with a new, unknown frame |
| |
ASTORE 1 |
O2 = [?A] [?] |
astore stores the value produced by getstatic in local 1 |
| |
L0 |
ILOAD 0 |
O3 = [?] [?I] |
iload pushes the value of local 0, which is of type int |
| |
IFNE L1 |
O3 = [?] [?] |
ifne consumes this value |
| |
| |
ALOAD 1 |
O4 = [?] [?L1] |
aload pushes the value of local 1, which is unknown |
| |
ARETURN |
O4 = [?] [?] |
areturn consumes this value |
|
At the beginning, the output frame O1 of the first basic block is completely
unknown. During the visit of the first instruction, the action of
GETSTATIC is simulated: the result is that a new value of type B is
pushed on the stack, on top of the previous values (although we know here that
the stack is initially empty, we do not take this into account and do as if the
stack could previously contain any number of values of any type - hence the
[?B]). During the visit of the second instruction, the output frame O1 is
updated to simulate the effect of ASTORE: the result is that the
value previously pushed on the stack is popped and stored in the local variable
1. The visit of the third instruction does not change the output frame O1, but
changes the current basic block to null.
The visit of the L1 label makes L1 become the new current basic block. Like
for the first basic block, the output frame O2 of this basic block is initially
completely unknown. The visit of the instructions of this basic block is then
similar to the visit of the previous instructions.
The visit of the L0 label makes L0 become the new current basic block. Here
again we start with a completely unknown output frame O3 although, in this
case, we could start from the value of O2 (since this basic block is a
successor of the previous one). The ILOAD instruction loads the
value of local variable 0, which is necessarily of type int (the whole
algorithm is based on the assumption that the code is correct), and pushes it
on the stack. The IFNE instruction consumes this value.
Another effect of simulating the action of the IFNE instruction
is to create a new basic block, and to make it the new current basic block.
This is why, although there is no label just after this instruction, the basic
block changes. Here again, the output frame O4 of this basic block is initially
completely unknown (although, as before and for the same reason, we could start
from the value of O3). The ALOAD instruction loads the value of
local variable 1, whose type is unknown since the frame is initially completely
unknown. The only thing we know is that, after the execution of this method,
the stack contains one additional value whose type is the type of local
variable 1 before this instruction was executed (hence the [?L1]).
During the second step of the algorithm, which takes place in the
visitMaxs method, the input frames of each basic block are computed by
using an iterative fix point algorithm (i.e. an algorithm to find a fix point
x of a function f, defined by f(x)=x. If x values
define a complete lattice and if f is monotonic,
xn+1=f(xn) converges to the smallest fix point of
f, according to the
Tarski
theorem. Here x is a control flow graph with input and output frames,
f is a "merge" function and the order relation is based on the subtyping
relation for the verification type system). First the input frame I1 of the
first basic block is computed from the method descriptor "public static
m(Z)LA;", which gives I1 = [I] []. Then the first basic block is marked as
"changed", and the fix point algorithm starts:
| Iteration |
Changed |
Output frames |
Input frames |
Comment |
| 0 |
B1 |
O1 = [?B] [?] O2 = [?A] [?] O3 = [?] [?] O4 = [?] [?] |
I1= [I] [] I2 = [?] [?] I3 = [?] [?] I4 = [?] [?] |
Initialization of input frame I1 from the method's descriptor, B1
marked as "changed" |
| 1 |
B3 |
|
I1= [I] [] I2 = [?] [?] I3 = [IB] [] I4 = [?] [?] |
B1 marked as "unchanged", merge of I1 and O1 into I3 (B3 is a successor
of B1), B3 marked as "changed" |
| 2 |
B2, B4 |
|
I1= [I] [] I2 = [IB][] I3 = [IB] [] I4 = [IB] [] |
B3 marked as "unchanged", merge of I3 and O3 into I2 (B2 is a successor
of B3), B2 marked as changed, merge of I3 and O3 into I4 (B4 is as
successor of B3), B4 marked as changed |
| 3 |
B4, B3 |
|
I1= [I] [] I2 = [IB] [] I3 = [IA] [] I4 = [IB] [] |
B2 marked as "unchanged", merge of I2 and O2 into I3 (B3 is a successor
of B2), B3 marked as changed. |
| 4 |
B3 |
|
I1= [I] [] I2 = [IB] [] I3 = [IA] [] I4 = [IB] [] |
B4 marked as "unchanged" |
| 5 |
B2, B4 |
|
I1= [I] [] I2 = [IA] [] I3 = [IA] [] I4 = [IA] [] |
B3 marked as "unchanged", merge of I3 and O3 into I2 (B2 is a successor
of B3), B2 marked as changed, merge of I3 and O3 into I4 (B4 is as
successor of B3), B4 marked as changed |
| 6 |
B4 |
|
I1= [I] [] I2 = [IA] [] I3 = [IA] [] I4 = [IA] [] |
B2 marked as "unchanged", merge of I2 and O2 into I3 (B3 is a successor
of B2), B3 not marked as changed. |
| 7 |
|
|
I1= [I] [] I2 = [IA] [] I3 = [IA] [] I4 = [IA] [] |
B4 marked as "unchanged" |
|
3.5.2 Uninitialized types
The simulation of a NEW T instruction results in a
special uninitialized type being pushed on the stack. This special type
contains the offset of the NEW instruction that created it. When
an INVOKESPECIAL instruction for a T constructor is
simulated, all occurrences of this special type in the current stack map frame
must be replaced with the normal T type.
The basic block input and output frame data structures presented in the
previous section are not sufficient to take uninitialized types into account.
Indeed, when a constructor invocation is visited, the type of the target may be
unknown, and many local variable and operand stack types may also be unknown. It
is therefore impossible to replace all occurrences of the target type everywhere
in the stack map frame.
For this reason an additional data structure is associated to each basic
block, namely the list of types that are initialized in this basic block (these
types can be relative to the unknown input stack map frame of the basic block).
This data structure is constructed during the visit of instructions, and is
used in visitMaxs, when all the types are known.
3.5.3 Exception handlers
For all the instructions covered by an exception handler, the control flow
can jump to the exception handler block. This means that, inside the region
covered by an exception handler, and as a consequence of the definition of
basic blocks, basic blocks are reduced to individual instructions. In this case
the advantage of using an algorithm working at the basic block level is lost,
since there are as many basic blocks as instructions.
Hopefully it is not necessary to really use one basic block per instruction
inside regions covered by exception handlers. This is because not all the
frames associated to these instructions have an impact on the input frame of
the exception handler. Indeed this input frame only contains local variable
types, and its stack is reduced to a single element that depends only on the
type of exception caught by this handler. As a consequence only the frames
associated to the instructions that affect local variables are important. In
practice, this means that, inside regions covered by exception handlers,
xSTORE instructions end the current basic block (and start a
new one) like, for instance, an IFEQ instruction.
As an example, consider the following method:
public m(Ljava/lang/Integer;Ljava/lang/Float;)Ljava/lang/Number;
TRYCATCHBLOCK L0 L1 L1 java/lang/Exception
ACONST_NULL
ASTORE 3
L0
ALOAD 1
ASTORE 3
ALOAD 2
ASTORE 3
ALOAD 3
ARETURN
L1
ASTORE 4
ALOAD 3
ARETURN
|
Normally, due to the exception handler, each instruction between L0 and L1
should be a distinct basic block, which would give 6 basic blocks inside this
region. In practice, thanks to the above optimization, only the
ASTORE instructions change the current basic block, which gives 3
basic blocks (ALOAD 1 ASTORE 3, ALOAD 2 ASTORE 3 and
ALOAD 3 ARETURN). Note that if the instructions between L0 and L1
were considered as a single basic block, the frame computed for L1 would be
incorrect: it would indeed be the same as the output frame of the previous
block, where local variable 3 is of type Float (whereas the correct
value is the common super type of Integer and Float,
i.e. Number).
Note: the edges of the control flow graph that connect basic blocks
to exception handler blocks are not constructed in the
visitTryCatchBlock method but in the visitMaxs method.
Indeed, since the visitTryCatchBlock method must be called before
the labels passed as arguments to this method are visited, it is not possible
to know in this method which labels belong to the range protected by this
handler.
3.5.4 Dead code
The fix point algorithm used in the second step of the algorithm described
in section 3.5.1 is limited to reachable code. Indeed,
by definition, reachable code is code that can be reached from the initial basic
block in the control flow graph, and the fix point algorithm is precisely
looking at these blocks. As a consequence the algorithm does not compute the
input frame of dead basic blocks.
Unfortunately the Java 6 split verifier requires a stack map frame for every
basic block, even unreachable ones. The problem, as shown above, is that these
frames are not computed and, even worse, cannot be computed. Indeed an
unreachable basic block can contain illegal bytecode sequences, such as
ISTORE 1 ALOAD 1 (more precisely this was possible with the
JVM verifier prior to Java 6; but this is no longer possible with the new
verifier).
The consequence of all this is that dead code must either be removed or
replaced with a valid bytecode sequence whose stack map frame can be easily
computed. The first solution was judged too complicated (although the
resizeInstructions method could help). So the second solution has
been chosen.
A simple solution is to replace dead code with NOP
instructions. In this case any stack map frame will be ok for these blocks. The
only problem is that execution can fall from the end of a dead code block to
the end of the method or to the start of a reachable block. So either the stack
map frame for the dead code block must be consistent with the frame of the next
block, or the last instruction of the dead code block must not be replaced with
a NOP but with an instruction without successor, such as
RETURN, GOTO, or ATHROW.
The first solution is too complicated; the second one is possible, but the
fact that the dead code block can be reduced to a single byte must be taken
into account: there is not always enough room to replace it with NOP
instructions followed by a GOTO, for example.
xRETURN can be used (this is a single byte instruction), but
this requires adjustments to the method's return type. The ATHROW
instruction does not have this problem, and is also a single byte instruction.
It was therefore chosen to end the dead code blocks.
Note that it is not necessary to insert instructions to create something on
stack before ATHROW: the stack map frame for this block can
"declare" that, when the execution of this basic block begins, there is already
an exception on stack (there will be no consistency problem with other frames
since this block is unreachable. Also there is no problem with declared
exceptions, because the verifier does not check that ATHROW
instructions are consistent with declared exceptions).
Choosing ATHROW and declaring a stack map frame of the form []
[java/lang/Throwable] for dead basic blocks gives the additional benefit of not
requiring any adjustment to exception handlers. Indeed, if a dead basic block
is an exception handler, either its stack map frame must be consistent with []
[exception type], or this exception handler declaration must be removed. Since
[] [java/lang/Throwable] is consistent with any [] [exception type] frame,
exception handler declarations for dead code blocks do not need to be removed,
which saves code.
In summary, the solution for dead code blocks is to replace these blocks
with NOP ... NOP ATHROW, and to declare
a [] [java/lang/Throwable] stack map frame for these blocks.
3.5.5 Subroutines
The JSR and RET instructions, used for
subroutines, complicate the control flow and data flow analysis algorithms.
Hopefully they must not be used with Java 6 classes, so they do not have any
impact on the algorithm used to compute stack map frames. But they do
complicate the algorithm used to compute the maximum stack size, with the
COMPUTE_MAXS option. More precisely they do not impact the
algorithm used to compute the maximum stack size from the control flow graph,
but they complicate the construction of this graph.
Like for exception handlers, the edges in the control flow graph that
correspond to subroutines are computed in visitMaxs.
The first step consists in finding, for each basic block, to which
subroutine(s) it belongs. All the basic blocks that are reachable from the
first one, without following a JSR or RET instruction,
are marked as belonging to a main "subroutine". Likewise, for all
JSR targets, the basic blocks reachable from this target are marked
as belonging to the subroutine defined by this JSR target.
The second step consists in finding, for each RET instruction,
to which possible basic blocks it can return to. The possible basic blocks are
those that follow JSR instructions. We therefore examine each
JSR instruction in turn. For each such JSR
instruction I we look at the basic blocks that belong to the called
subroutine. When a RET instruction is found we add the block
following I as a successor of the RET instruction (except
if I and the RET instruction belong to a common subroutine,
which can happen if a subroutine returns to its parent subroutine without a
RET).
Let's take an example to illustrate this:
public m(Z)V
L0
JSR L2
L1
RETURN
L2
ASTORE 2
JSR L4
L3
GOTO L5
L4
ASTORE 3
ILOAD 1
IFEQ L5
RET 3
L5
RET 2
|
After all the instructions have been visited in MethodWriter, and
just before visitMaxs is called, the control flow graph is the
following (L1 and L3 are not real successors of L0 and L2: these edges are
only used to keep track of JSR's return addresses, and are
ignored during the control flow graph analysis used to compute the maximum
stack size):
- L0 successors: L2, L1
- L1 successors: none
- L2 successors: L4, L3
- L3 successors: L5
- L4 successors: L5
- L5 successors: none
As explained above, the first step in visitMaxs consists in
finding, for each basic block, to which subroutine(s) it belongs. The blocks
that are reachable from L0 without following a JSR or a
RET are L0 and L1: they are marked as belonging to subroutine
#0. The blocks that are reachable in the same way from the subroutine
starting at L2 are L2, L3 and L5. They are marked as belonging to
subroutine #1. Finally the blocks that are reachable in the same way from
the subroutine starting at L4 are L4 and L5. They are marked as belonging to
subroutine #2. Note that L5 is marked as belonging to two subroutines. This
is due to the fact the nested subroutine #2 can "return" to its parent
subroutine #1 without a RET:
- L0 belongs to: subroutine #0
- L1 belongs to: subroutine #0
- L2 belongs to: subroutine #1
- L3 belongs to: subroutine #1
- L4 belongs to: subroutine #2
- L5 belongs to: subroutine #1 and subroutine #2
The second step consits in finding the successors of the RET
instructions. As explained above, this is done by examining each
JSR instruction in turn. The first one, in L0, leads to the
analysis of the basic blocks that belong to subroutine #1. In these basic
blocks we find only one RET instruction, in L5. These
JSR and RET instructions do not belong to a
common subroutine, so we add L1 (the next basic block after the
JSR in L0) as a successor of L5.
The second JSR instruction, in L2, leads to the analysis
of the basic blocks of subroutine #2. In these basic blocks we find two
RET instructions, in L4 and L5. L2 and L4 do not belong to
a common subroutine, so we add L3 (the next basic block after the
JSR in L2) as a successor of L4. But L2 and L5 belong to a
common subroutine (subroutine #1), so we do not add L3 as a
successor of L5.
The final control flow graph is the following:
- L0 successors: L2, L1
- L1 successors: none
- L2 successors: L4, L3
- L3 successors: L5
- L4 successors: L5, L3
- L5 successors: L1
Note: you may have noticed that the L4 "basic block" is not a real
basic block, because it contains several instructions that can lead to other
blocks. In fact it is the union of two consecutive basic blocks. This is not
an error: with the COMPUTE_MAXS option it is not always necessary
to decompose the control flow graph down to individual basic blocks. Therefore,
when possible, several consecutive basic blocks are represented as a single
block, for optimization purposes.
4 Code optimizations
The main objective of ASM is to get the smallest and fastest code as
possible, while keeping a quite "clean" public API. This section explains the
techniques used in this project to achieve these objectives (many of
them are not recommended for mainstream development).
4.1 Optimizing performances
The first and most important step to get good performances is to use the
right API and good algorithms [R0]. Sometimes it is hard to decide which API or
algorithm is best without actually implementing and testing all options. In
this case, take the time to implement and test all options to find the best
one. There are however some general rules that can be used when designing an
API and when implementing it, in order to get good performances (a side effect
of these rules is to reduce code size):
- [R1] the API must stay very close to the internal class file structures, in
order to avoid costly conversions during parsing and writing. Many examples of
this rule can be seen in the existing API: internal class names, type
descriptors and signatures are passed as argument in visit methods in the same
form as they are stored in the class file. Stack map frames are also visited as
they as stored in the class file, i.e. in a compressed form (although there is
an option to uncompress them). The drawback of this rule is when several class
adapters need a high level view of these encoded structures (for example a
SignatureVisitor based view of a signature string): indeed, in this
case, the decoding and encoding steps will be executed in each adapter, while
they could be executed only once in ClassReader and
ClassWriter.
- [R2] the implementation must not include any check or verification, at any
level (class file parsing, preconditions of methods, validity of bytecode
sequences, etc). These verifications must be added in
CheckXAdapter classes.
Once the best API and algorithms have been found, several "low level"
techniques can be used to optimize performances:
- [R3] avoid string manipulation operations. These operations generally have a
high cost. Examples of this can be seen in
ClassReader: the
strings array is used to avoid parsing and building strings several
times for the same constant pool UTF8 item. Another example is in the
Label class: the types used for stack map frames are encoded as int
values instead of strings (they were initially stored as strings; the change to
int values improved performances a lot).
- [R4] avoid array copy operations. Several examples of this principle can be
seen in the existing implementation. For example the
toByteArray
method first computes the size of the class, then allocates a
ByteVector of this size, and finally writes the class content in
this vector. This avoids many calls to the enlarge method in
ByteVector, and therefore many array copy operations.
- [R5] avoid getter and setter methods, at least for non public or protected
fields. Some JVMs do not inline these methods, in which case accessing fields
directly is faster than using getter and setters. This also saves code. The core
package does not use any such method. The tree package also exposes many fields
directly.
- [R6] copy frequently accessed instance fields to local variables to reduce
field access operations (this is the logical continuation of the previous
technique). See for example the
ByteVector class: the
length and data fields are copied into local variables
in methods that access them several times. A variant of this rule is to replace
fields with method parameters (this variant was used in the
signature package - see
section 4.3).
- [R7] cache the result of complex functions in order to execute them only
once. This rule is used, for example, in the
visitMethod method in
MethodWriter: the result of getArgumentsAndReturnSizes
for a method Item is cached in an unused field of method
Items. It is also used in ClassReader: the
strings array is used as a cache to avoid parsing the same string
constant pool item several times.
- [R8] use
int coded data structures in order to replace slow
data structure access methods with fast arithmetic or bitwise operations (this
also reduces code size). Several examples of this rule can be seen in the
existing code: the boolean stack encoded as an int in
SignatureWriter, the verification types encoded as int
in Label, etc.
- [R9] use
switch statements that can be compiled into
TABLESWITCH instead of LOOKUPSWITCH (the first opcode
executes faster), where possible (this requires switch case constants that are
not sparse).
These techniques must be used with care (because they can make code harder
to understand and to maintain) and only when they bring a real performance
benefit. In order to see if it is the case, the performances of every proposed
optimization must be compared to the performances of the current code (if
possible on several JVMs, in order to exclude singularities due to specific JVM
optimization strategies).
Of course optimizations must be applied in priority to the code that is
responsible for the most important part of the total execution time, in order
to get the best performance improvements. In order to detect these "hotspots",
a code profiler can be used. It is also possible to turn some code sections on
or off, in order to measure, by difference to the normal case, the cost of this
section.
4.2 Optimizing code size
The obvious way to reduce code size is to reduce the number of classes and
methods! This begins by reducing the number of interfaces in the public API
(not because interfaces themselves are big, but because it indirectly reduces
the number of classes and methods) [R10]. This reduction generally requires
compromises with what a "clean" object oriented API would be. Whether these
compromises are worthwhile or not must be decided on a case by case basis.
Several such compromises can be detected in the current public ASM API. For
example the AnnotationVisitor interface is used both for
annotations and annotation arrays although, in the second case, the
name argument is useless and would not have been used if an
AnnotationArrayVisitor interface were defined. Another example is
the SignatureVisitor interface: a single interface is used to
represent a quite complex recursive data structure that, in a "normal" object
oriented design, would be represented with up to 8 interfaces.
The number of classes can be reduced in several ways. The most "elegant" one
is to use appropriate design patterns [R11]. For instance the
ClassReader class needs Attribute factories in order to
create attribute objects for the class, method and field attributes. But
instead of using the Factory design pattern, which requires two classes, ASM
uses the Prototype pattern, which uses only one class.
The number of classes can also be reduced by merging several related classes
into a single one [R12]. For instance a single Item class is used
to represent the eleven constant pool item types, and the Label
class is used to represent both bytecode positions and basic blocks. The
drawback of this approach is that it increases memory requirements (an integer
Item object, for example, uses 11 fields while an
IntegerItem class would require only 4 fields) and object
instantiation times (Label and Frame classes were
initially merged but have been separated in order to improve performances).
The number of methods can be reduced by inlining methods that are called at
only one place in the code (a method that is called from several places can
also be inlined if the client code can be refactored to call it at only one
place) [R13]. This technique has been used in the ClassReader
class: the accept method is very big because it inlines many
auxiliary methods that, in a normal object oriented design, would have been
defined separately in order to get methods of reasonable sizes. Another
benefit of this technique is that it increases performances.
Finally some tricks can be used in very specific cases to reduce code
size:
- [R14] use string constant values directly in code, instead of declaring them
in
final static String fields. It saves space by removing field
declarations.
- [R15] avoid using array initializers like
{
value1, ... valueN }. These
initializers can be replaced with a loop that initializes each array element
from a string representation of the whole array. This technique is used in
ClassWriter and asm.util.AbstractVisitor, for
example.
- [R16] some switch instructions can be replaced with an array access. For
instance a switch that associates A, C or Z to 0, 1 or 3 respectively can be
replaced with
new char[] {
'A', 'C', 0, 'Z'
}[i], which can itself be optimized in
"AC Z".charAt(i) (by using [R15]).
The last technique used in ASM to reduce code size is to rename private and
package class members with shorter names, and to sort the constants in the
constant pools (in order to get better jar compression ratios).
This is done with the asm.optimizer package.
4.3 An example
This section illustrates the previous optimization techniques on a real
example, namely the asm.signature package. The goal of this package
is to provide an event based API, a parser and a writer for the generics
signature grammar, given below:
ClassSignature:
FormalTypeParameters? ClassTypeSignature
ClassTypeSignature*
FormalTypeParameters:
< FormalTypeParameter+
>
FormalTypeParameter:
Identifier FieldTypeSignature?
FieldTypeSignature*
FieldTypeSignature:
ClassTypeSignature | ArrayTypeSignature |
TypeVariableSignature
ClassTypeSignature:
L Identifier (
/ Identifier )*
TypeArguments? ( .
Identifier TypeArguments? )*
;
TypeArguments:
< TypeArgument+
>
TypeArgument:
* | ( ( + |
- )? FieldTypeSignature )
ArrayTypeSignature:
[ TypeSignature
TypeVariableSignature:
T Identifier
;
TypeSignature:
Z | C |
B | S |
I | F |
J | D |
FieldTypeSignature
MethodTypeSignature:
FormalTypeParameters? (
TypeSignature* ) ( TypeSignature |
V ) (
^ClassTypeSignature |
^TypeVariableSignature )*
|
The first design of this package used 7 interfaces, roughly corresponding to
the grammar non terminals:
ClassSignatureVisitor
FormalTypeParameterVisitor
FieldTypeSignatureVisitor
ClassTypeSignatureVisitor
TypeArgumentsVisitor
TypeSignatureVisitor
MethodTypeSignatureVisitor
The parser was generated with JavaCC and the writer used 7
classes, one per interface.
The second step was to replace the generated parser with a hand written,
recursive descent parser, containing 11 recursive methods (one per rule of the
grammar). This led to major performance improvements, and to a huge code size
reduction. This was due to the fact that hand written code can be more
efficient than generated code, and also to the fact that all lexical and
syntactic verifications were removed ([R2]).
In the third step a refactoring reduced the number of writer classes to 6
instead of 7 (by using inheritance).
In the fourth step the number of interfaces was reduced to 4 ([R10]), and
the 6 writer classes were merged into a single one ([R12]), using a boolean
stack encoded as an int ([R8]). The parser was mostly unchanged.
The reduced API allowed invalid signatures to be generated (it corresponded to
a generalized grammar where FieldType, ClassType ,
ArrayType, TypeVariable and Type are merged into a single
Type non terminal), but this was seen as an acceptable compromise.
In the sixth step the parser was optimized by replacing some fields that
represented the last parsed character and its index with method arguments
([R6]). In addition some methods were inlined, which removed 6 parsing methods
out of 11 ([R13]).
Finally, after some feedback from users, the API was reduced to only one
interface (although it allowed even more invalid signatures to be generated,
this was judged more practical to define SignatureAdapters), and
this provided more opportunities to inline parser methods: the final parser
contains only 3 parsing methods.
The code size decreased from 22KB at the first step, to 6.8KB at the second
step, 4.5KB at the fourth step, and less than 4KB at the last step. At the same
time the average time to parse and rebuild a signature decreased from 74 micro
seconds at the first step, to 15 micro seconds at the second step, 5.4 micro
seconds at the fourth step, and less than 5 micro seconds at the last step.
As can be seen from this example (more details can be found on the
ASM
mailing list archives of January 2005), improving performances often leads
to code size reductions, and vice versa!
|