Why is “2 * (i * i)” faster than “2 * i * i”?
up vote
15
down vote
favorite
The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?
Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:
2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789
2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.
java optimization
|
show 9 more comments
up vote
15
down vote
favorite
The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?
Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:
2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789
2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.
java optimization
5
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
2
I ran each version around 10 times, and the2 * i * iversion always takes slightly longer.
– Stefan
1 hour ago
4
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
3
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
3
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago
|
show 9 more comments
up vote
15
down vote
favorite
up vote
15
down vote
favorite
The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?
Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:
2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789
2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.
java optimization
The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?
Edit: I ran each version of the program 15 times, alternating between the two. Here are the results:
2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789
2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.
java optimization
java optimization
edited 1 hour ago
asked 1 hour ago
Stefan
1175
1175
5
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
2
I ran each version around 10 times, and the2 * i * iversion always takes slightly longer.
– Stefan
1 hour ago
4
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
3
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
3
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago
|
show 9 more comments
5
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
2
I ran each version around 10 times, and the2 * i * iversion always takes slightly longer.
– Stefan
1 hour ago
4
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
3
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
3
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago
5
5
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
2
2
I ran each version around 10 times, and the
2 * i * i version always takes slightly longer.– Stefan
1 hour ago
I ran each version around 10 times, and the
2 * i * i version always takes slightly longer.– Stefan
1 hour ago
4
4
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
3
3
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
3
3
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago
|
show 9 more comments
5 Answers
5
active
oldest
votes
up vote
4
down vote
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.
Here are a few reasons why I think this is the case:
- Adding an
if (n == 0) n = 1statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same - The optimized version (by factoring out the multiplication by 2) is exactly as fast as the
2 * (i * i)version
Here is the test code that I used to draw these conclusions:
public static void main(String args) {
long fastVersion = 0;
long slowVersion = 0;
long optimizedVersion = 0;
long modifiedFastVersion = 0;
long modifiedSlowVersion = 0;
for (int i = 0; i < 10; i++) {
fastVersion += fastVersion();
slowVersion += slowVersion();
optimizedVersion += optimizedVersion();
modifiedFastVersion += modifiedFastVersion();
modifiedSlowVersion += modifiedSlowVersion();
}
System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}
private static long fastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long slowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
private static long optimizedVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
return System.nanoTime() - startTime;
}
private static long modifiedFastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long modifiedSlowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
add a comment |
up vote
1
down vote
ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:
public static void main(String args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why?
The Byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)):
- push const stack
- push local on stack
- push local on stack
- multiply top of stack
- multiply top of stack
Without brackets (2 * i * i):
- push const stack
- push local on stack
- multiply top of stack
- push local on stack
- multiply top of stack
Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.
add a comment |
up vote
1
down vote
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*ithenn*2is first
2 * (i * i)is second.
The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.
add a comment |
up vote
-1
down vote
I can't reproduce your claim. This java fiddle mostly gets the "opposite" result:

1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
|
show 3 more comments
up vote
-2
down vote
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
add a comment |
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.
Here are a few reasons why I think this is the case:
- Adding an
if (n == 0) n = 1statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same - The optimized version (by factoring out the multiplication by 2) is exactly as fast as the
2 * (i * i)version
Here is the test code that I used to draw these conclusions:
public static void main(String args) {
long fastVersion = 0;
long slowVersion = 0;
long optimizedVersion = 0;
long modifiedFastVersion = 0;
long modifiedSlowVersion = 0;
for (int i = 0; i < 10; i++) {
fastVersion += fastVersion();
slowVersion += slowVersion();
optimizedVersion += optimizedVersion();
modifiedFastVersion += modifiedFastVersion();
modifiedSlowVersion += modifiedSlowVersion();
}
System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}
private static long fastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long slowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
private static long optimizedVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
return System.nanoTime() - startTime;
}
private static long modifiedFastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long modifiedSlowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
add a comment |
up vote
4
down vote
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.
Here are a few reasons why I think this is the case:
- Adding an
if (n == 0) n = 1statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same - The optimized version (by factoring out the multiplication by 2) is exactly as fast as the
2 * (i * i)version
Here is the test code that I used to draw these conclusions:
public static void main(String args) {
long fastVersion = 0;
long slowVersion = 0;
long optimizedVersion = 0;
long modifiedFastVersion = 0;
long modifiedSlowVersion = 0;
for (int i = 0; i < 10; i++) {
fastVersion += fastVersion();
slowVersion += slowVersion();
optimizedVersion += optimizedVersion();
modifiedFastVersion += modifiedFastVersion();
modifiedSlowVersion += modifiedSlowVersion();
}
System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}
private static long fastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long slowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
private static long optimizedVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
return System.nanoTime() - startTime;
}
private static long modifiedFastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long modifiedSlowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
add a comment |
up vote
4
down vote
up vote
4
down vote
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.
Here are a few reasons why I think this is the case:
- Adding an
if (n == 0) n = 1statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same - The optimized version (by factoring out the multiplication by 2) is exactly as fast as the
2 * (i * i)version
Here is the test code that I used to draw these conclusions:
public static void main(String args) {
long fastVersion = 0;
long slowVersion = 0;
long optimizedVersion = 0;
long modifiedFastVersion = 0;
long modifiedSlowVersion = 0;
for (int i = 0; i < 10; i++) {
fastVersion += fastVersion();
slowVersion += slowVersion();
optimizedVersion += optimizedVersion();
modifiedFastVersion += modifiedFastVersion();
modifiedSlowVersion += modifiedSlowVersion();
}
System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}
private static long fastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long slowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
private static long optimizedVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
return System.nanoTime() - startTime;
}
private static long modifiedFastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long modifiedSlowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.
Here are a few reasons why I think this is the case:
- Adding an
if (n == 0) n = 1statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same - The optimized version (by factoring out the multiplication by 2) is exactly as fast as the
2 * (i * i)version
Here is the test code that I used to draw these conclusions:
public static void main(String args) {
long fastVersion = 0;
long slowVersion = 0;
long optimizedVersion = 0;
long modifiedFastVersion = 0;
long modifiedSlowVersion = 0;
for (int i = 0; i < 10; i++) {
fastVersion += fastVersion();
slowVersion += slowVersion();
optimizedVersion += optimizedVersion();
modifiedFastVersion += modifiedFastVersion();
modifiedSlowVersion += modifiedSlowVersion();
}
System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}
private static long fastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long slowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
private static long optimizedVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += i * i;
}
n *= 2;
return System.nanoTime() - startTime;
}
private static long modifiedFastVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * (i * i);
}
return System.nanoTime() - startTime;
}
private static long modifiedSlowVersion() {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
if (n == 0) n = 1;
n += 2 * i * i;
}
return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
answered 27 mins ago
Runemoro
1,35011135
1,35011135
add a comment |
add a comment |
up vote
1
down vote
ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:
public static void main(String args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why?
The Byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)):
- push const stack
- push local on stack
- push local on stack
- multiply top of stack
- multiply top of stack
Without brackets (2 * i * i):
- push const stack
- push local on stack
- multiply top of stack
- push local on stack
- multiply top of stack
Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.
add a comment |
up vote
1
down vote
ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:
public static void main(String args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why?
The Byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)):
- push const stack
- push local on stack
- push local on stack
- multiply top of stack
- multiply top of stack
Without brackets (2 * i * i):
- push const stack
- push local on stack
- multiply top of stack
- push local on stack
- multiply top of stack
Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.
add a comment |
up vote
1
down vote
up vote
1
down vote
ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:
public static void main(String args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why?
The Byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)):
- push const stack
- push local on stack
- push local on stack
- multiply top of stack
- multiply top of stack
Without brackets (2 * i * i):
- push const stack
- push local on stack
- multiply top of stack
- push local on stack
- multiply top of stack
Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.
ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:
public static void main(String args) {
int repeat = 10;
long A = 0;
long B = 0;
for (int i = 0; i < repeat; i++) {
A += test();
B += testB();
}
System.out.println(A / repeat + " ms");
System.out.println(B / repeat + " ms");
}
private static long test() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multi(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multi(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms A " + n);
return ms;
}
private static long testB() {
int n = 0;
for (int i = 0; i < 1000; i++) {
n += multiB(i);
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
n += multiB(i);
}
long ms = (System.currentTimeMillis() - startTime);
System.out.println(ms + " ms B " + n);
return ms;
}
private static int multiB(int i) {
return 2 * (i * i);
}
private static int multi(int i) {
return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why?
The Byte code is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being:
With brackets (2 * (i * i)):
- push const stack
- push local on stack
- push local on stack
- multiply top of stack
- multiply top of stack
Without brackets (2 * i * i):
- push const stack
- push local on stack
- multiply top of stack
- push local on stack
- multiply top of stack
Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.
edited 42 mins ago
answered 53 mins ago
DSchmidt
18019
18019
add a comment |
add a comment |
up vote
1
down vote
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*ithenn*2is first
2 * (i * i)is second.
The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.
add a comment |
up vote
1
down vote
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*ithenn*2is first
2 * (i * i)is second.
The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.
add a comment |
up vote
1
down vote
up vote
1
down vote
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*ithenn*2is first
2 * (i * i)is second.
The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Param({ "100", "1000", "1000000000" })
private int size;
@Benchmark
public int two_square_i() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * (i * i);
}
return n;
}
@Benchmark
public int square_i_two() {
int n = 0;
for (int i = 0; i < size; i++) {
n += i * i;
}
return 2*n;
}
@Benchmark
public int two_i_() {
int n = 0;
for (int i = 0; i < size; i++) {
n += 2 * i * i;
}
return n;
}
}
The result are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*ithenn*2is first
2 * (i * i)is second.
The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.
answered 1 min ago
NoDataFound
5,3161739
5,3161739
add a comment |
add a comment |
up vote
-1
down vote
I can't reproduce your claim. This java fiddle mostly gets the "opposite" result:

1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
|
show 3 more comments
up vote
-1
down vote
I can't reproduce your claim. This java fiddle mostly gets the "opposite" result:

1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
|
show 3 more comments
up vote
-1
down vote
up vote
-1
down vote
I can't reproduce your claim. This java fiddle mostly gets the "opposite" result:

I can't reproduce your claim. This java fiddle mostly gets the "opposite" result:

edited 46 mins ago
answered 1 hour ago
Caius Jard
7,83111135
7,83111135
1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
|
show 3 more comments
1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
1
1
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
Note - I wouldn't trust running a profile through an online service that is sharing CPU time with all the other requests going through it - so much more margin for error this way.
– Krease
1 hour ago
3
3
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
If you can't reproduce the results, comment and vote as such. This is not an answer.
– Sotirios Delimanolis
1 hour ago
1
1
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
Not completely resistant, but definitely much more resistant, since my own machine isn't hosting a webapp allowing people to run arbitrary programs on it
– Krease
1 hour ago
1
1
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
switch them in the main and your result should switch... this is not an answer
– DSchmidt
50 mins ago
2
2
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
OK, it's not an answer - can't post an image in a comment, and that fiddle can't save a url to put in a comment.. I think this question isn't a question, as much as this answer isn't an answer
– Caius Jard
49 mins ago
|
show 3 more comments
up vote
-2
down vote
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
add a comment |
up vote
-2
down vote
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
add a comment |
up vote
-2
down vote
up vote
-2
down vote
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
answered 1 hour ago
paulsm4
75.6k898122
75.6k898122
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
add a comment |
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
1
1
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway.
– nullpointer
1 hour ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant"
– paulsm4
57 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above
– Krease
55 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
With JIT, the bytecode doesn't matter much. Show the JIT code.
– rustyx
13 mins ago
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452713%2fwhy-is-2-i-i-faster-than-2-i-i%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
5
How sure are you that your measurements are done correctly ? How many times did you run each version and how static was your test environment?
– Joakim Danielson
1 hour ago
2
I ran each version around 10 times, and the
2 * i * iversion always takes slightly longer.– Stefan
1 hour ago
4
look at the bytecode using a dissassmber.
– OldProgrammer
1 hour ago
3
Also please see: stackoverflow.com/questions/504103/…
– lexicore
1 hour ago
3
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs.
– Jorn Vernee
42 mins ago