Surprisingly different performance of simple C# program












20















Below is a simple program that with a small change, makes a significant performance impact and I don't understand why.



What the program does is not really relevant, but it calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall. What I noticed as I was changing the code around was a quite large variance in performance.



The rows in question are the commented ones which are mathematically equivalent. Using the slow version makes the entire program take roughly twice as long as using the fast version.



int iterations = 0;

for (int i = 4; i < 9; i++)
{
Stopwatch s = Stopwatch.StartNew();

double ms = 1.0;
double mL = Math.Pow(100.0, i);
double uL = 1.0;
double us = 0.0;
double msmLInv = 1d / (ms + mL);

long collisions = 0;
while (!(uL < 0 && us <= 0 && uL <= us))
{
Debug.Assert(++iterations > 0);
++collisions;

double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;

//double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv; //fast
double vL = uL + (us - vs) / mL; //slow


Debug.Assert(Math.Abs(((2 * ms * us - uL * (ms - mL)) * msmLInv) - (uL + (us - vs) / mL)) < 0.001d); //checks equality between fast and slow
if (vs > 0)
{
++collisions;
vs = -vs;
}

us = vs;
uL = vL;
}

s.Stop();


Debug.Assert(collisions.ToString() == "314159265359".Substring(0, i + 1)); //check the correctness
Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
}

Debug.Assert(iterations == 174531180); //check that we dont skip loops

Console.Write("Waiting...");
Console.ReadKey();


My intuition says that because the fast version has 7 operations compared to 4 operations of the slow one, the slow one should be faster, but it is not.



I disassembled the program using .NET Reflector which shows that they are mostly equal, as expected, except for the part shown below. The code before and after an identical



//slow
ldloc.s uL
ldloc.2
ldloc.s us
ldloc.s vs
sub
mul
ldloc.3
div
add




//fast
ldc.r8 2
ldloc.2
mul
ldloc.s us
mul
ldloc.s uL
ldloc.2
ldloc.3
sub
mul
sub
ldloc.2
ldloc.3
add
div


This also shows that more code is executing with the fast version which also would lead me to expect it to be slower.



The only guess I have right now is that the slow version causes more cache misses, but I don't know how to measure that (a guide would be welcome). Other than that I am at a loss.





EDIT 1.
As per the request of @EricLippert here is the disassembly from the JIT for the inner while loop where the difference is.



EDIT 2.
Solved how to break in the release program and updated the disassembly so now there seems to be some difference. I got these results by running the release version, stopping the program in the same function with a ReadKey, attaching the debugger, making the program continue execution, breaking on the next row, going into disassembly window (ctrl+alt+d)



EDIT 3.
Change the code to an updated example base on all the suggestions.



//slow
78:
79: vs = (2 * mL * uL + us * (ms - mL)) / (ms + mL);
00C10530 call CA9AD013
00C10535 fdiv st,st(3)
00C10537 faddp st(2),st
80:
81: //double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
82: double vL = uL + ms * (us - vs) / mL; //slow
00C10539 fldz
00C1053B fcomip st,st(1)
00C1053D jp 00C10549
00C1053F jae 00C10549
00C10541 add ebx,1
00C10544 adc edi,0
00C10547 fchs
00C10549 fld st(1)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
00C1054B fldz
00C1054D fcomip st,st(3)
00C1054F fstp st(2)
00C10551 jp 00C10508
00C10553 jbe 00C10508
00C10555 fldz
00C10557 fcomip st,st(1)
00C10559 jp 00C10508
00C1055B jb 00C10508
00C1055D fxch st(1)
00C1055F fcomi st,st(1)
00C10561 jnp 00C10567
00C10563 fxch st(1)
00C10565 jmp 00C10508
00C10567 jbe 00C1056D
00C10569 fxch st(1)
00C1056B jmp 00C10508
00C1056D fstp st(1)
00C1056F fstp st(0)
00C10571 fstp st(0)
92: }
93:
94: s.Stop();
00C10573 mov ecx,esi
00C10575 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
00C1057A mov ecx,725B0994h
00C1057F call 00B930C8
00C10584 mov edx,eax
00C10586 mov eax,dword ptr [ebp-14h]
00C10589 mov dword ptr [edx+4],eax
00C1058C mov dword ptr [ebp-34h],edx
00C1058F mov ecx,725F3778h
00C10594 call 00B930C8
00C10599 mov dword ptr [ebp-38h],eax
00C1059C mov ecx,725F2C10h
00C105A1 call 00B930C8
00C105A6 mov dword ptr [ebp-3Ch],eax
00C105A9 mov ecx,esi
00C105AB call 71835820
00C105B0 push edx
00C105B1 push eax
00C105B2 push 0
00C105B4 push 2710h
00C105B9 call 736071A0
00C105BE mov dword ptr [ebp-48h],eax
00C105C1 mov dword ptr [ebp-44h],edx
00C105C4 fild qword ptr [ebp-48h]
00C105C7 fstp dword ptr [ebp-40h]
00C105CA fld dword ptr [ebp-40h]
00C105CD fdiv dword ptr ds:[0C10678h]
00C105D3 mov eax,dword ptr [ebp-38h]
00C105D6 fstp dword ptr [eax+4]
00C105D9 mov edx,dword ptr [ebp-38h]
00C105DC mov eax,dword ptr [ebp-3Ch]
00C105DF mov dword ptr [eax+4],ebx
00C105E2 mov dword ptr [eax+8],edi
00C105E5 mov esi,dword ptr [ebp-3Ch]
00C105E8 lea edi,[ebp-30h]
00C105EB xorps xmm0,xmm0
00C105EE movq mmword ptr [edi],xmm0
00C105F2 movq mmword ptr [edi+8],xmm0
00C105F7 push edx
00C105F8 push esi
00C105F9 lea ecx,[ebp-30h]
00C105FC mov edx,dword ptr [ebp-34h]
00C105FF call 724A2ED4
00C10604 lea eax,[ebp-30h]
00C10607 push dword ptr [eax+0Ch]
00C1060A push dword ptr [eax+8]
00C1060D push dword ptr [eax+4]
00C10610 push dword ptr [eax]
00C10612 mov edx,dword ptr ds:[3832310h]
00C10618 xor ecx,ecx
00C1061A call 72497A00
00C1061F mov ecx,eax
00C10621 call 72571934
61: for (int i = 4; i < 9; i++)
00C10626 inc dword ptr [ebp-14h]
00C10629 cmp dword ptr [ebp-14h],9
00C1062D jl 00C10496
97: }
98:
99: Console.WriteLine(loops);
00C10633 mov ecx,dword ptr [ebp-10h]
00C10636 call 72C583FC
100: Console.Write("Waiting...");
00C1063B mov ecx,dword ptr ds:[3832314h]
00C10641 call 724C67F0
00C10646 lea ecx,[ebp-20h]
00C10649 xor edx,edx
00C1064B call 72C57984
00C10650 lea esp,[ebp-0Ch]
00C10653 pop ebx
00C10654 pop esi
00C10655 pop edi
00C10656 pop ebp
00C10657 ret




//fast
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0550 or al,83h
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0552 ret
02FD0553 add dword ptr [ebx-3626FF29h],eax
02FD0559 fchs
02FD055B fxch st(1)
02FD055D fld st(0)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
02FD055F fldz
02FD0561 fcomip st,st(2)
02FD0563 fstp st(1)
02FD0565 jnp 02FD056B
02FD0567 fxch st(1)
02FD0569 jmp 02FD050B
02FD056B ja 02FD0571
02FD056D fxch st(1)
02FD056F jmp 02FD050B
02FD0571 fldz
02FD0573 fcomip st,st(2)
02FD0575 jnp 02FD057B
02FD0577 fxch st(1)
02FD0579 jmp 02FD050B
02FD057B jae 02FD0581
02FD057D fxch st(1)
02FD057F jmp 02FD050B
02FD0581 fcomi st,st(1)
02FD0583 jnp 02FD0589
02FD0585 fxch st(1)
02FD0587 jmp 02FD050B
02FD0589 jbe 02FD0592
02FD058B fxch st(1)
02FD058D jmp 02FD050B
02FD0592 fstp st(1)
02FD0594 fstp st(0)
92: }
93:
94: s.Stop();
02FD0596 mov ecx,esi
02FD0598 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
02FD059D mov ecx,725B0994h
02FD05A2 call 013830C8
02FD05A7 mov edx,eax
02FD05A9 mov eax,dword ptr [ebp-14h]
02FD05AC mov dword ptr [edx+4],eax
02FD05AF mov dword ptr [ebp-3Ch],edx
02FD05B2 mov ecx,725F3778h
02FD05B7 call 013830C8
02FD05BC mov dword ptr [ebp-40h],eax
02FD05BF mov ecx,725F2C10h
02FD05C4 call 013830C8
02FD05C9 mov dword ptr [ebp-44h],eax
02FD05CC mov ecx,esi
02FD05CE call 71835820
02FD05D3 push edx
02FD05D4 push eax
02FD05D5 push 0
02FD05D7 push 2710h
02FD05DC call 736071A0
02FD05E1 mov dword ptr [ebp-50h],eax
02FD05E4 mov dword ptr [ebp-4Ch],edx
02FD05E7 fild qword ptr [ebp-50h]
02FD05EA fstp dword ptr [ebp-48h]
02FD05ED fld dword ptr [ebp-48h]
02FD05F0 fdiv dword ptr ds:[2FD06A8h]
02FD05F6 mov eax,dword ptr [ebp-40h]
02FD05F9 fstp dword ptr [eax+4]
02FD05FC mov edx,dword ptr [ebp-40h]
02FD05FF mov eax,dword ptr [ebp-44h]
02FD0602 mov dword ptr [eax+4],ebx
02FD0605 mov dword ptr [eax+8],edi
02FD0608 mov esi,dword ptr [ebp-44h]
02FD060B lea edi,[ebp-38h]
02FD060E xorps xmm0,xmm0
02FD0611 movq mmword ptr [edi],xmm0
02FD0615 movq mmword ptr [edi+8],xmm0
02FD061A push edx
02FD061B push esi
02FD061C lea ecx,[ebp-38h]
02FD061F mov edx,dword ptr [ebp-3Ch]
02FD0622 call 724A2ED4
02FD0627 lea eax,[ebp-38h]
02FD062A push dword ptr [eax+0Ch]
02FD062D push dword ptr [eax+8]
02FD0630 push dword ptr [eax+4]
02FD0633 push dword ptr [eax]
02FD0635 mov edx,dword ptr ds:[4142310h]
02FD063B xor ecx,ecx
02FD063D call 72497A00
02FD0642 mov ecx,eax
02FD0644 call 72571934
61: for (int i = 4; i < 9; i++)
02FD0649 inc dword ptr [ebp-14h]
02FD064C cmp dword ptr [ebp-14h],9
02FD0650 jl 02FD0496
97: }
98:
99: Console.WriteLine(loops);
02FD0656 mov ecx,dword ptr [ebp-10h]
02FD0659 call 72C583FC
100: Console.Write("Waiting...");
02FD065E mov ecx,dword ptr ds:[4142314h]
02FD0664 call 724C67F0
02FD0669 lea ecx,[ebp-28h]
02FD066C xor edx,edx
02FD066E call 72C57984
02FD0673 lea esp,[ebp-0Ch]
02FD0676 pop ebx
02FD0677 pop esi
02FD0678 pop edi
02FD0679 pop ebp
02FD067A ret












share|improve this question




















  • 3





    I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

    – DanDan
    Jan 14 at 20:36








  • 1





    calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

    – GSerg
    Jan 14 at 20:41






  • 3





    This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

    – Eric Lippert
    Jan 14 at 21:06






  • 1





    My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

    – Eric Lippert
    Jan 14 at 21:19






  • 2





    It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

    – Eric Lippert
    Jan 14 at 21:25
















20















Below is a simple program that with a small change, makes a significant performance impact and I don't understand why.



What the program does is not really relevant, but it calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall. What I noticed as I was changing the code around was a quite large variance in performance.



The rows in question are the commented ones which are mathematically equivalent. Using the slow version makes the entire program take roughly twice as long as using the fast version.



int iterations = 0;

for (int i = 4; i < 9; i++)
{
Stopwatch s = Stopwatch.StartNew();

double ms = 1.0;
double mL = Math.Pow(100.0, i);
double uL = 1.0;
double us = 0.0;
double msmLInv = 1d / (ms + mL);

long collisions = 0;
while (!(uL < 0 && us <= 0 && uL <= us))
{
Debug.Assert(++iterations > 0);
++collisions;

double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;

//double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv; //fast
double vL = uL + (us - vs) / mL; //slow


Debug.Assert(Math.Abs(((2 * ms * us - uL * (ms - mL)) * msmLInv) - (uL + (us - vs) / mL)) < 0.001d); //checks equality between fast and slow
if (vs > 0)
{
++collisions;
vs = -vs;
}

us = vs;
uL = vL;
}

s.Stop();


Debug.Assert(collisions.ToString() == "314159265359".Substring(0, i + 1)); //check the correctness
Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
}

Debug.Assert(iterations == 174531180); //check that we dont skip loops

Console.Write("Waiting...");
Console.ReadKey();


My intuition says that because the fast version has 7 operations compared to 4 operations of the slow one, the slow one should be faster, but it is not.



I disassembled the program using .NET Reflector which shows that they are mostly equal, as expected, except for the part shown below. The code before and after an identical



//slow
ldloc.s uL
ldloc.2
ldloc.s us
ldloc.s vs
sub
mul
ldloc.3
div
add




//fast
ldc.r8 2
ldloc.2
mul
ldloc.s us
mul
ldloc.s uL
ldloc.2
ldloc.3
sub
mul
sub
ldloc.2
ldloc.3
add
div


This also shows that more code is executing with the fast version which also would lead me to expect it to be slower.



The only guess I have right now is that the slow version causes more cache misses, but I don't know how to measure that (a guide would be welcome). Other than that I am at a loss.





EDIT 1.
As per the request of @EricLippert here is the disassembly from the JIT for the inner while loop where the difference is.



EDIT 2.
Solved how to break in the release program and updated the disassembly so now there seems to be some difference. I got these results by running the release version, stopping the program in the same function with a ReadKey, attaching the debugger, making the program continue execution, breaking on the next row, going into disassembly window (ctrl+alt+d)



EDIT 3.
Change the code to an updated example base on all the suggestions.



//slow
78:
79: vs = (2 * mL * uL + us * (ms - mL)) / (ms + mL);
00C10530 call CA9AD013
00C10535 fdiv st,st(3)
00C10537 faddp st(2),st
80:
81: //double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
82: double vL = uL + ms * (us - vs) / mL; //slow
00C10539 fldz
00C1053B fcomip st,st(1)
00C1053D jp 00C10549
00C1053F jae 00C10549
00C10541 add ebx,1
00C10544 adc edi,0
00C10547 fchs
00C10549 fld st(1)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
00C1054B fldz
00C1054D fcomip st,st(3)
00C1054F fstp st(2)
00C10551 jp 00C10508
00C10553 jbe 00C10508
00C10555 fldz
00C10557 fcomip st,st(1)
00C10559 jp 00C10508
00C1055B jb 00C10508
00C1055D fxch st(1)
00C1055F fcomi st,st(1)
00C10561 jnp 00C10567
00C10563 fxch st(1)
00C10565 jmp 00C10508
00C10567 jbe 00C1056D
00C10569 fxch st(1)
00C1056B jmp 00C10508
00C1056D fstp st(1)
00C1056F fstp st(0)
00C10571 fstp st(0)
92: }
93:
94: s.Stop();
00C10573 mov ecx,esi
00C10575 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
00C1057A mov ecx,725B0994h
00C1057F call 00B930C8
00C10584 mov edx,eax
00C10586 mov eax,dword ptr [ebp-14h]
00C10589 mov dword ptr [edx+4],eax
00C1058C mov dword ptr [ebp-34h],edx
00C1058F mov ecx,725F3778h
00C10594 call 00B930C8
00C10599 mov dword ptr [ebp-38h],eax
00C1059C mov ecx,725F2C10h
00C105A1 call 00B930C8
00C105A6 mov dword ptr [ebp-3Ch],eax
00C105A9 mov ecx,esi
00C105AB call 71835820
00C105B0 push edx
00C105B1 push eax
00C105B2 push 0
00C105B4 push 2710h
00C105B9 call 736071A0
00C105BE mov dword ptr [ebp-48h],eax
00C105C1 mov dword ptr [ebp-44h],edx
00C105C4 fild qword ptr [ebp-48h]
00C105C7 fstp dword ptr [ebp-40h]
00C105CA fld dword ptr [ebp-40h]
00C105CD fdiv dword ptr ds:[0C10678h]
00C105D3 mov eax,dword ptr [ebp-38h]
00C105D6 fstp dword ptr [eax+4]
00C105D9 mov edx,dword ptr [ebp-38h]
00C105DC mov eax,dword ptr [ebp-3Ch]
00C105DF mov dword ptr [eax+4],ebx
00C105E2 mov dword ptr [eax+8],edi
00C105E5 mov esi,dword ptr [ebp-3Ch]
00C105E8 lea edi,[ebp-30h]
00C105EB xorps xmm0,xmm0
00C105EE movq mmword ptr [edi],xmm0
00C105F2 movq mmword ptr [edi+8],xmm0
00C105F7 push edx
00C105F8 push esi
00C105F9 lea ecx,[ebp-30h]
00C105FC mov edx,dword ptr [ebp-34h]
00C105FF call 724A2ED4
00C10604 lea eax,[ebp-30h]
00C10607 push dword ptr [eax+0Ch]
00C1060A push dword ptr [eax+8]
00C1060D push dword ptr [eax+4]
00C10610 push dword ptr [eax]
00C10612 mov edx,dword ptr ds:[3832310h]
00C10618 xor ecx,ecx
00C1061A call 72497A00
00C1061F mov ecx,eax
00C10621 call 72571934
61: for (int i = 4; i < 9; i++)
00C10626 inc dword ptr [ebp-14h]
00C10629 cmp dword ptr [ebp-14h],9
00C1062D jl 00C10496
97: }
98:
99: Console.WriteLine(loops);
00C10633 mov ecx,dword ptr [ebp-10h]
00C10636 call 72C583FC
100: Console.Write("Waiting...");
00C1063B mov ecx,dword ptr ds:[3832314h]
00C10641 call 724C67F0
00C10646 lea ecx,[ebp-20h]
00C10649 xor edx,edx
00C1064B call 72C57984
00C10650 lea esp,[ebp-0Ch]
00C10653 pop ebx
00C10654 pop esi
00C10655 pop edi
00C10656 pop ebp
00C10657 ret




//fast
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0550 or al,83h
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0552 ret
02FD0553 add dword ptr [ebx-3626FF29h],eax
02FD0559 fchs
02FD055B fxch st(1)
02FD055D fld st(0)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
02FD055F fldz
02FD0561 fcomip st,st(2)
02FD0563 fstp st(1)
02FD0565 jnp 02FD056B
02FD0567 fxch st(1)
02FD0569 jmp 02FD050B
02FD056B ja 02FD0571
02FD056D fxch st(1)
02FD056F jmp 02FD050B
02FD0571 fldz
02FD0573 fcomip st,st(2)
02FD0575 jnp 02FD057B
02FD0577 fxch st(1)
02FD0579 jmp 02FD050B
02FD057B jae 02FD0581
02FD057D fxch st(1)
02FD057F jmp 02FD050B
02FD0581 fcomi st,st(1)
02FD0583 jnp 02FD0589
02FD0585 fxch st(1)
02FD0587 jmp 02FD050B
02FD0589 jbe 02FD0592
02FD058B fxch st(1)
02FD058D jmp 02FD050B
02FD0592 fstp st(1)
02FD0594 fstp st(0)
92: }
93:
94: s.Stop();
02FD0596 mov ecx,esi
02FD0598 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
02FD059D mov ecx,725B0994h
02FD05A2 call 013830C8
02FD05A7 mov edx,eax
02FD05A9 mov eax,dword ptr [ebp-14h]
02FD05AC mov dword ptr [edx+4],eax
02FD05AF mov dword ptr [ebp-3Ch],edx
02FD05B2 mov ecx,725F3778h
02FD05B7 call 013830C8
02FD05BC mov dword ptr [ebp-40h],eax
02FD05BF mov ecx,725F2C10h
02FD05C4 call 013830C8
02FD05C9 mov dword ptr [ebp-44h],eax
02FD05CC mov ecx,esi
02FD05CE call 71835820
02FD05D3 push edx
02FD05D4 push eax
02FD05D5 push 0
02FD05D7 push 2710h
02FD05DC call 736071A0
02FD05E1 mov dword ptr [ebp-50h],eax
02FD05E4 mov dword ptr [ebp-4Ch],edx
02FD05E7 fild qword ptr [ebp-50h]
02FD05EA fstp dword ptr [ebp-48h]
02FD05ED fld dword ptr [ebp-48h]
02FD05F0 fdiv dword ptr ds:[2FD06A8h]
02FD05F6 mov eax,dword ptr [ebp-40h]
02FD05F9 fstp dword ptr [eax+4]
02FD05FC mov edx,dword ptr [ebp-40h]
02FD05FF mov eax,dword ptr [ebp-44h]
02FD0602 mov dword ptr [eax+4],ebx
02FD0605 mov dword ptr [eax+8],edi
02FD0608 mov esi,dword ptr [ebp-44h]
02FD060B lea edi,[ebp-38h]
02FD060E xorps xmm0,xmm0
02FD0611 movq mmword ptr [edi],xmm0
02FD0615 movq mmword ptr [edi+8],xmm0
02FD061A push edx
02FD061B push esi
02FD061C lea ecx,[ebp-38h]
02FD061F mov edx,dword ptr [ebp-3Ch]
02FD0622 call 724A2ED4
02FD0627 lea eax,[ebp-38h]
02FD062A push dword ptr [eax+0Ch]
02FD062D push dword ptr [eax+8]
02FD0630 push dword ptr [eax+4]
02FD0633 push dword ptr [eax]
02FD0635 mov edx,dword ptr ds:[4142310h]
02FD063B xor ecx,ecx
02FD063D call 72497A00
02FD0642 mov ecx,eax
02FD0644 call 72571934
61: for (int i = 4; i < 9; i++)
02FD0649 inc dword ptr [ebp-14h]
02FD064C cmp dword ptr [ebp-14h],9
02FD0650 jl 02FD0496
97: }
98:
99: Console.WriteLine(loops);
02FD0656 mov ecx,dword ptr [ebp-10h]
02FD0659 call 72C583FC
100: Console.Write("Waiting...");
02FD065E mov ecx,dword ptr ds:[4142314h]
02FD0664 call 724C67F0
02FD0669 lea ecx,[ebp-28h]
02FD066C xor edx,edx
02FD066E call 72C57984
02FD0673 lea esp,[ebp-0Ch]
02FD0676 pop ebx
02FD0677 pop esi
02FD0678 pop edi
02FD0679 pop ebp
02FD067A ret












share|improve this question




















  • 3





    I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

    – DanDan
    Jan 14 at 20:36








  • 1





    calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

    – GSerg
    Jan 14 at 20:41






  • 3





    This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

    – Eric Lippert
    Jan 14 at 21:06






  • 1





    My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

    – Eric Lippert
    Jan 14 at 21:19






  • 2





    It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

    – Eric Lippert
    Jan 14 at 21:25














20












20








20


4






Below is a simple program that with a small change, makes a significant performance impact and I don't understand why.



What the program does is not really relevant, but it calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall. What I noticed as I was changing the code around was a quite large variance in performance.



The rows in question are the commented ones which are mathematically equivalent. Using the slow version makes the entire program take roughly twice as long as using the fast version.



int iterations = 0;

for (int i = 4; i < 9; i++)
{
Stopwatch s = Stopwatch.StartNew();

double ms = 1.0;
double mL = Math.Pow(100.0, i);
double uL = 1.0;
double us = 0.0;
double msmLInv = 1d / (ms + mL);

long collisions = 0;
while (!(uL < 0 && us <= 0 && uL <= us))
{
Debug.Assert(++iterations > 0);
++collisions;

double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;

//double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv; //fast
double vL = uL + (us - vs) / mL; //slow


Debug.Assert(Math.Abs(((2 * ms * us - uL * (ms - mL)) * msmLInv) - (uL + (us - vs) / mL)) < 0.001d); //checks equality between fast and slow
if (vs > 0)
{
++collisions;
vs = -vs;
}

us = vs;
uL = vL;
}

s.Stop();


Debug.Assert(collisions.ToString() == "314159265359".Substring(0, i + 1)); //check the correctness
Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
}

Debug.Assert(iterations == 174531180); //check that we dont skip loops

Console.Write("Waiting...");
Console.ReadKey();


My intuition says that because the fast version has 7 operations compared to 4 operations of the slow one, the slow one should be faster, but it is not.



I disassembled the program using .NET Reflector which shows that they are mostly equal, as expected, except for the part shown below. The code before and after an identical



//slow
ldloc.s uL
ldloc.2
ldloc.s us
ldloc.s vs
sub
mul
ldloc.3
div
add




//fast
ldc.r8 2
ldloc.2
mul
ldloc.s us
mul
ldloc.s uL
ldloc.2
ldloc.3
sub
mul
sub
ldloc.2
ldloc.3
add
div


This also shows that more code is executing with the fast version which also would lead me to expect it to be slower.



The only guess I have right now is that the slow version causes more cache misses, but I don't know how to measure that (a guide would be welcome). Other than that I am at a loss.





EDIT 1.
As per the request of @EricLippert here is the disassembly from the JIT for the inner while loop where the difference is.



EDIT 2.
Solved how to break in the release program and updated the disassembly so now there seems to be some difference. I got these results by running the release version, stopping the program in the same function with a ReadKey, attaching the debugger, making the program continue execution, breaking on the next row, going into disassembly window (ctrl+alt+d)



EDIT 3.
Change the code to an updated example base on all the suggestions.



//slow
78:
79: vs = (2 * mL * uL + us * (ms - mL)) / (ms + mL);
00C10530 call CA9AD013
00C10535 fdiv st,st(3)
00C10537 faddp st(2),st
80:
81: //double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
82: double vL = uL + ms * (us - vs) / mL; //slow
00C10539 fldz
00C1053B fcomip st,st(1)
00C1053D jp 00C10549
00C1053F jae 00C10549
00C10541 add ebx,1
00C10544 adc edi,0
00C10547 fchs
00C10549 fld st(1)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
00C1054B fldz
00C1054D fcomip st,st(3)
00C1054F fstp st(2)
00C10551 jp 00C10508
00C10553 jbe 00C10508
00C10555 fldz
00C10557 fcomip st,st(1)
00C10559 jp 00C10508
00C1055B jb 00C10508
00C1055D fxch st(1)
00C1055F fcomi st,st(1)
00C10561 jnp 00C10567
00C10563 fxch st(1)
00C10565 jmp 00C10508
00C10567 jbe 00C1056D
00C10569 fxch st(1)
00C1056B jmp 00C10508
00C1056D fstp st(1)
00C1056F fstp st(0)
00C10571 fstp st(0)
92: }
93:
94: s.Stop();
00C10573 mov ecx,esi
00C10575 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
00C1057A mov ecx,725B0994h
00C1057F call 00B930C8
00C10584 mov edx,eax
00C10586 mov eax,dword ptr [ebp-14h]
00C10589 mov dword ptr [edx+4],eax
00C1058C mov dword ptr [ebp-34h],edx
00C1058F mov ecx,725F3778h
00C10594 call 00B930C8
00C10599 mov dword ptr [ebp-38h],eax
00C1059C mov ecx,725F2C10h
00C105A1 call 00B930C8
00C105A6 mov dword ptr [ebp-3Ch],eax
00C105A9 mov ecx,esi
00C105AB call 71835820
00C105B0 push edx
00C105B1 push eax
00C105B2 push 0
00C105B4 push 2710h
00C105B9 call 736071A0
00C105BE mov dword ptr [ebp-48h],eax
00C105C1 mov dword ptr [ebp-44h],edx
00C105C4 fild qword ptr [ebp-48h]
00C105C7 fstp dword ptr [ebp-40h]
00C105CA fld dword ptr [ebp-40h]
00C105CD fdiv dword ptr ds:[0C10678h]
00C105D3 mov eax,dword ptr [ebp-38h]
00C105D6 fstp dword ptr [eax+4]
00C105D9 mov edx,dword ptr [ebp-38h]
00C105DC mov eax,dword ptr [ebp-3Ch]
00C105DF mov dword ptr [eax+4],ebx
00C105E2 mov dword ptr [eax+8],edi
00C105E5 mov esi,dword ptr [ebp-3Ch]
00C105E8 lea edi,[ebp-30h]
00C105EB xorps xmm0,xmm0
00C105EE movq mmword ptr [edi],xmm0
00C105F2 movq mmword ptr [edi+8],xmm0
00C105F7 push edx
00C105F8 push esi
00C105F9 lea ecx,[ebp-30h]
00C105FC mov edx,dword ptr [ebp-34h]
00C105FF call 724A2ED4
00C10604 lea eax,[ebp-30h]
00C10607 push dword ptr [eax+0Ch]
00C1060A push dword ptr [eax+8]
00C1060D push dword ptr [eax+4]
00C10610 push dword ptr [eax]
00C10612 mov edx,dword ptr ds:[3832310h]
00C10618 xor ecx,ecx
00C1061A call 72497A00
00C1061F mov ecx,eax
00C10621 call 72571934
61: for (int i = 4; i < 9; i++)
00C10626 inc dword ptr [ebp-14h]
00C10629 cmp dword ptr [ebp-14h],9
00C1062D jl 00C10496
97: }
98:
99: Console.WriteLine(loops);
00C10633 mov ecx,dword ptr [ebp-10h]
00C10636 call 72C583FC
100: Console.Write("Waiting...");
00C1063B mov ecx,dword ptr ds:[3832314h]
00C10641 call 724C67F0
00C10646 lea ecx,[ebp-20h]
00C10649 xor edx,edx
00C1064B call 72C57984
00C10650 lea esp,[ebp-0Ch]
00C10653 pop ebx
00C10654 pop esi
00C10655 pop edi
00C10656 pop ebp
00C10657 ret




//fast
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0550 or al,83h
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0552 ret
02FD0553 add dword ptr [ebx-3626FF29h],eax
02FD0559 fchs
02FD055B fxch st(1)
02FD055D fld st(0)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
02FD055F fldz
02FD0561 fcomip st,st(2)
02FD0563 fstp st(1)
02FD0565 jnp 02FD056B
02FD0567 fxch st(1)
02FD0569 jmp 02FD050B
02FD056B ja 02FD0571
02FD056D fxch st(1)
02FD056F jmp 02FD050B
02FD0571 fldz
02FD0573 fcomip st,st(2)
02FD0575 jnp 02FD057B
02FD0577 fxch st(1)
02FD0579 jmp 02FD050B
02FD057B jae 02FD0581
02FD057D fxch st(1)
02FD057F jmp 02FD050B
02FD0581 fcomi st,st(1)
02FD0583 jnp 02FD0589
02FD0585 fxch st(1)
02FD0587 jmp 02FD050B
02FD0589 jbe 02FD0592
02FD058B fxch st(1)
02FD058D jmp 02FD050B
02FD0592 fstp st(1)
02FD0594 fstp st(0)
92: }
93:
94: s.Stop();
02FD0596 mov ecx,esi
02FD0598 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
02FD059D mov ecx,725B0994h
02FD05A2 call 013830C8
02FD05A7 mov edx,eax
02FD05A9 mov eax,dword ptr [ebp-14h]
02FD05AC mov dword ptr [edx+4],eax
02FD05AF mov dword ptr [ebp-3Ch],edx
02FD05B2 mov ecx,725F3778h
02FD05B7 call 013830C8
02FD05BC mov dword ptr [ebp-40h],eax
02FD05BF mov ecx,725F2C10h
02FD05C4 call 013830C8
02FD05C9 mov dword ptr [ebp-44h],eax
02FD05CC mov ecx,esi
02FD05CE call 71835820
02FD05D3 push edx
02FD05D4 push eax
02FD05D5 push 0
02FD05D7 push 2710h
02FD05DC call 736071A0
02FD05E1 mov dword ptr [ebp-50h],eax
02FD05E4 mov dword ptr [ebp-4Ch],edx
02FD05E7 fild qword ptr [ebp-50h]
02FD05EA fstp dword ptr [ebp-48h]
02FD05ED fld dword ptr [ebp-48h]
02FD05F0 fdiv dword ptr ds:[2FD06A8h]
02FD05F6 mov eax,dword ptr [ebp-40h]
02FD05F9 fstp dword ptr [eax+4]
02FD05FC mov edx,dword ptr [ebp-40h]
02FD05FF mov eax,dword ptr [ebp-44h]
02FD0602 mov dword ptr [eax+4],ebx
02FD0605 mov dword ptr [eax+8],edi
02FD0608 mov esi,dword ptr [ebp-44h]
02FD060B lea edi,[ebp-38h]
02FD060E xorps xmm0,xmm0
02FD0611 movq mmword ptr [edi],xmm0
02FD0615 movq mmword ptr [edi+8],xmm0
02FD061A push edx
02FD061B push esi
02FD061C lea ecx,[ebp-38h]
02FD061F mov edx,dword ptr [ebp-3Ch]
02FD0622 call 724A2ED4
02FD0627 lea eax,[ebp-38h]
02FD062A push dword ptr [eax+0Ch]
02FD062D push dword ptr [eax+8]
02FD0630 push dword ptr [eax+4]
02FD0633 push dword ptr [eax]
02FD0635 mov edx,dword ptr ds:[4142310h]
02FD063B xor ecx,ecx
02FD063D call 72497A00
02FD0642 mov ecx,eax
02FD0644 call 72571934
61: for (int i = 4; i < 9; i++)
02FD0649 inc dword ptr [ebp-14h]
02FD064C cmp dword ptr [ebp-14h],9
02FD0650 jl 02FD0496
97: }
98:
99: Console.WriteLine(loops);
02FD0656 mov ecx,dword ptr [ebp-10h]
02FD0659 call 72C583FC
100: Console.Write("Waiting...");
02FD065E mov ecx,dword ptr ds:[4142314h]
02FD0664 call 724C67F0
02FD0669 lea ecx,[ebp-28h]
02FD066C xor edx,edx
02FD066E call 72C57984
02FD0673 lea esp,[ebp-0Ch]
02FD0676 pop ebx
02FD0677 pop esi
02FD0678 pop edi
02FD0679 pop ebp
02FD067A ret












share|improve this question
















Below is a simple program that with a small change, makes a significant performance impact and I don't understand why.



What the program does is not really relevant, but it calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall. What I noticed as I was changing the code around was a quite large variance in performance.



The rows in question are the commented ones which are mathematically equivalent. Using the slow version makes the entire program take roughly twice as long as using the fast version.



int iterations = 0;

for (int i = 4; i < 9; i++)
{
Stopwatch s = Stopwatch.StartNew();

double ms = 1.0;
double mL = Math.Pow(100.0, i);
double uL = 1.0;
double us = 0.0;
double msmLInv = 1d / (ms + mL);

long collisions = 0;
while (!(uL < 0 && us <= 0 && uL <= us))
{
Debug.Assert(++iterations > 0);
++collisions;

double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;

//double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv; //fast
double vL = uL + (us - vs) / mL; //slow


Debug.Assert(Math.Abs(((2 * ms * us - uL * (ms - mL)) * msmLInv) - (uL + (us - vs) / mL)) < 0.001d); //checks equality between fast and slow
if (vs > 0)
{
++collisions;
vs = -vs;
}

us = vs;
uL = vL;
}

s.Stop();


Debug.Assert(collisions.ToString() == "314159265359".Substring(0, i + 1)); //check the correctness
Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
}

Debug.Assert(iterations == 174531180); //check that we dont skip loops

Console.Write("Waiting...");
Console.ReadKey();


My intuition says that because the fast version has 7 operations compared to 4 operations of the slow one, the slow one should be faster, but it is not.



I disassembled the program using .NET Reflector which shows that they are mostly equal, as expected, except for the part shown below. The code before and after an identical



//slow
ldloc.s uL
ldloc.2
ldloc.s us
ldloc.s vs
sub
mul
ldloc.3
div
add




//fast
ldc.r8 2
ldloc.2
mul
ldloc.s us
mul
ldloc.s uL
ldloc.2
ldloc.3
sub
mul
sub
ldloc.2
ldloc.3
add
div


This also shows that more code is executing with the fast version which also would lead me to expect it to be slower.



The only guess I have right now is that the slow version causes more cache misses, but I don't know how to measure that (a guide would be welcome). Other than that I am at a loss.





EDIT 1.
As per the request of @EricLippert here is the disassembly from the JIT for the inner while loop where the difference is.



EDIT 2.
Solved how to break in the release program and updated the disassembly so now there seems to be some difference. I got these results by running the release version, stopping the program in the same function with a ReadKey, attaching the debugger, making the program continue execution, breaking on the next row, going into disassembly window (ctrl+alt+d)



EDIT 3.
Change the code to an updated example base on all the suggestions.



//slow
78:
79: vs = (2 * mL * uL + us * (ms - mL)) / (ms + mL);
00C10530 call CA9AD013
00C10535 fdiv st,st(3)
00C10537 faddp st(2),st
80:
81: //double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
82: double vL = uL + ms * (us - vs) / mL; //slow
00C10539 fldz
00C1053B fcomip st,st(1)
00C1053D jp 00C10549
00C1053F jae 00C10549
00C10541 add ebx,1
00C10544 adc edi,0
00C10547 fchs
00C10549 fld st(1)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
00C1054B fldz
00C1054D fcomip st,st(3)
00C1054F fstp st(2)
00C10551 jp 00C10508
00C10553 jbe 00C10508
00C10555 fldz
00C10557 fcomip st,st(1)
00C10559 jp 00C10508
00C1055B jb 00C10508
00C1055D fxch st(1)
00C1055F fcomi st,st(1)
00C10561 jnp 00C10567
00C10563 fxch st(1)
00C10565 jmp 00C10508
00C10567 jbe 00C1056D
00C10569 fxch st(1)
00C1056B jmp 00C10508
00C1056D fstp st(1)
00C1056F fstp st(0)
00C10571 fstp st(0)
92: }
93:
94: s.Stop();
00C10573 mov ecx,esi
00C10575 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
00C1057A mov ecx,725B0994h
00C1057F call 00B930C8
00C10584 mov edx,eax
00C10586 mov eax,dword ptr [ebp-14h]
00C10589 mov dword ptr [edx+4],eax
00C1058C mov dword ptr [ebp-34h],edx
00C1058F mov ecx,725F3778h
00C10594 call 00B930C8
00C10599 mov dword ptr [ebp-38h],eax
00C1059C mov ecx,725F2C10h
00C105A1 call 00B930C8
00C105A6 mov dword ptr [ebp-3Ch],eax
00C105A9 mov ecx,esi
00C105AB call 71835820
00C105B0 push edx
00C105B1 push eax
00C105B2 push 0
00C105B4 push 2710h
00C105B9 call 736071A0
00C105BE mov dword ptr [ebp-48h],eax
00C105C1 mov dword ptr [ebp-44h],edx
00C105C4 fild qword ptr [ebp-48h]
00C105C7 fstp dword ptr [ebp-40h]
00C105CA fld dword ptr [ebp-40h]
00C105CD fdiv dword ptr ds:[0C10678h]
00C105D3 mov eax,dword ptr [ebp-38h]
00C105D6 fstp dword ptr [eax+4]
00C105D9 mov edx,dword ptr [ebp-38h]
00C105DC mov eax,dword ptr [ebp-3Ch]
00C105DF mov dword ptr [eax+4],ebx
00C105E2 mov dword ptr [eax+8],edi
00C105E5 mov esi,dword ptr [ebp-3Ch]
00C105E8 lea edi,[ebp-30h]
00C105EB xorps xmm0,xmm0
00C105EE movq mmword ptr [edi],xmm0
00C105F2 movq mmword ptr [edi+8],xmm0
00C105F7 push edx
00C105F8 push esi
00C105F9 lea ecx,[ebp-30h]
00C105FC mov edx,dword ptr [ebp-34h]
00C105FF call 724A2ED4
00C10604 lea eax,[ebp-30h]
00C10607 push dword ptr [eax+0Ch]
00C1060A push dword ptr [eax+8]
00C1060D push dword ptr [eax+4]
00C10610 push dword ptr [eax]
00C10612 mov edx,dword ptr ds:[3832310h]
00C10618 xor ecx,ecx
00C1061A call 72497A00
00C1061F mov ecx,eax
00C10621 call 72571934
61: for (int i = 4; i < 9; i++)
00C10626 inc dword ptr [ebp-14h]
00C10629 cmp dword ptr [ebp-14h],9
00C1062D jl 00C10496
97: }
98:
99: Console.WriteLine(loops);
00C10633 mov ecx,dword ptr [ebp-10h]
00C10636 call 72C583FC
100: Console.Write("Waiting...");
00C1063B mov ecx,dword ptr ds:[3832314h]
00C10641 call 724C67F0
00C10646 lea ecx,[ebp-20h]
00C10649 xor edx,edx
00C1064B call 72C57984
00C10650 lea esp,[ebp-0Ch]
00C10653 pop ebx
00C10654 pop esi
00C10655 pop edi
00C10656 pop ebp
00C10657 ret




//fast
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0550 or al,83h
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0552 ret
02FD0553 add dword ptr [ebx-3626FF29h],eax
02FD0559 fchs
02FD055B fxch st(1)
02FD055D fld st(0)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
02FD055F fldz
02FD0561 fcomip st,st(2)
02FD0563 fstp st(1)
02FD0565 jnp 02FD056B
02FD0567 fxch st(1)
02FD0569 jmp 02FD050B
02FD056B ja 02FD0571
02FD056D fxch st(1)
02FD056F jmp 02FD050B
02FD0571 fldz
02FD0573 fcomip st,st(2)
02FD0575 jnp 02FD057B
02FD0577 fxch st(1)
02FD0579 jmp 02FD050B
02FD057B jae 02FD0581
02FD057D fxch st(1)
02FD057F jmp 02FD050B
02FD0581 fcomi st,st(1)
02FD0583 jnp 02FD0589
02FD0585 fxch st(1)
02FD0587 jmp 02FD050B
02FD0589 jbe 02FD0592
02FD058B fxch st(1)
02FD058D jmp 02FD050B
02FD0592 fstp st(1)
02FD0594 fstp st(0)
92: }
93:
94: s.Stop();
02FD0596 mov ecx,esi
02FD0598 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
02FD059D mov ecx,725B0994h
02FD05A2 call 013830C8
02FD05A7 mov edx,eax
02FD05A9 mov eax,dword ptr [ebp-14h]
02FD05AC mov dword ptr [edx+4],eax
02FD05AF mov dword ptr [ebp-3Ch],edx
02FD05B2 mov ecx,725F3778h
02FD05B7 call 013830C8
02FD05BC mov dword ptr [ebp-40h],eax
02FD05BF mov ecx,725F2C10h
02FD05C4 call 013830C8
02FD05C9 mov dword ptr [ebp-44h],eax
02FD05CC mov ecx,esi
02FD05CE call 71835820
02FD05D3 push edx
02FD05D4 push eax
02FD05D5 push 0
02FD05D7 push 2710h
02FD05DC call 736071A0
02FD05E1 mov dword ptr [ebp-50h],eax
02FD05E4 mov dword ptr [ebp-4Ch],edx
02FD05E7 fild qword ptr [ebp-50h]
02FD05EA fstp dword ptr [ebp-48h]
02FD05ED fld dword ptr [ebp-48h]
02FD05F0 fdiv dword ptr ds:[2FD06A8h]
02FD05F6 mov eax,dword ptr [ebp-40h]
02FD05F9 fstp dword ptr [eax+4]
02FD05FC mov edx,dword ptr [ebp-40h]
02FD05FF mov eax,dword ptr [ebp-44h]
02FD0602 mov dword ptr [eax+4],ebx
02FD0605 mov dword ptr [eax+8],edi
02FD0608 mov esi,dword ptr [ebp-44h]
02FD060B lea edi,[ebp-38h]
02FD060E xorps xmm0,xmm0
02FD0611 movq mmword ptr [edi],xmm0
02FD0615 movq mmword ptr [edi+8],xmm0
02FD061A push edx
02FD061B push esi
02FD061C lea ecx,[ebp-38h]
02FD061F mov edx,dword ptr [ebp-3Ch]
02FD0622 call 724A2ED4
02FD0627 lea eax,[ebp-38h]
02FD062A push dword ptr [eax+0Ch]
02FD062D push dword ptr [eax+8]
02FD0630 push dword ptr [eax+4]
02FD0633 push dword ptr [eax]
02FD0635 mov edx,dword ptr ds:[4142310h]
02FD063B xor ecx,ecx
02FD063D call 72497A00
02FD0642 mov ecx,eax
02FD0644 call 72571934
61: for (int i = 4; i < 9; i++)
02FD0649 inc dword ptr [ebp-14h]
02FD064C cmp dword ptr [ebp-14h],9
02FD0650 jl 02FD0496
97: }
98:
99: Console.WriteLine(loops);
02FD0656 mov ecx,dword ptr [ebp-10h]
02FD0659 call 72C583FC
100: Console.Write("Waiting...");
02FD065E mov ecx,dword ptr ds:[4142314h]
02FD0664 call 724C67F0
02FD0669 lea ecx,[ebp-28h]
02FD066C xor edx,edx
02FD066E call 72C57984
02FD0673 lea esp,[ebp-0Ch]
02FD0676 pop ebx
02FD0677 pop esi
02FD0678 pop edi
02FD0679 pop ebp
02FD067A ret









c# performance






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 19 at 11:51









AndreyAkinshin

9,6782378133




9,6782378133










asked Jan 14 at 20:29









ImaplerImapler

1,079821




1,079821








  • 3





    I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

    – DanDan
    Jan 14 at 20:36








  • 1





    calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

    – GSerg
    Jan 14 at 20:41






  • 3





    This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

    – Eric Lippert
    Jan 14 at 21:06






  • 1





    My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

    – Eric Lippert
    Jan 14 at 21:19






  • 2





    It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

    – Eric Lippert
    Jan 14 at 21:25














  • 3





    I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

    – DanDan
    Jan 14 at 20:36








  • 1





    calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

    – GSerg
    Jan 14 at 20:41






  • 3





    This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

    – Eric Lippert
    Jan 14 at 21:06






  • 1





    My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

    – Eric Lippert
    Jan 14 at 21:19






  • 2





    It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

    – Eric Lippert
    Jan 14 at 21:25








3




3





I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

– DanDan
Jan 14 at 20:36







I didn't check this in detail, but it all looks somewhat complicated, also the condition in the loop header. Are you entirely sure that there isn't a minor logical error causing the inner loop to run a different amount of times? You could insert a dumb counter variable in the inner loop an count the number of runs before/after just to be safe.

– DanDan
Jan 14 at 20:36






1




1





calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

– GSerg
Jan 14 at 20:41





calculates PI in a very convoluted way by counting collisions between two object of different mass and a wall - youtube.com/watch?v=HEfHFsfGXjs?

– GSerg
Jan 14 at 20:41




3




3





This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

– Eric Lippert
Jan 14 at 21:06





This is a very odd result! I cannot see any way that it can be cache misses. A better guess would be register scheduling in the FP registers. Are you doing your measurements in a release build and not in the debugger? Measuring performance in the debugger can cause register scheduling to be deoptimized.

– Eric Lippert
Jan 14 at 21:06




1




1





My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

– Eric Lippert
Jan 14 at 21:19





My best guess is register scheduling by the jitter then; it would be interesting to look at the generated code for both versions and see how they differ.

– Eric Lippert
Jan 14 at 21:19




2




2





It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

– Eric Lippert
Jan 14 at 21:25





It is a little tricky to get the generated code. Usually what I do in these situations is start the program outside of the debugger, but put a "readline" or some such thing that makes it pause just as it is about to finish; it has to be after all the code has jitted. Then while the program is paused, attach a debugger and force execution to step back into the method, and then look at the disassembly in the debugger. That way you definitely see the method as it was jitted without the debugger attached.

– Eric Lippert
Jan 14 at 21:25












2 Answers
2






active

oldest

votes


















9














I think the reason is CPU instruction pipelining. your slow equation depends on vs, that means vs must be calculated first, then vl is calculated.



but in your fast equation, more instructions can be pipelined as vs and vl can be calculated at same time because they don't depend on each other.



Please don't confuse this with multi threading. Instruction pipelining is some thing implemented at very low hardware level and tries to exploit as many CPU modules as possible at the same time to achieve maximum instruction throughput.






share|improve this answer


























  • Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

    – Imapler
    Jan 14 at 22:24











  • instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

    – M.kazem Akhgary
    Jan 14 at 22:30






  • 1





    @Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

    – M.kazem Akhgary
    Jan 14 at 22:33











  • @Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

    – M.kazem Akhgary
    Jan 14 at 22:35













  • Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

    – Imapler
    Jan 14 at 22:37



















5














You calculations are not equal



double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow


Example: I miss vs in the fast version



I expect your while loop doing more iterations because of this?






share|improve this answer


























  • Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

    – Imapler
    Jan 14 at 21:03











  • And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

    – Eric Lippert
    Jan 14 at 21:10






  • 2





    @EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

    – Imapler
    Jan 14 at 21:14






  • 1





    They are mathematically equal, you can prove that just by algebra operations.

    – Antonín Lejsek
    Jan 15 at 1:45











  • I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

    – Aldert
    Jan 17 at 19:44











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54188731%2fsurprisingly-different-performance-of-simple-c-sharp-program%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









9














I think the reason is CPU instruction pipelining. your slow equation depends on vs, that means vs must be calculated first, then vl is calculated.



but in your fast equation, more instructions can be pipelined as vs and vl can be calculated at same time because they don't depend on each other.



Please don't confuse this with multi threading. Instruction pipelining is some thing implemented at very low hardware level and tries to exploit as many CPU modules as possible at the same time to achieve maximum instruction throughput.






share|improve this answer


























  • Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

    – Imapler
    Jan 14 at 22:24











  • instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

    – M.kazem Akhgary
    Jan 14 at 22:30






  • 1





    @Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

    – M.kazem Akhgary
    Jan 14 at 22:33











  • @Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

    – M.kazem Akhgary
    Jan 14 at 22:35













  • Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

    – Imapler
    Jan 14 at 22:37
















9














I think the reason is CPU instruction pipelining. your slow equation depends on vs, that means vs must be calculated first, then vl is calculated.



but in your fast equation, more instructions can be pipelined as vs and vl can be calculated at same time because they don't depend on each other.



Please don't confuse this with multi threading. Instruction pipelining is some thing implemented at very low hardware level and tries to exploit as many CPU modules as possible at the same time to achieve maximum instruction throughput.






share|improve this answer


























  • Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

    – Imapler
    Jan 14 at 22:24











  • instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

    – M.kazem Akhgary
    Jan 14 at 22:30






  • 1





    @Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

    – M.kazem Akhgary
    Jan 14 at 22:33











  • @Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

    – M.kazem Akhgary
    Jan 14 at 22:35













  • Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

    – Imapler
    Jan 14 at 22:37














9












9








9







I think the reason is CPU instruction pipelining. your slow equation depends on vs, that means vs must be calculated first, then vl is calculated.



but in your fast equation, more instructions can be pipelined as vs and vl can be calculated at same time because they don't depend on each other.



Please don't confuse this with multi threading. Instruction pipelining is some thing implemented at very low hardware level and tries to exploit as many CPU modules as possible at the same time to achieve maximum instruction throughput.






share|improve this answer















I think the reason is CPU instruction pipelining. your slow equation depends on vs, that means vs must be calculated first, then vl is calculated.



but in your fast equation, more instructions can be pipelined as vs and vl can be calculated at same time because they don't depend on each other.



Please don't confuse this with multi threading. Instruction pipelining is some thing implemented at very low hardware level and tries to exploit as many CPU modules as possible at the same time to achieve maximum instruction throughput.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 14 at 23:29

























answered Jan 14 at 22:03









M.kazem AkhgaryM.kazem Akhgary

12k53372




12k53372













  • Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

    – Imapler
    Jan 14 at 22:24











  • instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

    – M.kazem Akhgary
    Jan 14 at 22:30






  • 1





    @Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

    – M.kazem Akhgary
    Jan 14 at 22:33











  • @Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

    – M.kazem Akhgary
    Jan 14 at 22:35













  • Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

    – Imapler
    Jan 14 at 22:37



















  • Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

    – Imapler
    Jan 14 at 22:24











  • instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

    – M.kazem Akhgary
    Jan 14 at 22:30






  • 1





    @Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

    – M.kazem Akhgary
    Jan 14 at 22:33











  • @Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

    – M.kazem Akhgary
    Jan 14 at 22:35













  • Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

    – Imapler
    Jan 14 at 22:37

















Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

– Imapler
Jan 14 at 22:24





Interesting idea, i hadnt thought about pipelining and i agree that there will be more divisions in the slow version which are expesive but it seems unlikely to make that large difference. I did a few simple test by double o = 1d/us and o=1d/vs seperatly and then used them in the if statement to prevent to from being removed by optimization. This did not make any significant difference on the runtime though. Do you have any other ideas how to test that?

– Imapler
Jan 14 at 22:24













instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

– M.kazem Akhgary
Jan 14 at 22:30





instruction pipelining is something implemented at hardware level. so in order to test that you should run this code on a processor without pipelining feature and compare slow and fast versions on that computer.

– M.kazem Akhgary
Jan 14 at 22:30




1




1





@Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

– M.kazem Akhgary
Jan 14 at 22:33





@Imapler for division i did not notice one thing important. your fast code has literally no division inside the loop.

– M.kazem Akhgary
Jan 14 at 22:33













@Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

– M.kazem Akhgary
Jan 14 at 22:35







@Imapler ok, division is not the reason. both equation divisions are optimized away outside of the loop. i guess only thing that remains is instruction pipelining.

– M.kazem Akhgary
Jan 14 at 22:35















Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

– Imapler
Jan 14 at 22:37





Good catch however the compiler does not move it outside the loop. I precalculated the 1d/(ms+mL)and how it takes 2/3 of the time.

– Imapler
Jan 14 at 22:37













5














You calculations are not equal



double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow


Example: I miss vs in the fast version



I expect your while loop doing more iterations because of this?






share|improve this answer


























  • Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

    – Imapler
    Jan 14 at 21:03











  • And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

    – Eric Lippert
    Jan 14 at 21:10






  • 2





    @EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

    – Imapler
    Jan 14 at 21:14






  • 1





    They are mathematically equal, you can prove that just by algebra operations.

    – Antonín Lejsek
    Jan 15 at 1:45











  • I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

    – Aldert
    Jan 17 at 19:44
















5














You calculations are not equal



double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow


Example: I miss vs in the fast version



I expect your while loop doing more iterations because of this?






share|improve this answer


























  • Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

    – Imapler
    Jan 14 at 21:03











  • And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

    – Eric Lippert
    Jan 14 at 21:10






  • 2





    @EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

    – Imapler
    Jan 14 at 21:14






  • 1





    They are mathematically equal, you can prove that just by algebra operations.

    – Antonín Lejsek
    Jan 15 at 1:45











  • I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

    – Aldert
    Jan 17 at 19:44














5












5








5







You calculations are not equal



double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow


Example: I miss vs in the fast version



I expect your while loop doing more iterations because of this?






share|improve this answer















You calculations are not equal



double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow


Example: I miss vs in the fast version



I expect your while loop doing more iterations because of this?







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 14 at 20:50









AdrianHHH

11.3k124067




11.3k124067










answered Jan 14 at 20:42









AldertAldert

1,3681212




1,3681212













  • Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

    – Imapler
    Jan 14 at 21:03











  • And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

    – Eric Lippert
    Jan 14 at 21:10






  • 2





    @EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

    – Imapler
    Jan 14 at 21:14






  • 1





    They are mathematically equal, you can prove that just by algebra operations.

    – Antonín Lejsek
    Jan 15 at 1:45











  • I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

    – Aldert
    Jan 17 at 19:44



















  • Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

    – Imapler
    Jan 14 at 21:03











  • And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

    – Eric Lippert
    Jan 14 at 21:10






  • 2





    @EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

    – Imapler
    Jan 14 at 21:14






  • 1





    They are mathematically equal, you can prove that just by algebra operations.

    – Antonín Lejsek
    Jan 15 at 1:45











  • I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

    – Aldert
    Jan 17 at 19:44

















Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

– Imapler
Jan 14 at 21:03





Good thought, i ran the experiment and both variants run 174531180 iterations so no difference there. The equations looks quite different but they are derived using the conservation of momentum and energy. The difference is that the slow one uses the result from the first calculations while the fast recalculates everything.

– Imapler
Jan 14 at 21:03













And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

– Eric Lippert
Jan 14 at 21:10





And if you do both computations and compare the two values, they are always within a very small fraction of each other. They appear to be equivalent computations.

– Eric Lippert
Jan 14 at 21:10




2




2





@EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

– Imapler
Jan 14 at 21:14





@EricLippert Math.Abs(vL1 - vL2) > 0.001f is never true

– Imapler
Jan 14 at 21:14




1




1





They are mathematically equal, you can prove that just by algebra operations.

– Antonín Lejsek
Jan 15 at 1:45





They are mathematically equal, you can prove that just by algebra operations.

– Antonín Lejsek
Jan 15 at 1:45













I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

– Aldert
Jan 17 at 19:44





I double checked with equation solver fast and slow are not equal: vL = uL + (2 * ms * (us - uL)) / (ms + mL); //slow vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast

– Aldert
Jan 17 at 19:44


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54188731%2fsurprisingly-different-performance-of-simple-c-sharp-program%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

flock() on closed filehandle LOCK_FILE at /usr/bin/apt-mirror

Mangá

 ⁒  ․,‪⁊‑⁙ ⁖, ⁇‒※‌, †,⁖‗‌⁝    ‾‸⁘,‖⁔⁣,⁂‾
”‑,‥–,‬ ,⁀‹⁋‴⁑ ‒ ,‴⁋”‼ ⁨,‷⁔„ ‰′,‐‚ ‥‡‎“‷⁃⁨⁅⁣,⁔
⁇‘⁔⁡⁏⁌⁡‿‶‏⁨ ⁣⁕⁖⁨⁩⁥‽⁀  ‴‬⁜‟ ⁃‣‧⁕‮ …‍⁨‴ ⁩,⁚⁖‫ ,‵ ⁀,‮⁝‣‣ ⁑  ⁂– ․, ‾‽ ‏⁁“⁗‸ ‾… ‹‡⁌⁎‸‘ ‡⁏⁌‪ ‵⁛ ‎⁨ ―⁦⁤⁄⁕