Performance impact of array writes is much larger than expected
I've stumbled upon this effect while debugging an application - see the repro code below.
It gives me the following results:

Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs

As far as I understand, array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). Does anyone understand what effects are at play here?
UPD: extra test added; Perf test 0 shows the results as expected, Perf test 1 shows the performance anomaly.
using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var data = InitData();

        TestPerf0(data, false);
        TestPerf0(data, true);

        TestPerf1(data, false);
        TestPerf1(data, true);

        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    private static string[] InitData()
    {
        var watch = Stopwatch.StartNew();

        var data = new string[100_000];
        var maxString = 10_000;

        for (int i = 0; i < data.Length; i++)
        {
            data[i] = new string('-', maxString);
        }

        watch.Stop();
        Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");

        return data;
    }

    private static void TestPerf1(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[char.MaxValue];
        int tmp = 0;
        for (var j = 0; ; j++)
        {
            var allEmpty = true;

            for (var i = 0; i < vals.Length; i++)
            {
                var val = vals[i];

                if (j < val.Length)
                {
                    allEmpty = false;

                    var ch = val[j];
                    var count = counters[ch];
                    tmp ^= count;
                    if (testStore)
                        counters[ch] = count + 1;
                }
            }

            if (allEmpty)
                break;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }

    private static void TestPerf0(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[65536];
        int tmp = 0;
        for (var i = 0; i < 1_000_000_000; i++)
        {
            var j = i % counters.Length;
            var count = counters[j];
            tmp ^= count;
            if (testStore)
                counters[j] = count + 1;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }
}
c# .net performance
asked Nov 9 at 4:24, edited Nov 11 at 3:40 – user626528
Not sure what your steps are. Can you be more specific about what you did to prevent the compiler from optimizing your code? Can you return (or better, output) the hash of tmp and check the time again? – Alex, Nov 9 at 4:35
@Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and the few lines below it, and see the real difference when the code is not there. – user626528, Nov 9 at 4:38
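As an aside, a more conventional sink than tmp.GetHashCode() is a write to a volatile field, since the JIT may not eliminate volatile stores or the computation feeding them. A minimal sketch (the Sink class is illustrative, not from the original code):

static class Sink
{
    // A volatile store is an observable side effect, so the JIT cannot
    // discard the computation that produced the value being stored.
    private static volatile int _last;
    public static void Consume(int value) => _last = value;
}

// Usage inside the benchmark, in place of tmp.GetHashCode():
// Sink.Consume(tmp);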
When you say you've "stumbled upon this effect when debugging an application", do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger? – Michael Burr, Nov 9 at 4:56

@user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with that. My best guess at the moment would be cache flickering or less performant instructions overall. – Alex, Nov 9 at 5:02

@BrianWhite: This should be a symmetrical issue. If the GC kicks in, it should do so in both cases (for the PerfTest1 function), no matter whether testStore is true or false. – Alex, Nov 9 at 7:10
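On Michael Burr's point: running under a debugger, or timing a Debug build, can disable JIT optimizations and distort numbers like these. A minimal guard one could add to the repro (the BuildGuard class and its messages are illustrative, not part of the original program):

using System;
using System.Diagnostics;

static class BuildGuard
{
    // Call once at startup: warns when timings are likely to be distorted.
    public static void WarnIfUnoptimized()
    {
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached; the JIT may disable optimizations.");
#if DEBUG
        Console.WriteLine("Warning: DEBUG build; time a Release build instead.");
#endif
    }
}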
1 Answer
After testing your code for quite some time, my best guess is, as already said in the comments, that you experience a lot of cache misses with your current solution. The line:

if (testStore)
    counters[ch] = count + 1;

might force the CPU to pull a whole new cache line into the cache, displacing the current content. There might also be some problems with branch prediction in this scenario. This is highly hardware dependent, and I'm not aware of a really good way to test it in a managed language (it's also quite hard in natively compiled languages, even where the hardware is fixed and well known).
Going through the disassembly, you can clearly see that the store also introduces a whole bunch of new instructions, which might aggravate the problems mentioned above.
Overall, I'd advise you to rewrite the complete algorithm, as there are better places to improve performance than picking at this one little assignment. These are the optimizations I'd suggest (they also improve readability):
- Invert your i and j loops. This will remove the allEmpty variable completely.
- Cast ch to int with var ch = (int) val[j]; because you ALWAYS use it as an index.
- Think about why this store might be a problem at all. You introduce a new instruction, and any instruction comes at a cost. If this is really the primary "hot spot" of your code, you can start to think about better solutions (remember: "premature optimization is the root of all evil").
- As the name testStore suggests, this is a test setting - is it important at all? Just remove it.
EDIT: Why did I suggest inverting the loops? With this little rearrangement of the code:

foreach (var val in vals)
{
    foreach (int ch in val)
    {
        var count = counters[ch];
        tmp ^= count;
        if (testStore)
            counters[ch] = count + 1;
    }
}
I went from the original runtimes down to runtimes orders of magnitude lower (timing screenshots not reproduced here).
Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear: all optimizations are disabled in the settings). If there are special reasons not to do this, you should tell us more about the context in which this code will be used.
EDIT2: Toward the in-depth answer. My best explanation for why this problem occurs is that you cross-reference your cache lines. In the lines:

for (var i = 0; i < vals.Length; i++)
{
    var val = vals[i];

you load a really massive dataset, far bigger than a cache line itself. So it will most likely need to be loaded fresh from memory into a new cache line on every iteration, displacing the old content. This is also known as "cache thrashing", if I remember correctly. Thanks to @mjwills for pointing this out in his comment.
In my suggested solution, on the other hand, the content of a cache line can stay alive as long as the inner loop has not exceeded its boundaries, which happens a lot less often with this direction of memory access.
This is the closest explanation for why my code runs that much faster, and it supports the assumption that you have serious caching problems with your code.
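For illustration, a minimal standalone sketch of the access-pattern effect described above (the CacheDemo class and sizes are illustrative, not from the original answer). C# multidimensional arrays are stored in row-major order, so walking rows touches consecutive memory, while walking columns touches a different cache line on almost every access:

using System;
using System.Diagnostics;

class CacheDemo
{
    static void Main()
    {
        const int n = 4096;                 // 4096 x 4096 ints = 64 MB, far larger than any cache
        var grid = new int[n, n];
        long sum = 0;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)         // row-major: sequential memory access
            for (int j = 0; j < n; j++)
                sum += grid[i, j];
        Console.WriteLine($"row-major:    {sw.Elapsed.TotalSeconds:f2} secs");

        sw.Restart();
        for (int j = 0; j < n; j++)         // column-major: stride of n ints per step
            for (int i = 0; i < n; i++)
                sum += grid[i, j];
        Console.WriteLine($"column-major: {sw.Elapsed.TotalSeconds:f2} secs");

        Console.WriteLine(sum);             // sink so the loops aren't optimized away
    }
}

On typical hardware the column-major pass is noticeably slower, mirroring the benefit the inverted loops gain in the rewrite above.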
"might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
– user626528
Nov 9 at 6:22
An in-depth answer is beyond impossible to give when the exact hardware and other side effects are unknown. I don't know anything about whatever "good reasons" there might be, but please see my edit regarding this. – Alex, Nov 9 at 6:42
Nice analysis @Alex. – mjwills, Nov 9 at 6:44
1) This algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of the question. 2) Cache thrashing should happen independently of what we're doing - reading or writing - shouldn't it? – user626528, Nov 9 at 15:21
1) OK, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (e.g. if (testStore) dummy = counters[ch] + 1;). I would still assume that the "last" access to counters (regardless of its direction) hinders the CPU/runtime/compiler from using better caching strategies or pre-caching sets. Without more context it's hard to tell or to make better suggestions to solve your problem. – Alex, Nov 10 at 1:23