Performance impact of array writes is much larger than expected









up vote
1
down vote

favorite
1












I've stumbled upon this effect when debugging an application - see the repro code below.



It gives me the following results:



Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs



As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?



UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.



class Program

static void Main(string args)

var data = InitData();

TestPerf0(data, false);
TestPerf0(data, true);

TestPerf1(data, false);
TestPerf1(data, true);

if (Debugger.IsAttached)
Console.ReadKey();


private static string InitData()

var watch = Stopwatch.StartNew();

var data = new string[100_000];
var maxString = 10_000;

for (int i = 0; i < data.Length; i++)

data[i] = new string('-', maxString);


watch.Stop();
Console.WriteLine($"Data init, count: data.Length:n0 x maxString:n0, watch.Elapsed.TotalSeconds secs");

return data;


private static void TestPerf1(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[char.MaxValue];
int tmp = 0;

for (var j = 0; ; j++)

var allEmpty = true;

for (var i = 0; i < vals.Length; i++)

var val = vals[i];

if (j < val.Length)

allEmpty = false;

var ch = val[j];
var count = counters[ch];
tmp ^= count;

if (testStore)
counters[ch] = count + 1;



if (allEmpty)
break;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 1 (testStore): watch.Elapsed.TotalSeconds secs");


private static void TestPerf0(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[65536];
int tmp = 0;

for (var i = 0; i < 1_000_000_000; i++)

var j = i % counters.Length;
var count = counters[j];
tmp ^= count;

if (testStore)
counters[j] = count + 1;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 0 (testStore): watch.Elapsed.TotalSeconds secs");











share|improve this question























  • Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
    – Alex
    Nov 9 at 4:35










  • @Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
    – user626528
    Nov 9 at 4:38










  • When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
    – Michael Burr
    Nov 9 at 4:56






  • 3




    @user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
    – Alex
    Nov 9 at 5:02







  • 2




    @BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
    – Alex
    Nov 9 at 7:10














up vote
1
down vote

favorite
1












I've stumbled upon this effect when debugging an application - see the repro code below.



It gives me the following results:



Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs



As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?



UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.



class Program

static void Main(string args)

var data = InitData();

TestPerf0(data, false);
TestPerf0(data, true);

TestPerf1(data, false);
TestPerf1(data, true);

if (Debugger.IsAttached)
Console.ReadKey();


private static string InitData()

var watch = Stopwatch.StartNew();

var data = new string[100_000];
var maxString = 10_000;

for (int i = 0; i < data.Length; i++)

data[i] = new string('-', maxString);


watch.Stop();
Console.WriteLine($"Data init, count: data.Length:n0 x maxString:n0, watch.Elapsed.TotalSeconds secs");

return data;


private static void TestPerf1(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[char.MaxValue];
int tmp = 0;

for (var j = 0; ; j++)

var allEmpty = true;

for (var i = 0; i < vals.Length; i++)

var val = vals[i];

if (j < val.Length)

allEmpty = false;

var ch = val[j];
var count = counters[ch];
tmp ^= count;

if (testStore)
counters[ch] = count + 1;



if (allEmpty)
break;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 1 (testStore): watch.Elapsed.TotalSeconds secs");


private static void TestPerf0(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[65536];
int tmp = 0;

for (var i = 0; i < 1_000_000_000; i++)

var j = i % counters.Length;
var count = counters[j];
tmp ^= count;

if (testStore)
counters[j] = count + 1;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 0 (testStore): watch.Elapsed.TotalSeconds secs");











share|improve this question























  • Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
    – Alex
    Nov 9 at 4:35










  • @Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
    – user626528
    Nov 9 at 4:38










  • When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
    – Michael Burr
    Nov 9 at 4:56






  • 3




    @user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
    – Alex
    Nov 9 at 5:02







  • 2




    @BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
    – Alex
    Nov 9 at 7:10












up vote
1
down vote

favorite
1









up vote
1
down vote

favorite
1






1





I've stumbled upon this effect when debugging an application - see the repro code below.



It gives me the following results:



Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs



As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?



UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.



class Program

static void Main(string args)

var data = InitData();

TestPerf0(data, false);
TestPerf0(data, true);

TestPerf1(data, false);
TestPerf1(data, true);

if (Debugger.IsAttached)
Console.ReadKey();


private static string InitData()

var watch = Stopwatch.StartNew();

var data = new string[100_000];
var maxString = 10_000;

for (int i = 0; i < data.Length; i++)

data[i] = new string('-', maxString);


watch.Stop();
Console.WriteLine($"Data init, count: data.Length:n0 x maxString:n0, watch.Elapsed.TotalSeconds secs");

return data;


private static void TestPerf1(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[char.MaxValue];
int tmp = 0;

for (var j = 0; ; j++)

var allEmpty = true;

for (var i = 0; i < vals.Length; i++)

var val = vals[i];

if (j < val.Length)

allEmpty = false;

var ch = val[j];
var count = counters[ch];
tmp ^= count;

if (testStore)
counters[ch] = count + 1;



if (allEmpty)
break;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 1 (testStore): watch.Elapsed.TotalSeconds secs");


private static void TestPerf0(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[65536];
int tmp = 0;

for (var i = 0; i < 1_000_000_000; i++)

var j = i % counters.Length;
var count = counters[j];
tmp ^= count;

if (testStore)
counters[j] = count + 1;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 0 (testStore): watch.Elapsed.TotalSeconds secs");











share|improve this question















I've stumbled upon this effect when debugging an application - see the repro code below.



It gives me the following results:



Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs



As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?



UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.



class Program

static void Main(string args)

var data = InitData();

TestPerf0(data, false);
TestPerf0(data, true);

TestPerf1(data, false);
TestPerf1(data, true);

if (Debugger.IsAttached)
Console.ReadKey();


private static string InitData()

var watch = Stopwatch.StartNew();

var data = new string[100_000];
var maxString = 10_000;

for (int i = 0; i < data.Length; i++)

data[i] = new string('-', maxString);


watch.Stop();
Console.WriteLine($"Data init, count: data.Length:n0 x maxString:n0, watch.Elapsed.TotalSeconds secs");

return data;


private static void TestPerf1(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[char.MaxValue];
int tmp = 0;

for (var j = 0; ; j++)

var allEmpty = true;

for (var i = 0; i < vals.Length; i++)

var val = vals[i];

if (j < val.Length)

allEmpty = false;

var ch = val[j];
var count = counters[ch];
tmp ^= count;

if (testStore)
counters[ch] = count + 1;



if (allEmpty)
break;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 1 (testStore): watch.Elapsed.TotalSeconds secs");


private static void TestPerf0(string vals, bool testStore)

var watch = Stopwatch.StartNew();

var counters = new int[65536];
int tmp = 0;

for (var i = 0; i < 1_000_000_000; i++)

var j = i % counters.Length;
var count = counters[j];
tmp ^= count;

if (testStore)
counters[j] = count + 1;


// prevent the compiler from optimizing away our computations
tmp.GetHashCode();

watch.Stop();
Console.WriteLine($"Perf test 0 (testStore): watch.Elapsed.TotalSeconds secs");








c# .net performance






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 11 at 3:40

























asked Nov 9 at 4:24









user626528

6,2842057117




6,2842057117











  • Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
    – Alex
    Nov 9 at 4:35










  • @Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
    – user626528
    Nov 9 at 4:38










  • When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
    – Michael Burr
    Nov 9 at 4:56






  • 3




    @user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
    – Alex
    Nov 9 at 5:02







  • 2




    @BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
    – Alex
    Nov 9 at 7:10
















  • Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
    – Alex
    Nov 9 at 4:35










  • @Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
    – user626528
    Nov 9 at 4:38










  • When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
    – Michael Burr
    Nov 9 at 4:56






  • 3




    @user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
    – Alex
    Nov 9 at 5:02







  • 2




    @BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
    – Alex
    Nov 9 at 7:10















Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
– Alex
Nov 9 at 4:35




Not sure what your steps are. Can you be more specific what you did to prevent the compiler from optimizing your code? Can you return (or better output) the hash of tmp and check the time again?
– Alex
Nov 9 at 4:35












@Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
– user626528
Nov 9 at 4:38




@Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try to comment out the line var ch = val[j]; and a few lines below, and see the real difference when the code is not there.
– user626528
Nov 9 at 4:38












When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
– Michael Burr
Nov 9 at 4:56




When you say you've "stumbled upon this effect when debugging an application" do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
– Michael Burr
Nov 9 at 4:56




3




3




@user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
– Alex
Nov 9 at 5:02





@user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with it. My best guess at the moment would be cache flickering or less performant instructions at all.
– Alex
Nov 9 at 5:02





2




2




@BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
– Alex
Nov 9 at 7:10




@BrianWhite: This should be a symmetrical issue. If GC kicks in it should be in both cases (for the PerfTest1 function) no matter if testStore is true or false.
– Alex
Nov 9 at 7:10












1 Answer
1






active

oldest

votes

















up vote
6
down vote













After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).



After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.



enter image description here



Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):



  1. Invert your i and j loop. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.

  3. Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").

  4. As this is a "test setting" which the name suggests, is this important at all? Just remove it.


EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:



foreach (var val in vals)

foreach (int ch in val)

var count = counters[ch];
tmp ^= count;
if (testStore)

counters[ch] = count + 1;





I come from runtimes like this:



enter image description here



to runtimes like this:



enter image description here



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.




EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];


you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).



This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.






share|improve this answer






















  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53219858%2fperformance-impact-of-array-writes-is-much-larger-than-expected%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
6
down vote













After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).



After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.



enter image description here



Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):



  1. Invert your i and j loop. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.

  3. Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").

  4. As this is a "test setting" which the name suggests, is this important at all? Just remove it.


EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:



foreach (var val in vals)

foreach (int ch in val)

var count = counters[ch];
tmp ^= count;
if (testStore)

counters[ch] = count + 1;





I come from runtimes like this:



enter image description here



to runtimes like this:



enter image description here



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.




EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];


you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).



This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.






share|improve this answer






















  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23














up vote
6
down vote













After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).



After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.



enter image description here



Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):



  1. Invert your i and j loop. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.

  3. Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").

  4. As this is a "test setting" which the name suggests, is this important at all? Just remove it.


EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:



foreach (var val in vals)

foreach (int ch in val)

var count = counters[ch];
tmp ^= count;
if (testStore)

counters[ch] = count + 1;





I come from runtimes like this:



enter image description here



to runtimes like this:



enter image description here



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.




EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];


you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).



This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.






share|improve this answer






















  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23












up vote
6
down vote










up vote
6
down vote









After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).



After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.



enter image description here



Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):



  1. Invert your i and j loop. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.

  3. Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").

  4. As this is a "test setting" which the name suggests, is this important at all? Just remove it.


EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:



foreach (var val in vals)

foreach (int ch in val)

var count = counters[ch];
tmp ^= count;
if (testStore)

counters[ch] = count + 1;





I come from runtimes like this:



enter image description here



to runtimes like this:



enter image description here



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.




EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];


you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).



This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.






share|improve this answer














After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).



After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.



enter image description here



Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):



  1. Invert your i and j loop. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.

  3. Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").

  4. As this is a "test setting" which the name suggests, is this important at all? Just remove it.


EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:



foreach (var val in vals)

foreach (int ch in val)

var count = counters[ch];
tmp ^= count;
if (testStore)

counters[ch] = count + 1;





I come from runtimes like this:



enter image description here



to runtimes like this:



enter image description here



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.




EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];


you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).



This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 9 at 7:01

























answered Nov 9 at 6:13









Alex

8201327




8201327











  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23
















  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23















"might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
– user626528
Nov 9 at 6:22




"might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
– user626528
Nov 9 at 6:22




1




1




In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
– Alex
Nov 9 at 6:42




In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
– Alex
Nov 9 at 6:42




1




1




Nice analysis @Alex.
– mjwills
Nov 9 at 6:44





Nice analysis @Alex.
– mjwills
Nov 9 at 6:44













1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
– user626528
Nov 9 at 15:21





1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
– user626528
Nov 9 at 15:21













1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
– Alex
Nov 10 at 1:23




1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
– Alex
Nov 10 at 1:23

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53219858%2fperformance-impact-of-array-writes-is-much-larger-than-expected%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

ữḛḳṊẴ ẋ,Ẩṙ,ỹḛẪẠứụỿṞṦ,Ṉẍừ,ứ Ị,Ḵ,ṏ ṇỪḎḰṰọửḊ ṾḨḮữẑỶṑỗḮṣṉẃ Ữẩụ,ṓ,ḹẕḪḫỞṿḭ ỒṱṨẁṋṜ ḅẈ ṉ ứṀḱṑỒḵ,ḏ,ḊḖỹẊ Ẻḷổ,ṥ ẔḲẪụḣể Ṱ ḭỏựẶ Ồ Ṩ,ẂḿṡḾồ ỗṗṡịṞẤḵṽẃ ṸḒẄẘ,ủẞẵṦṟầṓế

⃀⃉⃄⃅⃍,⃂₼₡₰⃉₡₿₢⃉₣⃄₯⃊₮₼₹₱₦₷⃄₪₼₶₳₫⃍₽ ₫₪₦⃆₠₥⃁₸₴₷⃊₹⃅⃈₰⃁₫ ⃎⃍₩₣₷ ₻₮⃊⃀⃄⃉₯,⃏⃊,₦⃅₪,₼⃀₾₧₷₾ ₻ ₸₡ ₾,₭⃈₴⃋,€⃁,₩ ₺⃌⃍⃁₱⃋⃋₨⃊⃁⃃₼,⃎,₱⃍₲₶₡ ⃍⃅₶₨₭,⃉₭₾₡₻⃀ ₼₹⃅₹,₻₭ ⃌