Performance impact of array writes is much larger than expected





















I've stumbled upon this effect when debugging an application - see the repro code below.



It gives me the following results:



Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs



As far as I understand, array store operations shouldn't normally have such a drastic performance impact (32 vs 217 seconds). Does anyone understand what effects are at play here?



UPD: extra test added; Perf test 0 shows the expected results, Perf test 1 shows the performance anomaly.



using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var data = InitData();

        TestPerf0(data, false);
        TestPerf0(data, true);

        TestPerf1(data, false);
        TestPerf1(data, true);

        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    private static string[] InitData()
    {
        var watch = Stopwatch.StartNew();

        var data = new string[100_000];
        var maxString = 10_000;

        for (int i = 0; i < data.Length; i++)
        {
            data[i] = new string('-', maxString);
        }

        watch.Stop();
        Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");

        return data;
    }

    private static void TestPerf1(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[char.MaxValue];
        int tmp = 0;

        for (var j = 0; ; j++)
        {
            var allEmpty = true;

            for (var i = 0; i < vals.Length; i++)
            {
                var val = vals[i];

                if (j < val.Length)
                {
                    allEmpty = false;

                    var ch = val[j];
                    var count = counters[ch];
                    tmp ^= count;

                    if (testStore)
                        counters[ch] = count + 1;
                }
            }

            if (allEmpty)
                break;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }

    private static void TestPerf0(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[65536];
        int tmp = 0;

        for (var i = 0; i < 1_000_000_000; i++)
        {
            var j = i % counters.Length;
            var count = counters[j];
            tmp ^= count;

            if (testStore)
                counters[j] = count + 1;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }
}











c# .net performance






asked Nov 9 at 4:24 by user626528, edited Nov 11 at 3:40











  • Not sure what your steps are. Can you be more specific about what you did to prevent the compiler from optimizing your code? Can you return (or, better, output) the hash of tmp and check the time again?
    – Alex
    Nov 9 at 4:35

  • @Alex tmp.GetHashCode() is enough to prevent the compiler and runtime from optimizing away the code that computes the value of tmp. Actually, you can try commenting out the line var ch = val[j]; and the few lines below it, and see the real difference when that code is not there.
    – user626528
    Nov 9 at 4:38

  • When you say you've "stumbled upon this effect when debugging an application", do you mean that you see the time difference when the program is run under the debugger and do not see it when run outside the debugger?
    – Michael Burr
    Nov 9 at 4:56

  • @user626528: I tested your code with optimizations turned off in the settings (just to be sure) and I can confirm it has nothing to do with them. My best guess at the moment would be cache thrashing or less performant instructions overall.
    – Alex
    Nov 9 at 5:02

  • @BrianWhite: This should be a symmetrical issue. If the GC kicks in, it should kick in in both cases (for the TestPerf1 function), no matter whether testStore is true or false.
    – Alex
    Nov 9 at 7:10

























1 Answer

After testing your code for quite some time, my best guess is, as already said in the comments, that you are experiencing a lot of cache misses with your current solution. The line:



if (testStore)
counters[ch] = count + 1;


might force the CPU to completely load a new cache line into memory and displace the current content. There might also be some branch-prediction problems in this scenario. This is highly hardware dependent, and I'm not aware of a really good way to test for it in a managed language (it's also quite hard in compiled languages, even where the hardware is fixed and well known).
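To make the cache-line effect tangible in isolation, here is a minimal sketch of mine (not from the original post) that performs the same number of stores sequentially and at a cache-line stride. On typical hardware with 64-byte lines, the strided pass is markedly slower because almost every store touches a cold line:

using System;
using System.Diagnostics;

class CacheLineDemo
{
    static void Main()
    {
        // 64 MB of ints - far larger than any CPU cache level.
        var data = new int[1 << 24];

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
            data[i]++;                       // sequential: ~16 stores per 64-byte line
        Console.WriteLine($"sequential: {sw.Elapsed.TotalSeconds:f3} s");

        sw.Restart();
        for (int stride = 0; stride < 16; stride++)
            for (int i = stride; i < data.Length; i += 16)
                data[i]++;                   // strided: each store lands on a different line
        Console.WriteLine($"strided:    {sw.Elapsed.TotalSeconds:f3} s");
    }
}

Both loops execute exactly the same number of increments; only the order of the memory accesses differs.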



Going through the disassembly, you can clearly see that the store also introduces a whole bunch of new instructions, which may compound the problems mentioned above.



[screenshot: disassembly showing the extra instructions introduced by the store]



Overall, I'd advise you to rewrite the complete algorithm, as there are better places to improve performance than picking at this one little assignment. These are the optimizations I'd suggest (they also improve readability):



  1. Invert your i and j loops. This will remove the allEmpty variable completely.

  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as an index.

  3. Think about why this might be a problem at all. You introduce a new instruction, and any instruction comes at a cost. If this is really the primary "hot spot" of your code, you can start to think about better solutions (remember: "premature optimization is the root of all evil").

  4. As the name suggests, testStore is a test setting - is it important at all? Just remove it.


EDIT: why did I suggest inverting the loops? With this little rearrangement of the code:



foreach (var val in vals)
{
    foreach (int ch in val)
    {
        var count = counters[ch];
        tmp ^= count;
        if (testStore)
        {
            counters[ch] = count + 1;
        }
    }
}




I go from runtimes like this:



[screenshot: runtimes before the loop inversion]



to runtimes like this:



[screenshot: runtimes after the loop inversion]



Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear, all optimizations are disabled in the settings). If there are special reasons not to do this, you should tell us more about the context in which this code will be used.




EDIT2: now for the in-depth answer. My best explanation for why this problem occurs is that you keep displacing your cache lines. In the lines:



for (var i = 0; i < vals.Length; i++)
{
    var val = vals[i];


you walk over a really massive dataset, far bigger than the cache itself. So the data will most likely have to be loaded fresh from memory into a new cache line on every iteration, displacing the old content. This is also known as "cache thrashing", if I remember correctly. Thanks to @mjwills for pointing this out in his comment.



In my suggested solution, on the other hand, the content of a cache line can stay alive as long as the inner loop has not exceeded its boundaries, which happens far less often with this direction of memory access.



This is the closest explanation I have for why my code runs that much faster, and it also supports the assumption that you have serious caching problems in your code.
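To put rough numbers on that claim, here is a back-of-the-envelope sketch (my own, assuming 64-byte cache lines and the question's data shape):

using System;

class WorkingSetEstimate
{
    static void Main()
    {
        const int strings   = 100_000; // vals.Length in the question
        const int lineBytes = 64;      // assumed cache-line size

        // In the column-major loop (j outer), one step of j reads val[j]
        // from every string, i.e. one distinct cache line per string:
        long bytesPerStep = (long)strings * lineBytes;
        Console.WriteLine($"~{bytesPerStep / (1024 * 1024)} MB of distinct lines per j step");

        // ~6 MB per j step dwarfs L1/L2, so by the time j comes back to
        // the same line of a given string, that line has long been
        // evicted, leaving little cache for the counters array as well.
    }
}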






answered Nov 9 at 6:13 by Alex, edited Nov 9 at 7:01











  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23
















  • "might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
    – user626528
    Nov 9 at 6:22






  • 1




    In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
    – Alex
    Nov 9 at 6:42






  • 1




    Nice analysis @Alex.
    – mjwills
    Nov 9 at 6:44











  • 1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
    – user626528
    Nov 9 at 15:21











  • 1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
    – Alex
    Nov 10 at 1:23















"might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
– user626528
Nov 9 at 6:22




"might be force the compiler to completely load a new cache-line into the memory and displace the current content." - why? In depth answer is what I'm actually looking for, rather than just a guess :) "Invert your i and j loop. This will remove the allEmpty variable completely." - there is a very good reason why this search (in the real app) goes breadth-first rather than depth-first.
– user626528
Nov 9 at 6:22




1




1




In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
– Alex
Nov 9 at 6:42




In-depth is beyond impossible to answer if the exact hardware and other side-effects are unknown. I don't know anything about your "good-reasons" there might be but please see my edit according to this.
– Alex
Nov 9 at 6:42




1




1




Nice analysis @Alex.
– mjwills
Nov 9 at 6:44





Nice analysis @Alex.
– mjwills
Nov 9 at 6:44













1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
– user626528
Nov 9 at 15:21





1) this algorithm (a variation of bucket/radix sort, if this helps you) doesn't work depth-first, so inverting the loops is just out of question 2) cache thrashing should be happening independently of what we're doing - reading or writing, shouldn't it?
– user626528
Nov 9 at 15:21













1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
– Alex
Nov 10 at 1:23




1) Ok, so the order is set by the algorithm. 2) This also occurs when you change the write to a read access (eg. if(testStore) dummy = counters[ch] + 1;). I still would assume that the "last" access to counters (regardless the direction) hinders the CPU/runtime/compiler (what so ever) from using better caching strategies or pre-caching sets. Without more context it's hard to tell or make better suggestions to solve your problem.
– Alex
Nov 10 at 1:23
















