Effieciently validate multiple DataRows with return early option
Effieciently validate multiple DataRows with return early option
I'm looking for an efficient way to validate multiple rows originating from multiple .csv files. Below is the current code to accomplish that.
The validation logic is in the function ValidateRow(DataRow row)
(not presented here).
ValidateRow(DataRow row)
I could think of an alternative with using Linq and performing batch.All( row => ValidateRow(row))
and adding the row to the InvalidRow
property in ValidateRow
. But I don't know if that is more efficient or not.
batch.All( row => ValidateRow(row))
InvalidRow
ValidateRow
public bool ValidateBatch( IEnumerable<DataRow> batch, bool exitOnFirstError = false )
batch?.Count() == 0 ) logger.Error( MSG_NO_DATAROWS ); return false;
bool result = true;
foreach( DataRow row in batch )
if( !ValidateRow( row ) )
result = false;
// track invalid rows on a property (List<DataRow>)
InvalidRows.Add( row );
if( exitOnFirstError )
return result;
return result;
3 Answers
3
The if
condition
if
if( batch == null || batch?.Count() == 0 ) logger.Error( MSG_NO_DATAROWS ); return false;
has some flaws:
the Null-Conditional operator ?
on batch?.Count() == 0
isn't needed, because you already check for batch == null
?
batch?.Count() == 0
batch == null
If the passed IEnumerable<DataRow>
isn't a Collection<T>
the IEnumerable<DataRow>
will be enumerated completely for calculating the Count
.
Assume you have millions of items in that IEnumerable<T>
guess how much time it takes to iterate twice (once for Count()
and once for the foreach
).
IEnumerable<DataRow>
Collection<T>
IEnumerable<DataRow>
Count
IEnumerable<T>
Count()
foreach
Having multiple statements in a single line reduces the readability of the code.
A better way would be to use the Any()
method like so
Any()
if( batch == null || !batch.Any())
logger.Error( MSG_NO_DATAROWS );
return false;
?.
batch == null
I would encourage you to take a step back and look at how this code will be used. Here's what I noticed as I read the code.
The definition of a valid batch seems to be "A batch with at least one valid row, and no invalid ones". I would have guessed that a valid batch would simply be one with no invalid rows, and that it would be checked for emptiness elsewhere. That doesn't mean the definition is wrong, just a bit surprising - and it's a good idea to avoid surprises in your code base.
It seems like your function has two purposes: One is to determine, as quickly as possible, whether a batch is valid. The other is to collect a list of the invalid rows. This sounds like it might be best as two separate functions (in fact, "should this be two functions" is my first question whenever I see a function with a bool
parameter).
bool
This implementation treats the collection of invalid rows and the logging of empty batches as side effects of the validation code. This isn't necessarily a problem, but it can be inconvenient. Suppose you want to validate multiple batches in parallel - you can't, because this.InvalidRows
, as a List
, is not thread-safe. Suppose you want to test your validation code without touching the system logs - you can't*, because the call to logger.Error
takes place directly in the validation code.
this.InvalidRows
List
logger.Error
*without injecting the logger dependency
With that in mind, you could make a one-line function for the first use case, leveraging Linq's All
, which does short-circuit for efficiency:
All
public bool BatchIsValid(IEnumerable<DataRow> batch)
return batch.All(ValidateRow);
And another one-liner for the second use case:
public IEnumerable<DataRow> InvalidRowsIn(IEnumerable<DataRow> batch)
return batch.Where(row => !ValidateRow(row));
And, if you account for empty batches from your calling code, that may be all you need. Here's a third one-liner you could use for that (As Heslacher pointed out, Any
also short-circuits for efficiency):
Any
public bool BatchIsEmpty(IEnumerable<DataRow> batch)
return batch == null
In fact, I usually like to make one-line, one-argument functions into lambda-bodied extension methods. Here's what that might look like:
// In an appropriate namespace
public static class ValidationExtensions
public static bool IsValid(this IEnumerable<DataRow> batch)
=> batch.All(row => row.IsValid());
public static IEnumerable<DataRow> InvalidRows(this IEnumerable<DataRow> batch)
=> batch.Where(row => !row.IsValid());
public static bool IsEmpty(this IEnumerable<DataRow> batch)
=> batch is null
And that might make your calling code look like this (I am guessing here at what the larger context may be):
public void Process(IEnumerable<DataRow> batch)
if (batch.IsEmpty())
logger.Error(MSG_NO_DATAROWS);
return;
if (!batch.IsValid())
foreach (var row in batch.InvalidRows())
LogDataError(row);
return;
foreach (var row in batch)
Process(row);
BUT at this point you might ask what you're gaining from those extension methods. Why not just use Linq's All
, Where
, and Any
directly? And indeed, that looks about the same (and without hiding any logic in extension methods):
All
Where
Any
public void Process(IEnumerable<DataRow> batch)
if (batch is null
This is the track I usually take with these sorts of functions: Very rarely do I find it's "worth it" to filter a list or find a list element within my own foreach
or function. The only reason I might go that direction is for absolute efficiency, when it's very important that I iterate the list no more than once (perhaps because I am lazily streaming a very large amount of data). That might look like this:
foreach
public bool Process(IEnumerable<DataRow> batch, out List<DataRow> invalidRows)
if (batch is null)
logger.Error(MSG_NO_DATAROWS);
return;
bool allRowsAreValid = true;
bool batchIsEmpty = true;
invalidRows = new List<DataRow>();
foreach (var row in batch)
batchIsEmpty = false;
if (IsValid(row))
Process(Row);
else
allRowsAreValid = false;
invalidRows.Add(row);
if (batchIsEmpty)
logger.Error(MSG_NO_DATAROWS);
return;
return allRowsAreValid;
Note that this solution uses no Linq. Everything (empty checks, logging, validation, collecting invalid rows, processing) is integrated because that's the most efficient way to do it. The next measures to improve efficiency would cause even bigger hits to the quality of life of the maintenance programmer: things like switching to a while
loop directly on your DataReader
, or switching batchIsEmpty
from bool
to int
so that the body of the foreach
could be made into a switch
, suggesting to the compiler that a jump table should be used to avoid touching that variable after the first iteration...
while
DataReader
batchIsEmpty
bool
int
foreach
switch
I hope you also note that this solution is more difficult to read. It's up to you to choose the best balance between efficiency and separation of concerns.
$begingroup$
Thank you for this great answer! I don't know about switching the accepted answer afterwards, so it stays where it is but I just want you to know this helped a lot too. Regarding using the IsValid call in seperate (one-line) functions: I would prefer it that way instead of your other consideration to make the calls inline with LINQ where the rows are processed. First to decouple the validator and processor (Seperation of concerns) and second to not bloat that method anymore (there is happening a lot already).
$endgroup$
– ftp.x32
Sep 12 '18 at 11:09
$begingroup$
Part 2: I understand why you would seperate the collection of invalid rows and the validation of them. But there are possibly large datasets in validation so I would want to avoid walking those collection twice
$endgroup$
– ftp.x32
Sep 12 '18 at 11:11
$begingroup$
That makes perfect sense; having a large data set is a good reason to approach the code differently. I'm glad you found this helpful!
$endgroup$
– benj2240
Sep 12 '18 at 15:07
if you follow the logic of the code that you have presented it does not matter whether you exit on the first error or not, after the first error the result
variable will always return false, so there is no need to keep looping regardless of the value of exitOnFirstError
except if you want to keep validating rows.
result
exitOnFirstError
Personally I would make this a void method That calls the ValidateRow
method, but, and here is the big thing to keep in mind, the ValidateRow()
method should be the one calling InvalidRows.Add( row );
for rows that are not valid
ValidateRow
ValidateRow()
InvalidRows.Add( row );
let's say that we don't change the return types of any of the methods listed here it would look like this:
public bool ValidateBatch( IEnumerable<DataRow> batch, bool exitOnFirstError = false )
batch?.Count() == 0 )
logger.Error( MSG_NO_DATAROWS );
return false;
bool result = true;
foreach( DataRow row in batch )
if( !ValidateRow( row ) )
result = false;
if( exitOnFirstError )
return result;
return result;
I also took your code and formatted it according to common formatting standards for C#, you should not one line a multi statement if block, it makes it hard to read.
Thanks for contributing an answer to Code Review Stack Exchange!
But avoid …
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
$begingroup$
I would have been tempted to keep the Null-Conditional (
?.
) and dropped the null check (batch == null
).$endgroup$
– Brian J
Sep 11 '18 at 19:38