Effieciently validate multiple DataRows with return early option

I'm looking for an efficient way to validate multiple rows originating from multiple .csv files. Below is the current code to accomplish that.

The validation logic is in the function ValidateRow(DataRow row) (not presented here).

ValidateRow(DataRow row)

I could think of an alternative with using Linq and performing batch.All( row => ValidateRow(row)) and adding the row to the InvalidRow property in ValidateRow. But I don't know if that is more efficient or not.

batch.All( row => ValidateRow(row))

InvalidRow

ValidateRow

public bool ValidateBatch( IEnumerable<DataRow> batch, bool exitOnFirstError = false ) batch?.Count() == 0 ) logger.Error( MSG_NO_DATAROWS ); return false; bool result = true; foreach( DataRow row in batch ) if( !ValidateRow( row ) ) result = false; // track invalid rows on a property (List<DataRow>) InvalidRows.Add( row ); if( exitOnFirstError ) return result; return result;

3 Answers
3

The if condition

if

if( batch == null || batch?.Count() == 0 ) logger.Error( MSG_NO_DATAROWS ); return false;

has some flaws:

the Null-Conditional operator ? on batch?.Count() == 0 isn't needed, because you already check for batch == null

?

batch?.Count() == 0

batch == null

If the passed IEnumerable<DataRow> isn't a Collection<T> the IEnumerable<DataRow> will be enumerated completely for calculating the Count.

Assume you have millions of items in that IEnumerable<T> guess how much time it takes to iterate twice (once for Count() and once for the foreach).

IEnumerable<DataRow>

Collection<T>

IEnumerable<DataRow>

Count

IEnumerable<T>

Count()

foreach

Having multiple statements in a single line reduces the readability of the code.

A better way would be to use the Any() method like so

Any()

if( batch == null || !batch.Any()) logger.Error( MSG_NO_DATAROWS ); return false;

I would have been tempted to keep the Null-Conditional (?.) and dropped the null check (batch == null).

– Brian J
Sep 11 '18 at 19:38

?.

batch == null

I would encourage you to take a step back and look at how this code will be used. Here's what I noticed as I read the code.

The definition of a valid batch seems to be "A batch with at least one valid row, and no invalid ones". I would have guessed that a valid batch would simply be one with no invalid rows, and that it would be checked for emptiness elsewhere. That doesn't mean the definition is wrong, just a bit surprising - and it's a good idea to avoid surprises in your code base.

It seems like your function has two purposes: One is to determine, as quickly as possible, whether a batch is valid. The other is to collect a list of the invalid rows. This sounds like it might be best as two separate functions (in fact, "should this be two functions" is my first question whenever I see a function with a bool parameter).

bool

This implementation treats the collection of invalid rows and the logging of empty batches as side effects of the validation code. This isn't necessarily a problem, but it can be inconvenient. Suppose you want to validate multiple batches in parallel - you can't, because this.InvalidRows, as a List, is not thread-safe. Suppose you want to test your validation code without touching the system logs - you can't^*, because the call to logger.Error takes place directly in the validation code.

this.InvalidRows

List

logger.Error

^{*without injecting the logger dependency}

With that in mind, you could make a one-line function for the first use case, leveraging Linq's All, which does short-circuit for efficiency:

All

public bool BatchIsValid(IEnumerable<DataRow> batch) return batch.All(ValidateRow);

And another one-liner for the second use case:

public IEnumerable<DataRow> InvalidRowsIn(IEnumerable<DataRow> batch) return batch.Where(row => !ValidateRow(row));

And, if you account for empty batches from your calling code, that may be all you need. Here's a third one-liner you could use for that (As Heslacher pointed out, Any also short-circuits for efficiency):

Any

public bool BatchIsEmpty(IEnumerable<DataRow> batch) return batch == null

In fact, I usually like to make one-line, one-argument functions into lambda-bodied extension methods. Here's what that might look like:

// In an appropriate namespace public static class ValidationExtensions public static bool IsValid(this IEnumerable<DataRow> batch) => batch.All(row => row.IsValid()); public static IEnumerable<DataRow> InvalidRows(this IEnumerable<DataRow> batch) => batch.Where(row => !row.IsValid()); public static bool IsEmpty(this IEnumerable<DataRow> batch) => batch is null

And that might make your calling code look like this (I am guessing here at what the larger context may be):

public void Process(IEnumerable<DataRow> batch) if (batch.IsEmpty()) logger.Error(MSG_NO_DATAROWS); return; if (!batch.IsValid()) foreach (var row in batch.InvalidRows()) LogDataError(row); return; foreach (var row in batch) Process(row);

BUT at this point you might ask what you're gaining from those extension methods. Why not just use Linq's All, Where, and Any directly? And indeed, that looks about the same (and without hiding any logic in extension methods):

All

Where

Any

public void Process(IEnumerable<DataRow> batch) if (batch is null

This is the track I usually take with these sorts of functions: Very rarely do I find it's "worth it" to filter a list or find a list element within my own foreach or function. The only reason I might go that direction is for absolute efficiency, when it's very important that I iterate the list no more than once (perhaps because I am lazily streaming a very large amount of data). That might look like this:

foreach

public bool Process(IEnumerable<DataRow> batch, out List<DataRow> invalidRows) if (batch is null) logger.Error(MSG_NO_DATAROWS); return; bool allRowsAreValid = true; bool batchIsEmpty = true; invalidRows = new List<DataRow>(); foreach (var row in batch) batchIsEmpty = false; if (IsValid(row)) Process(Row); else allRowsAreValid = false; invalidRows.Add(row); if (batchIsEmpty) logger.Error(MSG_NO_DATAROWS); return; return allRowsAreValid;

Note that this solution uses no Linq. Everything (empty checks, logging, validation, collecting invalid rows, processing) is integrated because that's the most efficient way to do it. The next measures to improve efficiency would cause even bigger hits to the quality of life of the maintenance programmer: things like switching to a while loop directly on your DataReader, or switching batchIsEmpty from bool to int so that the body of the foreach could be made into a switch, suggesting to the compiler that a jump table should be used to avoid touching that variable after the first iteration...

while

DataReader

batchIsEmpty

bool

int

foreach

switch

I hope you also note that this solution is more difficult to read. It's up to you to choose the best balance between efficiency and separation of concerns.

Thank you for this great answer! I don't know about switching the accepted answer afterwards, so it stays where it is but I just want you to know this helped a lot too. Regarding using the IsValid call in seperate (one-line) functions: I would prefer it that way instead of your other consideration to make the calls inline with LINQ where the rows are processed. First to decouple the validator and processor (Seperation of concerns) and second to not bloat that method anymore (there is happening a lot already).

– ftp.x32
Sep 12 '18 at 11:09

Part 2: I understand why you would seperate the collection of invalid rows and the validation of them. But there are possibly large datasets in validation so I would want to avoid walking those collection twice

– ftp.x32
Sep 12 '18 at 11:11

That makes perfect sense; having a large data set is a good reason to approach the code differently. I'm glad you found this helpful!

– benj2240
Sep 12 '18 at 15:07

if you follow the logic of the code that you have presented it does not matter whether you exit on the first error or not, after the first error the result variable will always return false, so there is no need to keep looping regardless of the value of exitOnFirstError except if you want to keep validating rows.

result

exitOnFirstError

Personally I would make this a void method That calls the ValidateRow method, but, and here is the big thing to keep in mind, the ValidateRow() method should be the one calling InvalidRows.Add( row ); for rows that are not valid

ValidateRow

ValidateRow()

InvalidRows.Add( row );

let's say that we don't change the return types of any of the methods listed here it would look like this:

public bool ValidateBatch( IEnumerable<DataRow> batch, bool exitOnFirstError = false ) batch?.Count() == 0 ) logger.Error( MSG_NO_DATAROWS ); return false; bool result = true; foreach( DataRow row in batch ) if( !ValidateRow( row ) ) result = false; if( exitOnFirstError ) return result; return result;

I also took your code and formatted it according to common formatting standards for C#, you should not one line a multi statement if block, it makes it hard to read.

Thanks for contributing an answer to Code Review Stack Exchange!

But avoid …

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt