How to implement and do OCR in a C# project?




I've been searching for a while, and all I've found are requests for OCR libraries. I would like to know how to implement the simplest OCR library, one that is easy to install and use, with detailed instructions for installing it into a C# project.



If possible, I just want to add it like a usual DLL reference...



Example:


using org.pdfbox.pdmodel;
using org.pdfbox.util;



Also a little OCR code example would be nice, such as:


public string OCRFromBitmap(Bitmap Bmp)
{
    Bmp.Save(temppath, System.Drawing.Imaging.ImageFormat.Tiff);
    string OcrResult = Analyze(temppath);
    File.Delete(temppath);
    return OcrResult;
}



So please keep in mind that I'm not familiar with OCR projects, and give me an answer as if you were talking to a dummy.



Edit:
I guess people misunderstood my request. I wanted to know how to add those open-source OCR libraries to a C# project and how to use them. The link given as a duplicate does not provide the answers I asked for at all.





@Polynomial As I mentioned above, this does not give the answer I requested. It's not a duplicate; it differs. I want to see some code showing how to implement that library. Please read the question carefully.
– Berker Yüceer
Jun 8 '12 at 10:56





5 Answers



Here's one: (check out http://hongouru.blogspot.ie/2011/09/c-ocr-optical-character-recognition.html or http://www.codeproject.com/Articles/41709/How-To-Use-Office-2007-OCR-Using-C for more info)


using System;
using MODI;

static void Main(string[] args)
{
    DocumentClass myDoc = new DocumentClass();
    myDoc.Create(@"theDocumentName.tiff"); // we work with the .tiff extension
    myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

    foreach (Image anImage in myDoc.Images)
    {
        Console.WriteLine(anImage.Layout.Text); // print the recognized text to the console
    }
}





How do I get MODI? I do have Microsoft Office 2010 & 2013 installed.
– mYnDstrEAm
Aug 16 '16 at 7:33





I have MS Office, but the references can't be resolved (they have the yellow warning triangle), and the project therefore won't build.
– Ewan
Mar 23 '17 at 13:49



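If anyone is looking into this, I've been trying different options, and the following approach yields very good results. These are the steps to get a working example:

1. Add the .NET wrapper for Tesseract to your project. It can be added via the NuGet package: Install-Package Tesseract
2. Download the language data you need, for example: tesseract-ocr-3.02.eng.tar.gz (English language data for Tesseract 3.02).
3. Create a tessdata directory in your project and place the language data files in it.
4. Go to Properties of the newly added files and set them to copy on build.
5. Add a reference to System.Drawing.
6. From the .NET wrapper repository, copy the sample phototest.tif file from the Samples directory into your project directory and set it to copy on build.
7. Create the following two files in your project.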



Program.cs


using System;
using Tesseract;
using System.Diagnostics;

namespace ConsoleApplication
{
    class Program
    {
        public static void Main(string[] args)
        {
            // Default to the sample image; a different path can be passed as the first argument.
            var testImagePath = "./phototest.tif";
            if (args.Length > 0)
            {
                testImagePath = args[0];
            }

            try
            {
                var logger = new FormattedConsoleLogger();
                var resultPrinter = new ResultPrinter(logger);
                using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
                {
                    using (var img = Pix.LoadFromFile(testImagePath))
                    {
                        using (logger.Begin("Process image"))
                        {
                            var i = 1;
                            using (var page = engine.Process(img))
                            {
                                var text = page.GetText();
                                logger.Log("Text: {0}", text);
                                logger.Log("Mean confidence: {0}", page.GetMeanConfidence());

                                using (var iter = page.GetIterator())
                                {
                                    iter.Begin();
                                    do
                                    {
                                        // Only every second text line is logged in detail.
                                        if (i % 2 == 0)
                                        {
                                            using (logger.Begin("Line {0}", i))
                                            {
                                                do
                                                {
                                                    using (logger.Begin("Word Iteration"))
                                                    {
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                                                        {
                                                            logger.Log("New block");
                                                        }
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                                        {
                                                            logger.Log("New paragraph");
                                                        }
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                                        {
                                                            logger.Log("New line");
                                                        }
                                                        // Pass the word as an argument so braces in the OCR text cannot break String.Format.
                                                        logger.Log("word: {0}", iter.GetText(PageIteratorLevel.Word));
                                                    }
                                                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                                            }
                                        }
                                        i++;
                                    } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Trace.TraceError(e.ToString());
                Console.WriteLine("Unexpected Error: " + e.Message);
                Console.WriteLine("Details: ");
                Console.WriteLine(e.ToString());
            }
            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }

        private class ResultPrinter
        {
            readonly FormattedConsoleLogger logger;

            public ResultPrinter(FormattedConsoleLogger logger)
            {
                this.logger = logger;
            }

            public void Print(ResultIterator iter)
            {
                logger.Log("Is beginning of block: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Block));
                logger.Log("Is beginning of para: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Para));
                logger.Log("Is beginning of text line: {0}", iter.IsAtBeginningOf(PageIteratorLevel.TextLine));
                logger.Log("Is beginning of word: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Word));
                logger.Log("Is beginning of symbol: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Symbol));

                logger.Log("Block text: \"{0}\"", iter.GetText(PageIteratorLevel.Block));
                logger.Log("Para text: \"{0}\"", iter.GetText(PageIteratorLevel.Para));
                logger.Log("TextLine text: \"{0}\"", iter.GetText(PageIteratorLevel.TextLine));
                logger.Log("Word text: \"{0}\"", iter.GetText(PageIteratorLevel.Word));
                logger.Log("Symbol text: \"{0}\"", iter.GetText(PageIteratorLevel.Symbol));
            }
        }
    }
}


FormattedConsoleLogger.cs


using System;
using System.Collections.Generic;
using System.Text;
using Tesseract;

namespace ConsoleApplication
{
    public class FormattedConsoleLogger
    {
        const string Tab = "    ";

        // One nesting level of log output; disposing it pops the level again.
        private class Scope : DisposableBase
        {
            private int indentLevel;
            private string indent;
            private FormattedConsoleLogger container;

            public Scope(FormattedConsoleLogger container, int indentLevel)
            {
                this.container = container;
                this.indentLevel = indentLevel;
                StringBuilder indent = new StringBuilder();
                for (int i = 0; i < indentLevel; i++)
                {
                    indent.Append(Tab);
                }
                this.indent = indent.ToString();
            }

            public void Log(string format, object[] args)
            {
                var message = String.Format(format, args);
                StringBuilder indentedMessage = new StringBuilder(message.Length + indent.Length * 10);
                int i = 0;
                bool isNewLine = true;
                // The loop body was missing from this listing. This is a reconstruction
                // that prefixes every line of the message with the current indent.
                while (i < message.Length)
                {
                    if (message[i] == '\r' && i + 1 < message.Length && message[i + 1] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i += 2;
                    }
                    else if (message[i] == '\r' || message[i] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i++;
                    }
                    else
                    {
                        if (isNewLine)
                        {
                            indentedMessage.Append(indent);
                            isNewLine = false;
                        }
                        indentedMessage.Append(message[i]);
                        i++;
                    }
                }

                Console.WriteLine(indentedMessage.ToString());
            }

            public Scope Begin()
            {
                return new Scope(container, indentLevel + 1);
            }

            protected override void Dispose(bool disposing)
            {
                if (disposing)
                {
                    var scope = container.scopes.Pop();
                    if (scope != this)
                    {
                        throw new InvalidOperationException("Format scope removed out of order.");
                    }
                }
            }
        }

        private Stack<Scope> scopes = new Stack<Scope>();

        public IDisposable Begin(string title = "", params object[] args)
        {
            Log(title, args);
            Scope scope;
            if (scopes.Count == 0)
            {
                scope = new Scope(this, 1);
            }
            else
            {
                scope = ActiveScope.Begin();
            }
            scopes.Push(scope);
            return scope;
        }

        public void Log(string format, params object[] args)
        {
            if (scopes.Count > 0)
            {
                ActiveScope.Log(format, args);
            }
            else
            {
                Console.WriteLine(String.Format(format, args));
            }
        }

        private Scope ActiveScope
        {
            get
            {
                var top = scopes.Peek();
                if (top == null) throw new InvalidOperationException("No current scope");
                return top;
            }
        }
    }
}
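
If all you need is the recognized text, as in the OCRFromBitmap method from the question, the sample above can probably be reduced to a few lines. The following is only a minimal sketch under some assumptions: it uses the same Tesseract wrapper calls as Program.cs, it expects the language data in ./tessdata, and the names SimpleOcr, HasLanguageData and OcrFromBitmap are made up for illustration and are not part of the library. The check for a "<lang>.traineddata" file is also an assumption about how the language data is laid out.

```csharp
using System;
using System.Drawing;
using System.IO;
using Tesseract;

public static class SimpleOcr
{
    // Hypothetical helper: true if "<lang>.traineddata" exists in tessdataDir.
    public static bool HasLanguageData(string tessdataDir, string lang)
    {
        return File.Exists(Path.Combine(tessdataDir, lang + ".traineddata"));
    }

    // Saves the bitmap to a temporary TIFF file, runs OCR on it and returns the text.
    public static string OcrFromBitmap(Bitmap bmp, string tessdataDir = "./tessdata", string lang = "eng")
    {
        if (!HasLanguageData(tessdataDir, lang))
        {
            throw new FileNotFoundException(
                "Language data not found; check that it is copied on build.",
                Path.Combine(tessdataDir, lang + ".traineddata"));
        }

        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".tif");
        // Fully qualified because the Tesseract namespace has its own ImageFormat type.
        bmp.Save(tempPath, System.Drawing.Imaging.ImageFormat.Tiff);
        try
        {
            using (var engine = new TesseractEngine(tessdataDir, lang, EngineMode.Default))
            using (var img = Pix.LoadFromFile(tempPath))
            using (var page = engine.Process(img))
            {
                return page.GetText();
            }
        }
        finally
        {
            // Remove the temporary file even if OCR throws.
            File.Delete(tempPath);
        }
    }
}
```

Calling it would then look like string text = SimpleOcr.OcrFromBitmap(myBitmap);. Creating a TesseractEngine for every call keeps the sketch short; if you process many images, it is likely better to create the engine once and reuse it.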








I wish I could vote more than once because this is such a good instruction to get that thing running.
– BloodyRain2k
Dec 25 '15 at 19:06





@BloodyRain2k I'm glad that you found it useful. Thank you for the kind words.
– B.K.
Dec 25 '15 at 22:29





I used the link you mentioned above. In the eng folder (github.com/tesseract-ocr/langdata/tree/master/eng), one file is missing: eng.traineddata. Please add this file too.
– Mughees Musaddiq
Feb 26 '16 at 15:09





@MugheesMusaddiq They keep on changing the files a lot, that's why I was reluctant to put any links, as they're not guaranteed to be the same down the line. This is meant more as a guide on how to get started and the lack of link guarantee is why I've pasted so much code here.
– B.K.
Feb 27 '16 at 1:25





Old versions of the language data can be downloaded here: sourceforge.net/projects/tesseract-ocr-alt/files (e.g. because as of right now the NuGet package is version 3.02 and the only language data available on the site linked above is 3.04; alternatively, the Wayback Machine can be used)
– mYnDstrEAm
Aug 16 '16 at 7:35




I'm using the Tesseract OCR engine with TessNet2 (a C# wrapper: http://www.pixel-technology.com/freeware/tessnet2/).



Some basic code:


using System.Collections.Generic;
using System.Drawing;
using tessnet2;



...


Bitmap image = new Bitmap(@"u:\user files\bwalker\2849257.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,$-/#&=()\"':?"); // Accepted characters
ocr.Init(@"C:\Users\bwalker\Documents\Visual Studio 2010\Projects\tessnetWinForms\tessnetWinForms\bin\Release\", "eng", false); // Directory of your tessdata folder
List<tessnet2.Word> result = ocr.DoOCR(image, System.Drawing.Rectangle.Empty);
string Results = "";
foreach (tessnet2.Word word in result)
{
    // One line per recognized word: confidence, text and bounding box.
    Results += word.Confidence + ", " + word.Text + ", " + word.Left + ", " + word.Top + ", " + word.Bottom + ", " + word.Right + "\n";
}





In your link, there is another link "Download binary here" and it doesn't work. In fact this link is on many websites and it doesn't work on any of them. Does anyone know where the tessnet2.dll can be downloaded from?
– Ewan
Mar 22 '17 at 13:13





I actually found tessnet2 in NuGet; not sure why I didn't look there first. It stops on the ocr.Init line when I run it, though. Is there meant to be something specific in that directory? tessnet2_32.dll is in my "tessdata" folder, as is my application exe file. Any idea why it stops? It simply doesn't do anything.
– Ewan
Mar 23 '17 at 15:30




There is a .NET wrapper for Tesseract 3.01: https://github.com/charlesw/tesseract-ocr-dotnet



Another option is to use Neevia Document Converter, which has built-in OCR capability. You can run pretty much any file type through it, and it will produce a PDF that is essentially a big text document, which you can then open and search through using iTextSharp.





