How to implement and do OCR in a C# project?

I've been searching for a while, and all I've found are requests for OCR libraries. I would like to know how to set up the simplest, easiest-to-install OCR library, with detailed installation instructions for a C# project.

If possible, I just want to add it like a usual DLL reference...

Example:

using org.pdfbox.pdmodel;
using org.pdfbox.util;

Also a little OCR code example would be nice, such as:

public string OCRFromBitmap(Bitmap Bmp)
{
    Bmp.Save(temppath, System.Drawing.Imaging.ImageFormat.Tiff);
    string OcrResult = Analyze(temppath);
    File.Delete(temppath);
    return OcrResult;
}

So please keep in mind that I'm not familiar with OCR projects, and give me an answer as if you were talking to a dummy.

Edit:
I guess people misunderstood my request. I wanted to know how to add those open source OCR libraries to a C# project and how to use them. The link given as a duplicate does not give the answers I asked for at all.

@Polynomial As I mentioned above, this does not give the answer I asked for. It's not a duplicate; it's a different question. I want to see some code showing how to use that library. Please read the question carefully.
– Berker Yüceer
Jun 8 '12 at 10:56

5 Answers

Here's one: (check out http://hongouru.blogspot.ie/2011/09/c-ocr-optical-character-recognition.html or http://www.codeproject.com/Articles/41709/How-To-Use-Office-2007-OCR-Using-C for more info)

using System;
using MODI;

static void Main(string[] args)
{
    DocumentClass myDoc = new DocumentClass();
    myDoc.Create(@"theDocumentName.tiff"); // we work with the .tiff extension
    myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

    foreach (Image anImage in myDoc.Images)
    {
        Console.WriteLine(anImage.Layout.Text); // write the recognized text to the console
    }
}

How do I get MODI? I do have Microsoft Office 2010 & 2013 installed.
– mYnDstrEAm
Aug 16 '16 at 7:33

I have MS Office, but the references can't be resolved (they have the yellow warning triangle), so the project won't build.
– Ewan
Mar 23 '17 at 13:49

If anyone is looking into this, I've been trying different options and the following approach yields very good results. The following are the steps to get a working example:

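1. Add the .NET wrapper for Tesseract to your project through NuGet: Install-Package Tesseract
2. Download the language data you need, for example tesseract-ocr-3.02.eng.tar.gz (English language data for Tesseract 3.02).
3. Create a tessdata directory in your project and place the language data files in it.
4. Open the Properties of the newly added files and set them to be copied to the output directory on build.
5. Add a reference to System.Drawing.
6. From the Samples directory of the wrapper's repository, copy phototest.tif into your project directory and set it to be copied on build as well.
7. Create the following two files in your project (just to get started):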

Program.cs

using System;
using System.Diagnostics;
using Tesseract;

namespace ConsoleApplication
{
    class Program
    {
        public static void Main(string[] args)
        {
            // Default sample image; a different path can be passed as the first argument.
            var testImagePath = "./phototest.tif";
            if (args.Length > 0)
            {
                testImagePath = args[0];
            }

            try
            {
                var logger = new FormattedConsoleLogger();
                var resultPrinter = new ResultPrinter(logger);
                using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
                using (var img = Pix.LoadFromFile(testImagePath))
                using (logger.Begin("Process image"))
                using (var page = engine.Process(img))
                {
                    var i = 1;
                    var text = page.GetText();
                    logger.Log("Text: {0}", text);
                    logger.Log("Mean confidence: {0}", page.GetMeanConfidence());

                    using (var iter = page.GetIterator())
                    {
                        iter.Begin();
                        do
                        {
                            // Only every second line is logged word by word.
                            if (i % 2 == 0)
                            {
                                using (logger.Begin("Line {0}", i))
                                {
                                    do
                                    {
                                        using (logger.Begin("Word Iteration"))
                                        {
                                            if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                                            {
                                                logger.Log("New block");
                                            }
                                            if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                            {
                                                logger.Log("New paragraph");
                                            }
                                            if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                            {
                                                logger.Log("New line");
                                            }
                                            // Pass the text as an argument so braces in the
                                            // recognized text are not treated as format items.
                                            logger.Log("word: {0}", iter.GetText(PageIteratorLevel.Word));
                                        }
                                    } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                                }
                            }
                            i++;
                        } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                    }
                }
            }
            catch (Exception e)
            {
                Trace.TraceError(e.ToString());
                Console.WriteLine("Unexpected Error: " + e.Message);
                Console.WriteLine("Details: ");
                Console.WriteLine(e.ToString());
            }

            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }

        private class ResultPrinter
        {
            readonly FormattedConsoleLogger logger;

            public ResultPrinter(FormattedConsoleLogger logger)
            {
                this.logger = logger;
            }

            public void Print(ResultIterator iter)
            {
                logger.Log("Is beginning of block: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Block));
                logger.Log("Is beginning of para: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Para));
                logger.Log("Is beginning of text line: {0}", iter.IsAtBeginningOf(PageIteratorLevel.TextLine));
                logger.Log("Is beginning of word: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Word));
                logger.Log("Is beginning of symbol: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Symbol));

                logger.Log("Block text: \"{0}\"", iter.GetText(PageIteratorLevel.Block));
                logger.Log("Para text: \"{0}\"", iter.GetText(PageIteratorLevel.Para));
                logger.Log("TextLine text: \"{0}\"", iter.GetText(PageIteratorLevel.TextLine));
                logger.Log("Word text: \"{0}\"", iter.GetText(PageIteratorLevel.Word));
                logger.Log("Symbol text: \"{0}\"", iter.GetText(PageIteratorLevel.Symbol));
            }
        }
    }
}

FormattedConsoleLogger.cs

using System;
using System.Collections.Generic;
using System.Text;
using Tesseract;

namespace ConsoleApplication
{
    public class FormattedConsoleLogger
    {
        const string Tab = "    ";

        private class Scope : DisposableBase
        {
            private int indentLevel;
            private string indent;
            private FormattedConsoleLogger container;

            public Scope(FormattedConsoleLogger container, int indentLevel)
            {
                this.container = container;
                this.indentLevel = indentLevel;
                StringBuilder indent = new StringBuilder();
                for (int i = 0; i < indentLevel; i++)
                {
                    indent.Append(Tab);
                }
                this.indent = indent.ToString();
            }

            public void Log(string format, object[] args)
            {
                var message = String.Format(format, args);
                StringBuilder indentedMessage = new StringBuilder(message.Length + indent.Length * 10);
                int i = 0;
                bool isNewLine = true;
                // Copy the message character by character, prefixing every line with the indent.
                while (i < message.Length)
                {
                    if (message[i] == '\r' && i + 1 < message.Length && message[i + 1] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i += 2;
                    }
                    else if (message[i] == '\r' || message[i] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i++;
                    }
                    else
                    {
                        if (isNewLine)
                        {
                            indentedMessage.Append(indent);
                            isNewLine = false;
                        }
                        indentedMessage.Append(message[i]);
                        i++;
                    }
                }

                Console.WriteLine(indentedMessage.ToString());
            }

            public Scope Begin()
            {
                return new Scope(container, indentLevel + 1);
            }

            protected override void Dispose(bool disposing)
            {
                if (disposing)
                {
                    // Scopes must be closed in the reverse order they were opened.
                    var scope = container.scopes.Pop();
                    if (scope != this)
                    {
                        throw new InvalidOperationException("Format scope removed out of order.");
                    }
                }
            }
        }

        private Stack<Scope> scopes = new Stack<Scope>();

        public IDisposable Begin(string title = "", params object[] args)
        {
            Log(title, args);
            Scope scope;
            if (scopes.Count == 0)
            {
                scope = new Scope(this, 1);
            }
            else
            {
                scope = ActiveScope.Begin();
            }
            scopes.Push(scope);
            return scope;
        }

        public void Log(string format, params object[] args)
        {
            if (scopes.Count > 0)
            {
                ActiveScope.Log(format, args);
            }
            else
            {
                Console.WriteLine(String.Format(format, args));
            }
        }

        private Scope ActiveScope
        {
            get
            {
                var top = scopes.Peek();
                if (top == null) throw new InvalidOperationException("No current scope");
                return top;
            }
        }
    }
}
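For the shape the question asks for (a single method that takes a Bitmap and returns a string), a minimal sketch on top of the same wrapper might look like the following. It only uses calls that already appear in Program.cs above (TesseractEngine, Pix.LoadFromFile, Process, GetText). The OcrHelper class name and the tessdataPath parameter are made up for illustration, and it assumes the tessdata folder with the English data is set up as in the steps above.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Tesseract;

// Hypothetical helper; the name is not part of the Tesseract wrapper.
public static class OcrHelper
{
    // tessdataPath is an assumed parameter pointing at the folder with the language data.
    public static string OcrFromBitmap(Bitmap bmp, string tessdataPath = "./tessdata")
    {
        // Write the bitmap to a uniquely named temporary TIFF file.
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".tif");
        try
        {
            bmp.Save(tempPath, ImageFormat.Tiff);

            using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.Default))
            using (var img = Pix.LoadFromFile(tempPath))
            using (var page = engine.Process(img))
            {
                return page.GetText();
            }
        }
        finally
        {
            // Remove the temporary file whether or not OCR succeeded.
            File.Delete(tempPath);
        }
    }
}
```

Creating the engine on every call keeps the sketch short; if you process many images, it is probably worth creating one TesseractEngine and reusing it.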

I wish I could vote more than once because this is such a good instruction to get that thing running.
– BloodyRain2k
Dec 25 '15 at 19:06

@BloodyRain2k I'm glad that you found it useful. Thank you for the kind words.
– B.K.
Dec 25 '15 at 22:29

I used the link you mentioned above. In the eng folder (github.com/tesseract-ocr/langdata/tree/master/eng), one file is missing: eng.traineddata. Please add this file too.
– Mughees Musaddiq
Feb 26 '16 at 15:09

@MugheesMusaddiq They change the files a lot, which is why I was reluctant to post any links: they're not guaranteed to stay the same down the line. This is meant more as a guide on how to get started, and the lack of any guarantee for links is why I've pasted so much code here.
– B.K.
Feb 27 '16 at 1:25

Old versions of the language data can be downloaded here: sourceforge.net/projects/tesseract-ocr-alt/files (useful because, as of right now, the NuGet package is version 3.02 and the only language data available on the site linked above is 3.04). Alternatively, the Wayback Machine can be used.
– mYnDstrEAm
Aug 16 '16 at 7:35

I'm using the Tesseract OCR engine with TessNet2 (a C# wrapper - http://www.pixel-technology.com/freeware/tessnet2/).

Some basic code:

using tessnet2;

...

Bitmap image = new Bitmap(@"u:\user files\bwalker\2849257.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,$-/#&=()\"':?"); // Accepted characters
ocr.Init(@"C:\Users\bwalker\Documents\Visual Studio 2010\Projects\tessnetWinForms\tessnetWinForms\bin\Release", "eng", false); // Directory of your tessdata folder
List<tessnet2.Word> result = ocr.DoOCR(image, System.Drawing.Rectangle.Empty);
string Results = "";
foreach (tessnet2.Word word in result)
{
    Results += word.Confidence + ", " + word.Text + ", " + word.Left + ", " + word.Top + ", " + word.Bottom + ", " + word.Right + "\n";
}

In your link, there is another link, "Download binary here", and it doesn't work. In fact, this link is on many websites and it doesn't work on any of them. Does anyone know where tessnet2.dll can be downloaded from?
– Ewan
Mar 22 '17 at 13:13

I actually found tessnet2 in NuGet; not sure why I didn't look there first. It stops on the ocr.Init line when I run it, though. Is there meant to be something specific in that directory? tessnet2_32.dll is in my "tessdata" folder, as is my application exe file. Any idea why it stops? It simply doesn't do anything.
– Ewan
Mar 23 '17 at 15:30

There is a .NET wrapper for Tesseract 3.01: https://github.com/charlesw/tesseract-ocr-dotnet

Another option is to use Neevia Document Converter, which has built-in OCR capability. You can run pretty much any file type through it, and it will produce a PDF that is essentially one big text document, which you can then open and search through using iTextSharp.
