
Automatic Language detection from Images for OCR character Extraction

1# CyborgSuraj, published 2017-12-07 06:20:36Z

I am building software in Python to which an image is uploaded. The software extracts the text using Tesseract OCR.

But I want it to detect the languages in the image automatically and then extract the detected text.

Please suggest some ways to do that. I am ready to use machine learning as well, but I can't work out a good pipeline for the process.

Thanks in advance.

2# Sanat taori, replied 2017-12-07 06:44:25Z

Tesseract has script detection within its "OSD" (orientation and script detection) mode, but not language detection. It cannot detect the language automatically; you have to specify it.
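To illustrate what OSD does give you, here is a minimal sketch that reads the script name (for example Latin or Devanagari) from Tesseract's OSD report. It assumes pytesseract and Pillow are installed and that the `osd` traineddata file is available; `parse_osd` and `detect_script` are hypothetical helper names, not part of pytesseract.

```python
def parse_osd(osd_text):
    """Turn Tesseract's 'Key: value' OSD report into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'Key: value' pairs
            fields[key.strip()] = value.strip()
    return fields


def detect_script(image_path):
    """Return (script_name, confidence) for an image; needs osd.traineddata."""
    # Imported here so parse_osd can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    osd = parse_osd(pytesseract.image_to_osd(Image.open(image_path)))
    return osd.get("Script"), float(osd.get("Script confidence", 0))
```

The script only narrows the choice: Latin, for instance, still leaves English, French, German and many others, so the language itself has to be found some other way.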

3# Sanat taori, replied 2017-12-07 09:20:50Z

The process is complicated. What you need to do is:

  1. Extract text from the image with lang=eng
  2. Pass that text to langdetect, a Python port of Google's language-detection library
  3. Run Tesseract again with the detected language to extract the text accurately

Or

you can try each language in turn and pass the sample text to langdetect to get the probability that the language is correct.
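The second option, trying each language in turn, might look roughly like the sketch below. It assumes pytesseract, Pillow and langdetect are installed along with the traineddata for every candidate; `score_candidates`, `best_candidate` and the `candidates` mapping (Tesseract language name to the langdetect code you expect for it) are hypothetical names introduced for illustration.

```python
def best_candidate(scores):
    """Pick the language with the highest score from {language: probability}."""
    if not scores:
        return None
    return max(scores, key=scores.get)


def score_candidates(image_path, candidates):
    """OCR the image once per candidate and score how plausible the text looks.

    candidates maps a Tesseract language (e.g. 'fra') to the langdetect
    code expected for it (e.g. 'fr'); this mapping is an assumption.
    """
    # Imported here so best_candidate works without the OCR stack installed.
    from PIL import Image
    import pytesseract
    from langdetect import detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    image = Image.open(image_path)
    scores = {}
    for tess_lang, iso_code in candidates.items():
        text = pytesseract.image_to_string(image, lang=tess_lang)
        try:
            detected = detect_langs(text)
        except LangDetectException:
            # Raised when the OCR output has no usable features (e.g. empty).
            scores[tess_lang] = 0.0
            continue
        # Probability langdetect assigns to the language we OCR'd with.
        scores[tess_lang] = next(
            (d.prob for d in detected if d.lang == iso_code), 0.0
        )
    return scores
```

For example, `best_candidate(score_candidates('image.jpg', {'eng': 'en', 'fra': 'fr'}))` would return whichever of the two produced the more convincing text. This runs one OCR pass per candidate, so it gets slow with many languages.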

from PIL import Image
import pytesseract
from langdetect import detect_langs

# Include the line below only if the tesseract executable is not on your PATH.
# Example tesseract_cmd: 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'

# OCR with the default language, then explicitly with English
print(pytesseract.image_to_string(Image.open('test.png')))
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='eng'))

# First pass in English to get sample text for language detection
sample_text = pytesseract.image_to_string(Image.open('image.jpg'), lang='eng')

# Prints the candidate languages with their probabilities
print(detect_langs(sample_text))
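Step 3, feeding the detected language back into Tesseract, could be sketched as follows. langdetect reports ISO 639-1 codes such as `en`, while Tesseract expects names such as `eng`, so a mapping is needed; the partial `ISO_TO_TESSERACT` table and both function names are assumptions made for this example. Note that an English first pass is only likely to give usable text for Latin-script languages.

```python
# Hypothetical, partial mapping from langdetect's ISO 639-1 codes to
# Tesseract's traineddata names; extend it for the languages you install.
ISO_TO_TESSERACT = {"en": "eng", "fr": "fra", "de": "deu", "es": "spa", "hi": "hin"}


def to_tesseract_lang(iso_code, default="eng"):
    """Map a langdetect code to a Tesseract language, falling back to default."""
    return ISO_TO_TESSERACT.get(iso_code, default)


def ocr_with_detected_language(image_path, first_pass_lang="eng"):
    """Two-pass OCR: rough pass, detect language, then re-run in that language."""
    # Imported here so the mapping helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract
    from langdetect import detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    image = Image.open(image_path)
    rough_text = pytesseract.image_to_string(image, lang=first_pass_lang)
    try:
        best = detect_langs(rough_text)[0]  # candidates are sorted by probability
    except LangDetectException:
        # Raised when the rough text has no usable features (e.g. it is empty).
        return first_pass_lang, rough_text

    lang = to_tesseract_lang(best.lang, default=first_pass_lang)
    if lang == first_pass_lang:
        return lang, rough_text  # the first pass already used this language
    return lang, pytesseract.image_to_string(image, lang=lang)
```

Calling `ocr_with_detected_language('image.jpg')` returns the Tesseract language that was finally used together with the extracted text.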