
# Automatic Language detection from Images for OCR character Extraction

CyborgSuraj
1#
CyborgSuraj Published in 2017-12-07 06:20:36Z
 I am building a Python application to which an image is uploaded. The application extracts the text using Tesseract OCR, but I want it to detect the languages in the image automatically and extract the text accordingly. Please suggest some ways to do that. I am willing to use machine learning as well, but I can't work out a suitable pipeline for the process. Thanks in advance.
Sanat taori
2#
Sanat taori Reply to 2017-12-07 06:44:25Z
 Tesseract has script detection within its "OSD" (orientation and script detection) mode, but not language detection. It cannot detect the language automatically; you have to specify it.
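 To illustrate the OSD point, here is a minimal sketch of reading the detected script through pytesseract's `image_to_osd`. The helper names `parse_osd` and `detect_script` are hypothetical, and the "Key: value" layout of the OSD report is an assumption based on typical Tesseract output, so check it against your installed version. It also assumes the `osd` traineddata file is installed.

```python
def parse_osd(osd_text):
    """Turn Tesseract's plain-text OSD report into a dict of strings.

    Assumes one "Key: value" pair per line, e.g. "Script: Latin".
    """
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def detect_script(image_path):
    """Return (script_name, confidence) for an image file.

    Imports are local so parse_osd stays usable without Tesseract installed.
    """
    from PIL import Image
    import pytesseract

    osd_text = pytesseract.image_to_osd(Image.open(image_path))
    info = parse_osd(osd_text)
    return info.get("Script"), float(info.get("Script confidence", 0.0))
```

 The script (Latin, Cyrillic, Devanagari and so on) only narrows down the candidate languages; it does not tell you which language the text is in.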
Sanat taori
3#
Sanat taori Reply to 2017-12-07 09:20:50Z
 The process is complicated. What you need to do is:

 1. Extract text from the image with lang=eng.
 2. Pass that text to langdetect, a Python port of Google's language-detection library.
 3. Run Tesseract again with the detected language to extract the text accurately.

 Or you can loop over every candidate language, OCR the image with each one, and pass the sample text to langdetect to get the probability that each language is the correct one.

    from PIL import Image
    import pytesseract
    from langdetect import detect_langs

    # Uncomment and fill in the line below if the tesseract executable is not in your PATH.
    # Example tesseract_cmd: 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
    # pytesseract.pytesseract.tesseract_cmd = ''

    print(pytesseract.image_to_string(Image.open('test.png')))
    print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='eng'))

    # First pass in English, then let langdetect rank the likely languages.
    sample_text = pytesseract.image_to_string(Image.open('image.jpg'), lang='eng')
    print(detect_langs(sample_text))
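 The two-pass idea described above (OCR in English, detect the language, OCR again) could be sketched as follows. One step the answer leaves implicit is that langdetect returns two-letter codes such as `fr`, while Tesseract expects its own traineddata names such as `fra`, so a mapping is needed. The table `LANGDETECT_TO_TESSERACT` and the functions `to_tesseract_lang`, `two_pass_ocr` and `run_with_tesseract` are hypothetical names for illustration, and the table covers only a few Latin-script languages. A first pass with `eng` is unlikely to give usable text for non-Latin scripts.

```python
# Hypothetical, partial mapping: langdetect code -> Tesseract traineddata name.
LANGDETECT_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "it": "ita",
    "pt": "por",
    "nl": "nld",
}


def to_tesseract_lang(code, default="eng"):
    """Map a langdetect code to a Tesseract language, falling back to default."""
    return LANGDETECT_TO_TESSERACT.get(code, default)


def two_pass_ocr(image, ocr, detect, first_pass_lang="eng"):
    """Run OCR once, detect the language of the result, then OCR again.

    ocr(image, lang) -> str and detect(text) -> str are passed in so the
    logic can be exercised without Tesseract or langdetect installed.
    Returns (tesseract_lang, text).
    """
    rough = ocr(image, first_pass_lang)
    if not rough.strip():
        # Nothing recognised, so there is nothing to run detection on.
        return first_pass_lang, rough
    lang = to_tesseract_lang(detect(rough), default=first_pass_lang)
    if lang == first_pass_lang:
        return lang, rough  # the first pass already used the right language
    return lang, ocr(image, lang)


def run_with_tesseract(image_path):
    """Wire two_pass_ocr to pytesseract and langdetect (imports kept local)."""
    from PIL import Image
    import pytesseract
    from langdetect import detect

    image = Image.open(image_path)
    return two_pass_ocr(
        image,
        lambda img, lang: pytesseract.image_to_string(img, lang=lang),
        detect,
    )
```

 Calling `run_with_tesseract('image.jpg')` would return the chosen Tesseract language and the second-pass text. The matching traineddata file has to be installed for each language in the table.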