Multilanguage OCR
An OCR tool, developed on Intel Optimized Python, that automatically extracts text in multiple languages using Google's Tesseract library. The project lets users add their own handwriting sets or train models that were not previously available, so that text in new handwriting can be recognized.
Project status: Published/In Market
Artificial Intelligence, Graphics and Media
Groups: Student Developers for AI, Artificial Intelligence India
Intel Technologies: Intel Python
Overview / Usage
This OCR tool, built on top of Tesseract, can currently extract text in English, Hindi and Bengali with 70% accuracy. I plan to extend it to cover other Indian languages.
Methodology / Approach
First, the text found in an image is split into bounding boxes using OpenCV. Then, for each box, a CNN predicts the character it contains. A separate model is used for each language.
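The two-stage flow described above (segment into boxes, then classify each box with a language-specific model) might look roughly like the sketch below. It is not the project's code. To stay dependency-free, it swaps OpenCV's contour detection for a plain flood fill over a tiny binary grid, and swaps the CNNs for a placeholder function. The names find_boxes, crop, recognize, toy_model, IMAGE and MODELS are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)
Grid = List[List[int]]           # binary image: 1 = ink, 0 = background


def find_boxes(image: Grid) -> List[Box]:
    """Return one bounding box per 4-connected blob of ink pixels.

    Stand-in for the OpenCV segmentation step (assumption, not the
    project's actual routine).
    """
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] == 0 or seen[y0][x0]:
                continue
            # Breadth-first flood fill over the blob containing (x0, y0).
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    # Sort left to right so a single line of text comes out in reading order.
    return sorted(boxes, key=lambda b: b[0])


def crop(image: Grid, box: Box) -> Grid:
    """Cut the pixels inside a bounding box out of the image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def recognize(image: Grid, language: str,
              models: Dict[str, Callable[[Grid], str]]) -> str:
    """Pick the model for the language and classify every box with it."""
    model = models[language]
    return "".join(model(crop(image, box)) for box in find_boxes(image))


def toy_model(patch: Grid) -> str:
    """Placeholder for a trained CNN: a 1-pixel-wide patch is 'I', else 'O'."""
    return "I" if len(patch[0]) == 1 else "O"


# One model per language, as in the description; all entries are placeholders.
MODELS = {"en": toy_model}

IMAGE = [
    [0, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 1],
]

print(find_boxes(IMAGE))
print(recognize(IMAGE, "en", MODELS))
```

With OpenCV available, cv2.findContours and cv2.boundingRect would be the natural replacement for find_boxes, and each entry in MODELS would wrap a trained per-language CNN instead of the toy function.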
Technologies Used
Intel Optimized Python
OpenCV
Tesseract
