OCR Text Extractor
Extract text from any image — powered by Tesseract.js via CDN
Frequently Asked Questions
What is OCR and how does it work?
OCR (Optical Character Recognition) is a technology that reads the text in an image and converts it into editable, searchable text. pdfGens uses Tesseract.js, a WebAssembly port of the Tesseract engine, which runs entirely in your browser to extract text from scanned documents, screenshots, and photos.
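For readers curious what a browser-side OCR call looks like, here is a minimal sketch. It is not pdfGens's actual code. It assumes the Tesseract.js `recognize(image, lang)` call, which resolves to a result whose `data.text` holds the recognized text. The helper name `extractText` and its `tesseract` parameter are hypothetical, introduced only for illustration.

```javascript
// Minimal sketch: run OCR on a single image with Tesseract.js.
// `tesseract` is the Tesseract.js module object (the global `Tesseract`
// when the library is loaded from a CDN script tag).
// `extractText` is a hypothetical helper, not part of pdfGens or Tesseract.js.
async function extractText(tesseract, image, lang = 'eng') {
  // recognize() resolves to a result object; the recognized text
  // is available on result.data.text.
  const result = await tesseract.recognize(image, lang);
  // Trim the leading and trailing whitespace from the output.
  return result.data.text.trim();
}
```

On a page that loads Tesseract.js from a CDN, a call might look like `extractText(Tesseract, file, 'eng')`, where `file` is an image the user selected. Passing the library in as a parameter keeps the helper easy to try out with a stand-in object.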
Which languages does the OCR support?
pdfGens OCR supports English, Hindi, French, German, Spanish, Portuguese, Chinese (Simplified), Japanese, and Arabic. Before extracting, select the language of the text in your image from the dropdown.
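Tesseract identifies each language by a short code rather than its display name. The sketch below shows one way a dropdown value could be mapped to those codes. The codes are the standard Tesseract language identifiers, but the names `LANGUAGE_CODES` and `languageCode` are hypothetical and may not match how pdfGens implements this.

```javascript
// Display name -> Tesseract language code for the languages listed above.
// LANGUAGE_CODES and languageCode are illustrative names, not pdfGens APIs.
const LANGUAGE_CODES = new Map([
  ['English', 'eng'],
  ['Hindi', 'hin'],
  ['French', 'fra'],
  ['German', 'deu'],
  ['Spanish', 'spa'],
  ['Portuguese', 'por'],
  ['Chinese (Simplified)', 'chi_sim'],
  ['Japanese', 'jpn'],
  ['Arabic', 'ara'],
]);

// Look up the code for a dropdown selection; fail loudly on unknown names
// instead of silently falling back to another language.
function languageCode(name) {
  const code = LANGUAGE_CODES.get(name);
  if (code === undefined) {
    throw new Error(`Unsupported OCR language: ${name}`);
  }
  return code;
}
```

For example, `languageCode('Chinese (Simplified)')` returns `'chi_sim'`, which is the value that would be passed to the OCR engine.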
Why does OCR take time on first use?
The first time you run OCR, Tesseract.js downloads the data file for the selected language (about 5 MB). Your browser caches this file, so later runs in the same language start much faster.
Can I extract text from a scanned PDF?
Yes, in two steps. First convert your scanned PDF to images using our PDF to Images tool, then run OCR on each resulting image.