I’m currently working on a project to make 500k+ pages searchable through a web application. A search should find the matching pages and return their blob data.

So far I have Tesseract working for OCR and ImageMagick for deskewing and despeckling. However, I don’t know what the best approach would be to make the text I get back searchable. My thoughts are either to convert the text to binary and use MSSQL full-text indexing, or to index the extracted text with Elasticsearch.
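To make the question more concrete, here is a rough sketch of the basic idea both options build on: an inverted index that maps each word to the pages containing it. It is a toy in plain Python, not how MSSQL or Elasticsearch actually implement it. The names `tokenize`, `build_index`, `search`, and the sample `pages` data are all hypothetical and only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Start from the pages matching the first word, then intersect with the rest.
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


# Hypothetical OCR output keyed by page id (assumed data, not real results).
pages = {
    1: "Invoice for office supplies",
    2: "Office lease agreement",
    3: "Supplies delivery note",
}
index = build_index(pages)
```

The page ids returned by `search` would then be used to look up the stored blobs. A real engine would presumably add stemming, relevance ranking, and some tolerance for OCR errors on top of this.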

Any thoughts?

