ocr

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of …

OCR

This NEH-funded project develops modern optical character recognition (OCR) and post-correction tools tailored to Indigenous Latin American languages.

OCR Post-Correction for Endangered Language Texts

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In …