Improving the Foundation of Search: Advanced OCR
Litigation and investigations often involve voluminous documents of all formats. The best method for searching documents is through the text of the original native electronic documents. However, the original electronic versions of documents are not always available. Many records are scanned images or photographs, where current tools can struggle with recognising text, let alone making non-text images searchable.
Conventional Optical Character Recognition (OCR) has always had its limits. It struggles with complex layouts, low-quality scans, or documents that combine text and images. Handwriting and non-text elements are often ignored, dismissed as too hard or too expensive to process. The uncomfortable truth is that the searchability of scanned material has always been flawed.
An emerging approach uses generative Artificial Intelligence (gen-AI) to enhance content understanding. Perhaps this is more aptly described as Optical Content Recognition (aka “OCR 2.0”). The ability of these gen-AI solutions to “see” content significantly improves the searchability of scanned documents and images. This article explores the mechanisms that “OCR 2.0” uses to improve on conventional OCR.
Superior Text and Content Recognition
“OCR 2.0” dramatically improves upon legacy OCR in extracting text from challenging documents. The technology has the ability to decipher text from low-quality scans, faded copies, or unconventional layouts, something that is not achievable with older OCR software. In fact, these models can often detect text that is difficult for a human to read from a scanned image. The implications are significant. Scanned image quality has less impact on the text conversion process, opening the door for the use of scanning services beyond specialised hardcopy litigation support bureaus. This provides more opportunity to quickly get scanned documents into the hands of lawyers.
Handwriting Recognition
Another leap in “OCR 2.0” is the ability to recognise and transcribe handwritten annotations and text. Traditional OCR has always struggled with handwritten notes, especially in legal contexts where margin notes, interview records, or hastily scribbled annotations can hold vital information. New AI-powered, multimodal models can now match, and sometimes exceed, the accuracy of specialised handwriting recognition tools. For example, OpenAI’s GPT-4 (vision mode) now ranks among the top performers in this field, reading cursive or otherwise hard-to-read handwriting that previously required slow, manual review.[1]
Image Analysis and Searchability
Not all critical evidence is text-based – photographs, diagrams, or schematics are often used in business. Historically, such images were hard to search or analyse beyond a manual glance or a machine tag/label. AI image recognition is changing that by describing the image in text. In essence, if a picture is worth a thousand words, generative AI can capture those words and make them searchable within the document.
Consider a litigation database filled with emails and their attachments, some of which are photos (e.g. a site inspection image, a whiteboard snapshot, or a product image). AI can scan those images and label them with content descriptors (e.g. “contains building blueprints,” “person lifting heavy object,” or “screenshot of application interface”). Each description is inserted into the extracted text at the position where the image appears, so searching for words in the description will identify the document containing the image. Machine learning and vector indexes can also help find words and documents with related content.
Image descriptions can be powerful in cases with large multimedia collections where media review is often manual and constrained. They are equally powerful where images are integrated with text-based documents. Consider the common practice of sending photos and screenshots via email or text message. Image descriptions enable AI search and analysis to operate seamlessly across both image and text content.
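As a rough illustration of how image descriptions can sit alongside extracted text and become keyword-searchable, the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the sample documents, the `[Image description: ...]` marker, and the function names `flatten`, `build_index`, and `search` are hypothetical, and the descriptions are presumed to have been produced earlier by an image-captioning model. It is not a description of any particular product.

```python
import re
from collections import defaultdict

# Hypothetical extracted documents: each is an ordered list of parts.
# ("text", ...) parts come from ordinary text extraction; ("image", ...) parts
# hold a description assumed to have been generated by an AI model.
DOCUMENTS = {
    "email-001": [
        ("text", "Hi team, photo from this morning's site inspection attached."),
        ("image", "person lifting heavy object next to scaffolding"),
        ("text", "Please review before Friday."),
    ],
    "email-002": [
        ("text", "Latest build attached for sign-off."),
        ("image", "screenshot of application interface showing a login error"),
    ],
}


def flatten(parts):
    """Join parts in reading order, marking descriptions so that reviewers
    can tell generated text apart from the original document text."""
    lines = []
    for kind, content in parts:
        if kind == "image":
            lines.append(f"[Image description: {content}]")
        else:
            lines.append(content)
    return "\n".join(lines)


def build_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, parts in documents.items():
        for word in re.findall(r"[a-z0-9]+", flatten(parts).lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


index = build_index(DOCUMENTS)
print(search(index, "scaffolding"))  # {'email-001'}
print(search(index, "login error"))  # {'email-002'}
```

Neither email mentions scaffolding or a login error in its body text; both hits come from the image descriptions. Marking the descriptions explicitly is a design choice in this sketch, intended to keep generated content distinguishable from the original evidence.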
Preserving Text Structure
“OCR 2.0” can also tackle complex layouts, such as multi-column pages, tables, and mathematical notations, where traditional OCR provided a cursory identification of characters and inaccurate results. The scanned output can better represent the structure of documents, not just the plain text. For instance, “OCR 2.0” can preserve the text hierarchy (headings, paragraphs, lists, and tables). This structured extraction makes the text more readable and preserves context.
This improved output is not only helpful for humans and Boolean search. The improvement aids generative AI's understanding of the document. For example, document headings are clear, table cells are preserved, and formulas and equations are captured properly. This improved understanding can significantly benefit downstream AI tasks. For instance, a summarisation algorithm can recognise headings and return a structured summary (one paragraph per section), or an AI analysis tool can identify the headings on a page to indicate the title of a document.
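To make the benefit of preserved structure concrete, here is a minimal Python sketch of how downstream tooling might use headings. It assumes, purely for illustration, that the structured output is Markdown-style text with `#` headings; the sample report and the function names `split_sections` and `document_title` are hypothetical.

```python
import re

# Hypothetical structured output; Markdown-style headings are an assumption.
OCR_OUTPUT = """# Site Inspection Report
Prepared after the March visit.

## Findings
Cracks were observed in the north wall.
Water damage was found near the loading dock.

## Recommendations
Engage a structural engineer.
"""

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(text):
    """Return a list of (level, heading, body) tuples in document order."""
    sections = []
    level, heading, body = 0, None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            sections.append((level, heading, "\n".join(body).strip()))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, heading, body = len(match.group(1)), match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def document_title(sections):
    """Use the first top-level heading as the document title, if present."""
    for level, heading, _ in sections:
        if level == 1:
            return heading
    return None


sections = split_sections(OCR_OUTPUT)
print(document_title(sections))  # Site Inspection Report
for level, heading, body in sections:
    print(level, heading, "->", len(body.split()), "words")
```

Each `(level, heading, body)` tuple could then be passed to a summarisation step one section at a time, which is one way to obtain the one-paragraph-per-section summary mentioned above.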
Efficiency Gains and Improved Outcomes
The cumulative effect of these advancements is a much more comprehensive understanding of the documents, which supports search and automated analysis. Key outcomes include:
- Increased Searchability & Defensibility: Image descriptions, combined with keywords, machine learning, or generative AI, may better demonstrate that an attempt was made to identify image content along with text-based content.
- Reduced Dependence on Document Scanning Quality: Document scanning is an intensive manual process that is slow and cumbersome. A reduction in scanning quality doesn’t necessarily reduce the ability to understand and search scanned content. OCR 2.0 means more flexible, faster document intake, without relying solely on specialist scanning bureaus.
- Improved Search and AI Analytical Quality: The IT adage “Garbage In, Garbage Out” describes the downstream consequences of using poor data for any analytic task. Enhanced data extracted from documents provides the best opportunity for accurate searching and for understanding data relationships.
- Improved AI Analytical Abilities: Enhanced text structure enables text-based AI to be considered for more tasks. Text-based generative AI is generally more cost-effective and has lower latency than image-based processing. The high-quality text, image descriptions and formatting allow for more complex document analysis and interrogation.
Conclusion: Reinforcing the eDiscovery Foundation
From scanning warehouse boxes of paper to analysing digital images by the millions, information search and retrieval exercises, like discovery, have always had to work within the limits of converting documents into searchable text. OCR 2.0 represents the next step, turning more of the available pixels of evidence – text or image – into usable, searchable information. Better text extraction (even for poor scans, complex layouts, and handwriting) means most document content can be considered and subjected to text retrieval.
Litigators and investigators who harness these tools may gain a competitive edge. Imagine finding that one hand-annotated memo in a foreign language that cracks a case, or quickly aggregating all photos showing a product defect to strengthen an expert report – these scenarios are now within reach.
The goal of dispute resolution is to get to the truth and resolve the issues in dispute in an efficient, defensible way. By embracing these advancements, legal teams can focus on advocacy, more confident that their search strategy hasn’t missed evidence in whatever medium it was captured: typed, handwritten, or photographed.
[1] Handwriting Recognition Benchmark: LLMs vs OCRs in 2025, Cem Dilmegani, Jan 2025, https://research.aimultiple.com/handwriting-recognition/
Ben Kennedy
Managing Director