How to Redact Scanned PDFs: OCR-Powered Text Detection in Images

Last month a paralegal called me in a mild panic. She had 200 pages of discovery documents that needed redaction before going to opposing counsel. Standard process. She opened the PDF in Acrobat, tried to select the first name she wanted to redact, and nothing happened.

The document was scanned. Not a digital PDF. Just pictures of paper. Two hundred pages of pictures.

"It won't let me select anything," she said.

I asked how the documents were created. Turned out a client had photographed contracts on their phone, emailed the photos, and someone converted them to PDF. Every character was pixels. Not text. Pixels.

This is the scanned PDF problem, and if you work with documents, you've probably hit it. Tax returns from a scanner. Contracts signed and photographed. Medical records that came through a fax machine. Historical documents digitized years ago. Anything that passed through a copier, scanner, fax, or camera.

Standard redaction tools don't work on these documents. The PDF redaction feature is looking for text objects, and there are no text objects. There's one big image per page with a bunch of colored pixels that happen to form letters.

Redacting scanned PDFs requires OCR (Optical Character Recognition) to find the text first, then redaction that actually burns the changes into the image. Let me show you how this works.

The short version: if you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains how scanned PDF redaction works and where it fits in a document workflow.

Why Standard Redaction Fails on Scanned PDFs

The confusion comes from what a PDF actually contains.

Native PDFs store actual text data. Each character is a text object with a font, position, and encoding. When you drag your cursor to select text, you're selecting those objects. Redaction tools remove the objects and draw black boxes where they used to be.

Image PDFs store pictures of text. The whole page is a raster image, like a JPEG. What looks like the letter "A" is just a pattern of dark and light pixels. There's nothing for the selection cursor to grab.

Mixed PDFs have both. Some pages are native text. Others are scanned images. Some pages have native text with embedded images that also contain text (like a screenshot or photo pasted into a Word document that got exported to PDF).

When you try to redact a scanned PDF by searching for or selecting text in a tool like Adobe Acrobat:

The tool looks for text objects in the PDF structure
It finds nothing. The PDF contains one image per page.
You can draw a black box on top of the image with a shape or comment tool
But that adds a layer. It doesn't remove data.

The original pixels sit right underneath. Someone can delete the overlay, extract the image, or examine the PDF structure. The data isn't gone.

This is why so many "redacted" documents have failed. People draw black boxes thinking they've removed information. They've hidden it. That's different.

True scanned PDF redaction requires:

Converting images to text (OCR)
Finding the sensitive text and its location
Modifying the actual image (not just adding a layer on top)
Burning the redaction into the image permanently

How Many Documents Are Actually Scanned?

More than you'd think.

A 2022 survey of enterprise document processing found that 40 to 60 percent of incoming documents are image-based or mixed format. In some industries it's higher:

Legal: Discovery documents, signed contracts, court filings, faxed correspondence

Healthcare: Patient intake forms, insurance cards, referral letters, signed consents

Finance: Tax documents, signed agreements, bank statements, loan applications

HR: Employment applications, identification documents, signed policies

Real Estate: Signed contracts, property records, title documents, inspections

If your redaction workflow only handles native PDFs, you're missing a huge chunk of your document flow. And if you're trying to prepare documents for AI processing, those scanned documents are exactly the ones most likely to contain sensitive information (because they're signed forms and official records).

How OCR-Based Redaction Actually Works

Effective scanned PDF redaction follows a specific sequence:

Step 1: Figure Out What You're Dealing With

Before processing, analyze the document:

Document Analysis Result:
- Total pages: 12
- Native text pages: 7
- Image-only pages: 4
- Mixed pages: 1
- Embedded images with text: 3
- Image resolution: 300 DPI average
- Languages detected: English

This tells you what processing each page needs:

Native text pages: Standard redaction
Image pages: OCR plus image modification
Mixed pages: Both approaches
Embedded images: Extract, process, re-embed
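
Here's a rough sketch of that routing decision in Python. It assumes your PDF parser can report, for each page, how many native text characters it found, how many raster images sit on the page, and how much of the page those images cover. The PageStats class, the 0.9 coverage threshold, and the category and step names are all made up for illustration; they don't come from any particular library.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary from whatever PDF parser you use."""
    text_chars: int        # characters stored as native text objects
    image_count: int       # raster images placed on the page
    image_coverage: float  # fraction of page area covered by images, 0.0 to 1.0


# Processing steps per page type, mirroring the list above (names are illustrative).
ROUTES = {
    "native": ["standard_redaction"],
    "image": ["ocr", "image_modification"],
    "mixed": ["standard_redaction", "ocr", "image_modification"],
    "embedded": ["standard_redaction", "extract_images", "ocr",
                 "image_modification", "re_embed"],
    "empty": [],
}


def classify_page(stats: PageStats, full_page: float = 0.9) -> str:
    """Classify one page. The coverage threshold is a guess; tune it on real data."""
    has_text = stats.text_chars > 0
    has_images = stats.image_count > 0
    if has_text and has_images:
        # A near-full-page image plus text suggests a scan with native content;
        # smaller images are treated as pictures embedded in a native page.
        return "mixed" if stats.image_coverage >= full_page else "embedded"
    if has_images:
        return "image"
    if has_text:
        return "native"
    return "empty"


pages = [PageStats(1500, 0, 0.0), PageStats(0, 1, 1.0), PageStats(800, 2, 0.3)]
for number, stats in enumerate(pages, start=1):
    kind = classify_page(stats)
    print(number, kind, ROUTES[kind])
```

Any page that has an image at all gets routed through OCR here. That errs on the side of doing extra work, which is usually the right trade for redaction.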

Step 2: OCR Processing

Optical Character Recognition turns image pixels into text.

How OCR works:

Preprocessing: Straighten the image, reduce noise, improve contrast
Text detection: Find regions containing text
Character recognition: Convert pixel patterns to letters
Post-processing: Spell check, reconstruct layout

What OCR gives you:

Extracted text (the actual characters)
Bounding boxes (the X,Y coordinates where each word appears on the page)
Confidence scores (how sure the engine is about each recognition)

Modern OCR engines like Tesseract, Google Cloud Vision, and AWS Textract can reach 95 to 99 percent accuracy on clean, printed scans. Accuracy drops with poor scans, low resolution, unusual fonts, handwriting, and multi-column layouts.
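
To make those three outputs concrete, here's a minimal sketch of how a pipeline might hold them. The OcrWord shape, the 0 to 100 confidence scale, and the threshold of 80 are my assumptions. Each real engine returns its own format, though most expose the same three pieces.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class OcrWord:
    text: str          # recognized characters
    box: Box           # where the word sits on the page image
    confidence: float  # engine's confidence, assumed here to run 0 to 100


def split_by_confidence(words: List[OcrWord], threshold: float = 80.0):
    """Separate words the engine is sure about from ones worth a human look."""
    trusted = [w for w in words if w.confidence >= threshold]
    review = [w for w in words if w.confidence < threshold]
    return trusted, review


words = [
    OcrWord("John", (120, 340, 178, 365), 96.4),
    OcrWord("5mith", (184, 340, 250, 365), 58.1),  # likely a misread "Smith"
    OcrWord("Acme", (80, 420, 150, 445), 93.0),
]
trusted, review = split_by_confidence(words)
print([w.text for w in review])  # ['5mith']
```

Low-confidence words shouldn't be dropped. They're the ones most likely to hide PII that detection will miss, so they go to review instead.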

Step 3: Find the Sensitive Data

Once you have text with coordinates, standard PII detection applies:

Named Entity Recognition:

"John Smith" found at coordinates (120, 340, 250, 365)
"Acme Corporation" found at (80, 420, 280, 445)
"123 Main Street" found at (300, 500, 500, 525)

Pattern Matching:

SSN pattern found at (150, 600, 280, 625)
Phone pattern found at (320, 600, 450, 625)
Email pattern found at (100, 680, 350, 705)

The key difference from native PDF detection: you need coordinates, not just the text itself. Those coordinates tell you where to burn the redaction into the image.
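
Here's one way the text-to-coordinates mapping could look for pattern matching. It's a sketch under simple assumptions: words arrive in reading order as (text, box) pairs, a single space joins them, and the two regexes only cover one US-style format each. The function and variable names are mine.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Word = Tuple[str, Box]

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def find_boxes(words: List[Word], pattern: re.Pattern) -> List[Box]:
    """Run a regex over the joined OCR text and return one box per match."""
    spans = []  # (start, end, box): character span of each word in the joined text
    pos = 0
    for text, box in words:
        spans.append((pos, pos + len(text), box))
        pos += len(text) + 1  # +1 for the joining space
    joined = " ".join(text for text, _ in words)

    results = []
    for match in pattern.finditer(joined):
        # Every word whose span overlaps the match contributes its box.
        hit = [b for start, end, b in spans
               if start < match.end() and end > match.start()]
        if hit:
            results.append((min(b[0] for b in hit), min(b[1] for b in hit),
                            max(b[2] for b in hit), max(b[3] for b in hit)))
    return results


line = [
    ("SSN:", (100, 600, 144, 625)),
    ("123-45-6789", (150, 600, 280, 625)),
    ("(555)", (320, 600, 372, 625)),
    ("123-4567", (378, 600, 450, 625)),
]
print(find_boxes(line, SSN))    # [(150, 600, 280, 625)]
print(find_boxes(line, PHONE))  # [(320, 600, 450, 625)]
```

The phone number spans two OCR words, so its redaction box is the union of both word boxes. Matching on individual words would have missed it.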

Step 4: Modify the Image

This is where scanned PDF redaction differs fundamentally from native PDF redaction.

For native PDFs, you delete text objects. For scanned PDFs, you modify the image:

Pixel burn-in:

Load the page image into memory
Draw filled black rectangles at the bounding box coordinates (this overwrites the original pixel values)
Re-encode the modified image so the saved file contains only the overwritten pixels
Replace the page in the PDF

Image reconstruction:

Extract all text via OCR
Rebuild the image without sensitive regions
Replace the original image

Flattening:

Add redaction overlays to the image
Flatten all layers into a single image
Re-encode and replace

All approaches share one requirement: the original pixels must be destroyed, not just covered. When you're done, there should be no way to recover what was underneath.
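
To show what "destroyed, not covered" means at the pixel level, here's a toy burn-in on a grayscale raster stored as nested lists. This is an illustration only. A real pipeline would do the same thing with an imaging library, then re-encode the result and replace the image stream in the PDF. The burn_in name and the exclusive right/bottom convention are my choices.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def burn_in(pixels: List[List[int]], boxes: List[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of a grayscale raster with each box overwritten by `fill`."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [row[:] for row in pixels]
    for left, top, right, bottom in boxes:
        # Clip the box to the image so out-of-range coordinates can't raise.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                out[y][x] = fill  # the old value is gone, not hidden
    return out


# A tiny 6x4 "page": 255 is paper, 40 is ink.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  40,  40,  40, 255, 255],
    [255,  40, 255,  40, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
redacted = burn_in(page, [(1, 1, 4, 3)])
for row in redacted:
    print(row)
```

After this runs, nothing in the redacted raster records which pixels inside the box were ink and which were paper. An overlay can't give you that guarantee.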

Step 5: Generate Output

The final redacted PDF contains:

Modified images with burned-in redactions
Optionally, an invisible text layer for searchability
Metadata stripped of sensitive information
An audit manifest showing what was redacted

The Hard Parts (And How to Handle Them)

OCR Misreads

Problem: OCR makes mistakes, especially on poor quality scans.

"John Smith" might OCR as "J0hn 5mith" (zero instead of O, five instead of S).

If your PII detection uses exact pattern matching, it misses the mangled version.

Solutions:

Fuzzy matching for names (allow close matches, not just exact)
Multiple pattern variants for structured data
Lower the confidence threshold (catch more, review more)
Improve image quality before OCR
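
A small sketch of the fuzzy-matching idea, using Python's standard difflib. The confusion table and the 0.85 threshold are illustrative assumptions, not tuned values. Note that the digit-to-letter normalization only makes sense for names. You wouldn't apply it to SSNs or phone numbers, where the digits are the data.

```python
from difflib import SequenceMatcher

# Digit-for-letter swaps OCR commonly makes. Illustrative and incomplete.
OCR_CONFUSIONS = str.maketrans({"0": "o", "1": "l", "5": "s", "8": "b"})


def normalize(text: str) -> str:
    """Lowercase and undo common digit-for-letter misreads (names only)."""
    return text.lower().translate(OCR_CONFUSIONS)


def fuzzy_match(candidate: str, target: str, threshold: float = 0.85) -> bool:
    """True if the OCR'd candidate is close enough to the known target name."""
    ratio = SequenceMatcher(None, normalize(candidate), normalize(target)).ratio()
    return ratio >= threshold


print(fuzzy_match("J0hn 5mith", "John Smith"))  # True: identical after normalizing
print(fuzzy_match("Jonn Smith", "John Smith"))  # True: one character off
print(fuzzy_match("Acme Corp", "John Smith"))   # False
```

A lower threshold catches more mangled names and produces more false positives. For redaction, false positives cost a little review time, while false negatives leak data.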

Handwritten Content

Problem: Handwriting is much harder to OCR than print.

A form might have printed labels (OCR'd correctly) and handwritten values (partially recognized or missed entirely).

Solutions:

Use OCR engines with handwriting support
Redact entire form fields, not just specific text
Take a conservative approach: redact areas where handwriting appears
Flag documents with handwriting for human review

Multi-Column Layouts

Problem: Complex layouts confuse OCR engines.

Text from different columns gets merged or sequenced incorrectly. Pattern detection fails because the SSN digits ended up interleaved with text from the adjacent column.

Solutions:

Layout analysis before character recognition
Column detection with separate processing for each
Document-type-specific configurations
Region-based redaction (redact areas rather than specific text)
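
Column detection can be sketched with nothing more than the word boxes. The version below looks for a wide vertical band that no word touches and splits the page there. It's a simplified take on a projection-profile approach, and the 40-pixel min_gap is an assumption that depends on your scan resolution. Real layout analysis also has to handle headers, tables, and skewed pages.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Word = Tuple[str, Box]


def split_columns(words: List[Word], min_gap: int = 40) -> List[List[Word]]:
    """Group words into columns wherever a wide empty vertical band separates them."""
    if not words:
        return []
    # Merge the horizontal extents of all words into occupied bands.
    intervals = sorted((box[0], box[2]) for _, box in words)
    bands = [list(intervals[0])]
    for left, right in intervals[1:]:
        if left - bands[-1][1] >= min_gap:
            bands.append([left, right])       # wide gap: a new column starts
        else:
            bands[-1][1] = max(bands[-1][1], right)
    # Assign each word to the band containing its left edge.
    columns: List[List[Word]] = [[] for _ in bands]
    for word in words:
        left = word[1][0]
        for index, (start, end) in enumerate(bands):
            if start <= left <= end:
                columns[index].append(word)
                break
    # Reading order inside a column: top to bottom, then left to right.
    for column in columns:
        column.sort(key=lambda w: (w[1][1], w[1][0]))
    return columns


# Words in row-major order, the way a naive reader would sequence them.
words = [
    ("SSN:", (50, 100, 100, 120)), ("123-45-6789", (110, 100, 230, 120)),
    ("Invoice", (400, 100, 470, 120)), ("4471", (480, 100, 520, 120)),
    ("DOB:", (50, 130, 100, 150)), ("01/02/1980", (110, 130, 215, 150)),
    ("Due", (400, 130, 435, 150)), ("30d", (445, 130, 480, 150)),
]
for column in split_columns(words):
    print(" ".join(text for text, _ in column))
```

Once each column is processed on its own, pattern detection sees "SSN: 123-45-6789 DOB: 01/02/1980" as one run of text instead of having invoice details spliced into the middle.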

Mixed Native and Image Content

Problem: A single document has native text on some pages and scanned images on others. Or a single page has both.

Solutions:

Analyze each page individually
Detect layers on mixed pages
Use a unified pipeline that handles both types
Extract and separately process embedded images

Images Inside Images

Problem: A native PDF contains an embedded photo that itself shows text. Like a screenshot pasted into a document, or a photo of a whiteboard in meeting notes.

Standard redaction handles the native text. The text in the embedded photo gets missed.

Solutions:

Recursively extract embedded images
Process extracted images through the OCR pipeline
Replace embedded images with redacted versions
Flag documents with embedded images for review

Tool Options

Adobe Acrobat Pro DC

Acrobat can do this, but it takes multiple steps.

Process:

Open scanned PDF
Tools → Scan & OCR → Recognize Text
Wait for OCR
Tools → Redact → Mark for Redaction
Search for patterns or manually select
Apply redaction

Limitations:

OCR and redaction are completely separate
No automated PII detection (you search for patterns manually)
One document at a time
Limited handling of mixed content

Google Cloud Document AI

Enterprise-grade OCR with entity extraction.

Process:

Upload document to Document AI processor
Get structured data with entity annotations
Build redaction logic on the extracted entities
Apply redaction to source images

Limitations:

Cloud-only (your documents leave your environment)
Requires development work to build the redaction pipeline
Per-page cost for high volumes
No turnkey redacted PDF output

AWS Textract + Comprehend

Textract does OCR. Comprehend does entity detection.

Process:

Send document to Textract
Send extracted text to Comprehend for PII detection
Map PII back to Textract coordinates
Build redacted PDF using coordinates

Limitations:

Multi-service architecture requires integration
Cloud processing only
Building the image modification layer is your job
Complex for non-developers
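
One piece of that integration work is coordinate conversion. Textract reports each bounding box as Left, Top, Width, and Height values expressed as ratios of the page size, so they have to be scaled to pixels before you can modify the image. Here's a minimal sketch of that step. The to_pixel_box helper, the outward rounding, and the 2-pixel padding are my choices, and the sample dictionary is hand-written rather than a real API response.

```python
import math
from typing import Dict, Tuple


def to_pixel_box(bbox: Dict[str, float], page_width: int, page_height: int,
                 pad: int = 2) -> Tuple[int, int, int, int]:
    """Convert a ratio-based box to pixel corners, rounded outward and padded."""
    left = math.floor(bbox["Left"] * page_width) - pad
    top = math.floor(bbox["Top"] * page_height) - pad
    right = math.ceil((bbox["Left"] + bbox["Width"]) * page_width) + pad
    bottom = math.ceil((bbox["Top"] + bbox["Height"]) * page_height) + pad
    # Clamp so padding never pushes the box off the page.
    return (max(0, left), max(0, top),
            min(page_width, right), min(page_height, bottom))


word_box = {"Left": 0.25, "Top": 0.5, "Width": 0.125, "Height": 0.0625}
print(to_pixel_box(word_box, page_width=1600, page_height=2000))
# (398, 998, 602, 1127)
```

Rounding outward and padding by a couple of pixels means the black box slightly overshoots the word. That's deliberate: a box that clips the edge of a character can leave enough ink visible to guess what it was.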

PaperVeil

Unified OCR plus PII detection plus redaction.

Process:

Upload scanned PDF
Select PII types to detect (names, SSN, email, phone, address, DOB, credit cards)
Add custom patterns or terms
Execute redaction
Download redacted PDF with audit manifest

How it handles scanned PDFs:

Automatic detection of image vs. native content
Built-in OCR for image pages
PII detection on extracted text
Burned-in redaction on source images
Mixed content handling without separate steps

Best Practices

Scan Quality Matters

Better scans mean better OCR. Better OCR means more accurate redaction.

Optimal settings:

Resolution: 300 DPI minimum (600 DPI for small text)
Color: Grayscale or black-and-white for text documents
Format: PDF or TIFF (avoid JPEG compression artifacts)
Orientation: Straight, not skewed

If you control the scanning process, invest in quality upfront.
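
If you don't control the scanning, you can at least measure what you received. A PDF page size is given in points (72 per inch), so the effective resolution of a full-page image follows from simple arithmetic. A quick sketch, assuming the image fills the whole page; the function names are mine:

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_pt: float, page_height_pt: float) -> float:
    """Resolution of a full-page image, taking the weaker of the two axes."""
    dpi_x = pixel_width / (page_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (page_height_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)


def needs_rescan(dpi: float, minimum: float = 300.0) -> bool:
    """Flag pages that fall below the minimum resolution recommended above."""
    return dpi < minimum


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(effective_dpi(2550, 3300, 612, 792))  # 300.0
print(effective_dpi(1275, 1650, 612, 792))  # 150.0
```

A page that comes in at 150 DPI isn't unusable, but it's a good candidate for a lower confidence threshold and human review.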

Verify After Redaction

Automated redaction is highly accurate but not perfect. For sensitive documents:

Open the redacted PDF
Spot-check key sections
Verify black boxes appear over sensitive areas
Try to select or copy text in redacted areas (should fail)
Check document metadata for residual information
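
Part of that checklist can be automated. The sketch below assumes you've already re-extracted text from the redacted output, both from any text layer and from a fresh OCR pass over the page images, and simply checks it for anything that should be gone. The residual_hits name and the single SSN pattern are placeholders for whatever criteria you redacted with.

```python
import re
from typing import Iterable, List

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def residual_hits(extracted_text: str, patterns: Iterable[re.Pattern],
                  terms: Iterable[str] = ()) -> List[str]:
    """Return every pattern match or known term still present in the output text."""
    hits = []
    for pattern in patterns:
        hits.extend(match.group(0) for match in pattern.finditer(extracted_text))
    lowered = extracted_text.lower()
    hits.extend(term for term in terms if term.lower() in lowered)
    return hits


before = "Employee: John Smith SSN: 123-45-6789"
after = "Employee: SSN:"
print(residual_hits(before, [SSN], ["John Smith"]))  # ['123-45-6789', 'John Smith']
print(residual_hits(after, [SSN], ["John Smith"]))   # []
```

An empty result doesn't prove the redaction is complete, since OCR on the output can misread things too. A non-empty result does prove something was missed.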

Keep Your Originals

Store unredacted originals in a secure location:

Redaction is irreversible
You may need originals for legal or business purposes
Audit requirements may mandate original retention

Never redact your only copy.

Document Everything

For compliance, maintain records of:

What documents were redacted
What detection criteria were applied
What PII was found and removed
When processing occurred
Who initiated the redaction

Redaction manifests from tools like PaperVeil provide this automatically.
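
If you're building your own pipeline, a manifest can be a small JSON record. The sketch below is one possible shape, not PaperVeil's format. The field names are my invention. The one design point worth copying is that the manifest records the type and location of each redaction but never the redacted text itself.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def build_manifest(source_bytes, redactions, criteria, initiated_by, when=None):
    """Assemble an audit record that describes redactions without repeating the PII."""
    when = when or datetime.now(timezone.utc)
    return {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),  # which document
        "processed_at": when.isoformat(),                           # when
        "initiated_by": initiated_by,                               # who
        "criteria": sorted(criteria),                               # what was searched for
        "counts": dict(Counter(r["type"] for r in redactions)),     # what was found
        "redactions": [
            # Copy only page, type, and box; any matched text is deliberately dropped.
            {"page": r["page"], "type": r["type"], "box": list(r["box"])}
            for r in redactions
        ],
    }


found = [
    {"page": 1, "type": "NAME", "box": (120, 340, 250, 365), "text": "John Smith"},
    {"page": 1, "type": "SSN", "box": (150, 600, 280, 625), "text": "123-45-6789"},
]
manifest = build_manifest(b"%PDF-1.7 demo bytes", found,
                          criteria={"SSN", "NAME", "EMAIL"},
                          initiated_by="paralegal@example.com")
print(json.dumps(manifest, indent=2))
```

Hashing the source file ties the manifest to the exact original you stored, which helps when someone asks later which version was redacted.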

Test Before Production

Before processing sensitive documents:

Run test documents through your workflow
Verify OCR accuracy on your document types
Confirm PII detection catches relevant patterns
Check that redaction is truly permanent
Validate output format meets downstream requirements

Scanned PDF Redaction for AI Workflows

One of the most common use cases for scanned PDF redaction is preparing documents for AI processing:

Workflow Pattern

Scanned documents arrive (email, upload, integration)
       ↓
[Document Analysis]
Detect native vs. image content
       ↓
[OCR Processing] (for image content)
Extract text with coordinates
       ↓
[PII Detection]
Find sensitive data in extracted text
       ↓
[Image Modification]
Burn redactions into source images
       ↓
[Sanitized PDF Output]
Ready for AI processing
       ↓
[LLM Analysis]
Summarize, extract, classify

Automation Example

Trigger: New email with PDF attachment
       ↓
Action: Send to redaction API (handles OCR automatically)
       ↓
Action: Send redacted PDF to Claude API
       ↓
Action: Deliver AI analysis to user

The redaction layer handles all document types. No branching logic for scanned versus native.

The Bottom Line

Scanned PDFs are a significant portion of business documents. They require special handling for redaction because standard tools designed for native PDFs don't work on images.

Effective scanned PDF redaction requires OCR to extract text, PII detection on the extracted text with coordinates, image modification to permanently remove sensitive content, and mixed content handling for real-world documents.

The technology exists to handle this automatically. Tools like PaperVeil combine OCR, detection, and redaction in a single workflow, processing scanned, native, and mixed PDFs without separate steps for each type.

For organizations preparing documents for AI processing, automated scanned PDF redaction eliminates a significant barrier. Documents that couldn't be processed before (because they were "just images") become accessible while sensitive data stays protected.

The quality of your source scans determines the ceiling on redaction quality. But modern OCR handles most business documents effectively, and PII detection that tolerates OCR errors can still catch sensitive data when recognition isn't perfect.

Don't let scanned PDFs block your document workflows. The right tools make image-based content as processable as native text.

PaperVeil lets you redact all your sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. The redaction layer that makes AI document processing actually safe.