How to Redact Scanned PDFs: OCR-Powered Text Detection in Images

Last month a paralegal called me in a mild panic. She had 200 pages of discovery documents that needed redaction before going to opposing counsel. Standard process. She opened the PDF in Acrobat, tried to select the first name she wanted to redact, and nothing happened.

The document was scanned. Not a digital PDF. Just pictures of paper. Two hundred pages of pictures.

"It won't let me select anything," she said.

I asked how the documents were created. Turned out a client had photographed contracts on their phone, emailed the photos, and someone converted them to PDF. Every character was pixels. Not text. Pixels.

This is the scanned PDF problem, and if you work with documents, you've probably hit it. Tax returns from a scanner. Contracts signed and photographed. Medical records that came through a fax machine. Historical documents digitized years ago. Anything that passed through a copier, scanner, fax, or camera.

Standard redaction tools don't work on these documents. The PDF redaction feature is looking for text objects, and there are no text objects. There's one big image per page with a bunch of colored pixels that happen to form letters.

Redacting scanned PDFs requires OCR (Optical Character Recognition) to find the text first, then redaction that actually burns the changes into the image. Let me show you how this works.

The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains how scanned PDF redaction works and where PaperVeil fits among the available tools.

Why Standard Redaction Fails on Scanned PDFs

The confusion comes from what a PDF actually contains.

Native PDFs store actual text data. Each character is a text object with a font, position, and encoding. When you drag your cursor to select text, you're selecting those objects. Redaction tools remove the objects and draw black boxes where they used to be.

Image PDFs store pictures of text. The whole page is a raster image, like a JPEG. What looks like the letter "A" is just a pattern of dark and light pixels. There's nothing for the selection cursor to grab.

Mixed PDFs have both. Some pages are native text. Others are scanned images. Some pages have native text with embedded images that also contain text (like a screenshot or photo pasted into a Word document that got exported to PDF).

When you try to redact a scanned PDF by searching for text in a tool like Adobe Acrobat, or by drawing a shape over the page:

  1. The text search looks for text objects in the PDF structure
  2. It finds nothing. The PDF contains one image per page.
  3. You can still draw a black box on top of the image with a shape or annotation tool
  4. But that adds a layer on top. It removes no data.

The original pixels sit right underneath. Someone can delete the overlay, extract the image, or examine the PDF structure. The data isn't gone.

This is why so many "redacted" documents have failed. People draw black boxes thinking they've removed information. They've hidden it. That's different.

True scanned PDF redaction requires:

  1. Converting images to text (OCR)
  2. Finding the sensitive text and its location
  3. Modifying the actual image (not just adding a layer on top)
  4. Burning the redaction into the image permanently

How Many Documents Are Actually Scanned?

More than you'd think.

A 2022 survey of enterprise document processing found that 40 to 60 percent of incoming documents are image-based or mixed format. In some industries it's higher:

Legal: Discovery documents, signed contracts, court filings, faxed correspondence

Healthcare: Patient intake forms, insurance cards, referral letters, signed consents

Finance: Tax documents, signed agreements, bank statements, loan applications

HR: Employment applications, identification documents, signed policies

Real Estate: Signed contracts, property records, title documents, inspections

If your redaction workflow only handles native PDFs, you're missing a huge chunk of your document flow. And if you're trying to prepare documents for AI processing, those scanned documents are exactly the ones most likely to contain sensitive information (because they're signed forms and official records).

How OCR-Based Redaction Actually Works

Effective scanned PDF redaction follows a specific sequence:

Step 1: Figure Out What You're Dealing With

Before processing, analyze the document:

Document Analysis Result:
- Total pages: 12
- Native text pages: 7
- Image-only pages: 4
- Mixed pages: 1
- Embedded images with text: 3
- Image resolution: 300 DPI average
- Languages detected: English

This tells you what processing each page needs:

  • Native text pages: Standard redaction
  • Image pages: OCR plus image modification
  • Mixed pages: Both approaches
  • Embedded images: Extract, process, re-embed
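
The routing above can be sketched in a few lines. This is a minimal illustration, not any product's actual logic: PageInfo, classify_page, plan, the step names, and both thresholds are hypothetical, and a real analysis pass would read the text and image statistics from the PDF itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageInfo:
    """Hypothetical per-page summary produced by a document analysis pass."""
    number: int
    text_chars: int        # characters stored as real text objects
    image_coverage: float  # fraction of the page area covered by raster images


# Processing steps per page type, mirroring the list above (names are illustrative).
ROUTES: Dict[str, List[str]] = {
    "native": ["standard_redaction"],
    "image": ["ocr", "image_burn_in"],
    "mixed": ["standard_redaction", "ocr", "image_burn_in"],
}


def classify_page(page: PageInfo, min_chars: int = 20, min_coverage: float = 0.5) -> str:
    """Return 'native', 'image', or 'mixed' using simple, assumed thresholds."""
    has_text = page.text_chars >= min_chars
    has_big_image = page.image_coverage >= min_coverage
    if has_text and has_big_image:
        return "mixed"
    if has_big_image:
        return "image"
    return "native"


def plan(pages: List[PageInfo]) -> Dict[int, List[str]]:
    """Map each page number to the processing steps it needs."""
    return {p.number: ROUTES[classify_page(p)] for p in pages}
```

A scanned page that already carries a hidden OCR text layer would come out as "mixed" here, which is the safe direction: both the text layer and the pixels get processed.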

Step 2: OCR Processing

Optical Character Recognition turns image pixels into text.

How OCR works:

  1. Preprocessing: Straighten the image, reduce noise, improve contrast
  2. Text detection: Find regions containing text
  3. Character recognition: Convert pixel patterns to letters
  4. Post-processing: Spell check, reconstruct layout

What OCR gives you:

  • Extracted text (the actual characters)
  • Bounding boxes (the X,Y coordinates where each word appears on the page)
  • Confidence scores (how sure the engine is about each recognition)
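
To make those three outputs concrete, here is one way to hold them in code. The OcrWord type, the 0-to-1 confidence scale, and the 0.80 threshold are assumptions for illustration; each OCR engine has its own response format and scale.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class OcrWord:
    """One recognized word, in a hypothetical engine-neutral shape."""
    text: str
    box: Box
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)


def split_by_confidence(words: List[OcrWord], threshold: float = 0.80):
    """Separate words the engine is sure about from words worth a second look."""
    trusted = [w for w in words if w.confidence >= threshold]
    uncertain = [w for w in words if w.confidence < threshold]
    return trusted, uncertain
```

The uncertain list is useful later: low-confidence words are where misreads hide, so they are good candidates for fuzzy matching or human review.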

Modern OCR engines such as Tesseract, Google Cloud Vision, and Amazon Textract can reach 95 to 99 percent accuracy on clean scanned documents. Accuracy drops with poor scans, low resolution, unusual fonts, handwriting, and multi-column layouts.

Step 3: Find the Sensitive Data

Once you have text with coordinates, standard PII detection applies:

Named Entity Recognition:

  • "John Smith" found at coordinates (120, 340, 250, 365)
  • "Acme Corporation" found at (80, 420, 280, 445)
  • "123 Main Street" found at (300, 500, 500, 525)

Pattern Matching:

  • SSN pattern found at (150, 600, 280, 625)
  • Phone pattern found at (320, 600, 450, 625)
  • Email pattern found at (100, 680, 350, 705)

The key difference from native PDF detection: you need coordinates, not just the text itself. Those coordinates tell you where to burn the redaction into the image.
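
A minimal sketch of the pattern-matching half, assuming the OCR words for one line of text arrive as (text, box) pairs. The regular expressions are deliberately simple and the function names are made up for this example. The point is the bookkeeping: remember where each word sits in the joined string so that a regex match can be traced back to pixel coordinates.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Word = Tuple[str, Box]

# Simplified patterns for illustration; production patterns need more variants.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def union(boxes: List[Box]) -> Box:
    """Smallest rectangle that covers every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def find_pii(words: List[Word]) -> List[Tuple[str, Box]]:
    """Match patterns on one line of OCR words and return (label, box) hits."""
    spans, parts, pos = [], [], 0
    for text, box in words:
        spans.append((pos, pos + len(text), box))  # character range of this word
        parts.append(text)
        pos += len(text) + 1                       # +1 for the joining space
    line = " ".join(parts)

    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(line):
            # Every word that overlaps the match contributes its whole box.
            touched = [b for s, e, b in spans if s < match.end() and e > match.start()]
            hits.append((label, union(touched)))
    return hits
```

Two choices worth noting. A match that covers only part of a word still redacts the whole word box, which errs toward over-redaction. And the union only makes sense within a single line; a match that wraps across lines would need one box per line instead.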

Step 4: Modify the Image

This is where scanned PDF redaction differs fundamentally from native PDF redaction.

For native PDFs, you delete text objects. For scanned PDFs, you modify the image:

Pixel burn-in:

  1. Load the page image into memory
  2. Draw filled black rectangles at the bounding box coordinates
  3. Re-encode the image (the covered pixels have been overwritten, so the saved file no longer contains them)
  4. Replace the page in the PDF
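
Steps 2 and 3 of the pixel burn-in come down to overwriting values in the pixel array. The sketch below uses a plain list of rows as a stand-in for a grayscale page so that it runs without dependencies; in practice an imaging library would do the drawing and the re-encoding. The burn_in name and the padding default are assumptions.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # rows of grayscale values, 0 = black, 255 = white


def burn_in(image: Image, boxes: List[Box], padding: int = 2) -> Image:
    """Overwrite every pixel inside each (padded) box with black, in place."""
    height = len(image)
    width = len(image[0]) if height else 0
    for left, top, right, bottom in boxes:
        # Pad the box a little so descenders and blurry edges are covered too,
        # then clamp to the image so boxes near the border do not overflow.
        x0, y0 = max(0, left - padding), max(0, top - padding)
        x1, y1 = min(width, right + padding), min(height, bottom + padding)
        for y in range(y0, y1):
            for x in range(x0, x1):
                image[y][x] = 0
    return image
```

After this runs there is no second layer to peel away: the old pixel values are gone from the array, and whatever gets saved from it cannot contain them.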

Image reconstruction:

  1. Extract all text via OCR
  2. Rebuild the image without sensitive regions
  3. Replace the original image

Flattening:

  1. Add redaction overlays to the image
  2. Flatten all layers into a single image
  3. Re-encode and replace

All approaches share one requirement: the original pixels must be destroyed, not just covered. When you're done, there should be no way to recover what was underneath.

Step 5: Generate Output

The final redacted PDF contains:

  • Modified images with burned-in redactions
  • Optionally, an invisible text layer for searchability (rebuilt from the OCR output with the redacted words left out)
  • Metadata stripped of sensitive information
  • An audit manifest showing what was redacted
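
As a rough idea of what an audit manifest might contain, here is a sketch built with the standard library. The field names and the build_manifest function are hypothetical, not PaperVeil's format. The one design point worth copying is that the manifest records the type and location of each redaction, never the text that was removed.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_manifest(source: bytes, output: bytes, redactions: List[Dict], operator: str) -> Dict:
    """Describe a redaction run without repeating any of the removed content."""
    return {
        "source_sha256": hashlib.sha256(source).hexdigest(),
        "output_sha256": hashlib.sha256(output).hexdigest(),
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "redaction_count": len(redactions),
        # Copy only page, type, and box; any matched text is deliberately dropped.
        "redactions": [
            {"page": r["page"], "type": r["type"], "box": list(r["box"])}
            for r in redactions
        ],
    }


example = build_manifest(
    b"original pdf bytes",
    b"redacted pdf bytes",
    [{"page": 3, "type": "SSN", "box": (150, 600, 280, 625), "text": "123-45-6789"}],
    operator="paralegal@example.com",
)
manifest_json = json.dumps(example, indent=2)
```

The two hashes tie the manifest to one specific input and one specific output, which is what makes it useful in an audit.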

The Hard Parts (And How to Handle Them)

OCR Misreads

Problem: OCR makes mistakes, especially on poor quality scans.

"John Smith" might OCR as "J0hn 5mith" (zero instead of O, five instead of S).

If your PII detection uses exact pattern matching, it misses the mangled version.

Solutions:

  • Fuzzy matching for names (allow close matches, not just exact)
  • Multiple pattern variants for structured data
  • Lower the confidence threshold (catch more, review more)
  • Improve image quality before OCR
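
Fuzzy matching for names can be approximated with Python's standard difflib. This sketch first undoes a few common digit-for-letter confusions and then compares the strings by similarity ratio. The confusion table and the 0.85 threshold are assumptions to tune against your own documents.

```python
from difflib import SequenceMatcher

# Characters OCR commonly swaps for letters (an assumed, incomplete table).
CONFUSIONS = str.maketrans({"0": "o", "1": "l", "5": "s", "8": "b", "|": "l"})


def normalize(text: str) -> str:
    """Lowercase and map look-alike characters back to letters."""
    return text.lower().translate(CONFUSIONS)


def similar(ocr_text: str, known_name: str, threshold: float = 0.85) -> bool:
    """True when the OCR text is close enough to a name we need to redact."""
    ratio = SequenceMatcher(None, normalize(ocr_text), normalize(known_name)).ratio()
    return ratio >= threshold
```

With these settings, similar("J0hn 5mith", "John Smith") is true because the normalized strings are identical. Apply this kind of normalization to name-like fields only: turning digits into letters would wreck SSNs and phone numbers, which need their own pattern variants.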

Handwritten Content

Problem: Handwriting is much harder to OCR than print.

A form might have printed labels (OCR'd correctly) and handwritten values (partially recognized or missed entirely).

Solutions:

  • Use OCR engines with handwriting support
  • Redact entire form fields, not just specific text
  • Take a conservative approach: redact areas where handwriting appears
  • Flag documents with handwriting for human review

Multi-Column Layouts

Problem: Complex layouts confuse OCR engines.

Text from different columns gets merged or sequenced incorrectly. Pattern detection fails because the SSN digits ended up interleaved with text from the adjacent column.

Solutions:

  • Layout analysis before character recognition
  • Column detection with separate processing for each
  • Document-type-specific configurations
  • Region-based redaction (redact areas rather than specific text)
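
Column detection can start very simply: project every word onto the horizontal axis and treat any wide empty gap as a gutter. The sketch below assumes words arrive as (text, box) pairs and that the gutter is wider than the hypothetical min_gap; real layout analysis handles headers that span columns, tables, and skewed pages, which this does not.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Word = Tuple[str, Box]


def split_columns(words: List[Word], min_gap: int = 40) -> List[List[Word]]:
    """Group words into columns wherever a horizontal gap exceeds min_gap pixels."""
    ordered = sorted(words, key=lambda w: w[1][0])  # sort by left edge
    columns: List[List[Word]] = []
    current_right = None  # right edge of the column being built
    for word in ordered:
        left, _, right, _ = word[1]
        if current_right is None or left - current_right > min_gap:
            columns.append([])  # wide gap: start a new column
            current_right = right
        else:
            current_right = max(current_right, right)
        columns[-1].append(word)
    # Inside each column, restore reading order: top to bottom, then left to right.
    return [sorted(col, key=lambda w: (w[1][1], w[1][0])) for col in columns]


def column_text(column: List[Word]) -> str:
    return " ".join(text for text, _ in column)
```

Running pattern detection on each column's text separately keeps an SSN in the left column from being interleaved with words from the right one.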

Mixed Native and Image Content

Problem: A single document has native text on some pages and scanned images on others. Or a single page has both.

Solutions:

  • Analyze each page individually
  • Detect layers on mixed pages
  • Use a unified pipeline that handles both types
  • Extract and separately process embedded images

Images Inside Images

Problem: A native PDF contains an embedded photo that itself shows text. Like a screenshot pasted into a document, or a photo of a whiteboard in meeting notes.

Standard redaction handles the native text. The text in the embedded photo gets missed.

Solutions:

  • Recursively extract embedded images
  • Process extracted images through the OCR pipeline
  • Replace embedded images with redacted versions
  • Flag documents with embedded images for review
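
Recursive extraction is a tree walk. The sketch below models a page as nested dictionaries with a "kind" and optional "children", which is a made-up structure for illustration; a real implementation would walk the PDF's own object tree through a PDF library. The idea carries over: descend into every container and report each image along with the path needed to replace it later.

```python
from typing import Dict, Iterator, Tuple

Node = Dict[str, object]
Path = Tuple[int, ...]


def iter_images(node: Node, path: Path = ()) -> Iterator[Tuple[Path, Node]]:
    """Yield (path, image_node) for every image, however deeply it is nested."""
    if node.get("kind") == "image":
        yield path, node
    for index, child in enumerate(node.get("children", [])):
        yield from iter_images(child, path + (index,))


page = {
    "kind": "page",
    "children": [
        {"kind": "text"},
        {"kind": "container", "children": [{"kind": "image", "name": "whiteboard"}]},
        {"kind": "image", "name": "screenshot"},
    ],
}
found = [(path, node["name"]) for path, node in iter_images(page)]
```

Each image found this way goes through the same OCR, detection, and burn-in steps as a full scanned page, and the recorded path says where to put the redacted version back.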

Tool Options

Adobe Acrobat Pro DC

Acrobat can do this, but it takes multiple steps.

Process:

  1. Open scanned PDF
  2. Tools → Scan & OCR → Recognize Text
  3. Wait for OCR
  4. Tools → Redact → Mark for Redaction
  5. Search for patterns or manually select
  6. Apply redaction

Limitations:

  • OCR and redaction are completely separate
  • No automated PII detection (you search for patterns manually)
  • One document at a time
  • Limited handling of mixed content

Google Cloud Document AI

Enterprise-grade OCR with entity extraction.

Process:

  1. Upload document to Document AI processor
  2. Get structured data with entity annotations
  3. Build redaction logic on the extracted entities
  4. Apply redaction to source images

Limitations:

  • Cloud-only (your documents leave your environment)
  • Requires development work to build the redaction pipeline
  • Per-page cost for high volumes
  • No turnkey redacted PDF output

AWS Textract + Comprehend

Textract does OCR. Comprehend does entity detection.

Process:

  1. Send document to Textract
  2. Send extracted text to Comprehend for PII detection
  3. Map PII back to Textract coordinates
  4. Build redacted PDF using coordinates
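
Step 3 is where most of the integration work sits. Textract reports each word with a bounding box expressed as fractions of the page, and Comprehend reports each PII entity as character offsets into the text you sent it. The sketch below works on plain dictionaries shaped like those responses and makes no AWS calls; the helper names are made up, it assumes a single page, and it assumes you build the Comprehend input yourself by joining the WORD blocks so the offsets line up.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def build_text(blocks: List[Dict]):
    """Join WORD blocks into one string and remember each word's character span."""
    spans, parts, pos = [], [], 0
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        text = block["Text"]
        spans.append((pos, pos + len(text), block))
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def to_pixels(bbox: Dict[str, float], page_w: int, page_h: int) -> Box:
    """Convert a fractional box to pixels, rounding outward to avoid thin leaks."""
    left = math.floor(bbox["Left"] * page_w)
    top = math.floor(bbox["Top"] * page_h)
    right = math.ceil((bbox["Left"] + bbox["Width"]) * page_w)
    bottom = math.ceil((bbox["Top"] + bbox["Height"]) * page_h)
    return (left, top, right, bottom)


def entity_boxes(entities: List[Dict], spans, page_w: int, page_h: int):
    """For each PII entity, list the pixel boxes of the words it overlaps."""
    results = []
    for entity in entities:
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        boxes = [
            to_pixels(block["Geometry"]["BoundingBox"], page_w, page_h)
            for start, stop, block in spans
            if start < end and stop > begin
        ]
        results.append((entity["Type"], boxes))
    return results
```

The page width and height are those of the rendered page image you will draw on, so the same response can be mapped onto a 150 DPI preview or a 300 DPI master.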

Limitations:

  • Multi-service architecture requires integration
  • Cloud processing only
  • Building the image modification layer is your job
  • Complex for non-developers

PaperVeil

Unified OCR plus PII detection plus redaction.

Process:

  1. Upload scanned PDF
  2. Select PII types to detect (names, SSN, email, phone, address, DOB, credit cards)
  3. Add custom patterns or terms
  4. Execute redaction
  5. Download redacted PDF with audit manifest

How it handles scanned PDFs:

  • Automatic detection of image vs. native content
  • Built-in OCR for image pages
  • PII detection on extracted text
  • Burned-in redaction on source images
  • Mixed content handling without separate steps

Best Practices

Scan Quality Matters

Better scans mean better OCR. Better OCR means more accurate redaction.

Optimal settings:

  • Resolution: 300 DPI minimum (600 DPI for small text)
  • Color: Grayscale or black-and-white for text documents
  • Format: PDF or TIFF (avoid JPEG compression artifacts)
  • Orientation: Straight, not skewed
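
The resolution advice is easier to reason about with a little arithmetic. A typographic point is 1/72 of an inch, so the pixel size of text follows directly from the scan resolution. The function name below is just for illustration.

```python
def text_height_px(point_size: float, dpi: int) -> float:
    """Approximate pixel height of text: one point is 1/72 inch."""
    return point_size * dpi / 72


# 6-point fine print at three common scan resolutions.
for dpi in (150, 300, 600):
    print(dpi, text_height_px(6, dpi))
```

Halving the resolution halves the height of every character in pixels, which is why fine print that reads cleanly at 600 DPI can turn into OCR guesswork at 150.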

If you control the scanning process, invest in quality upfront.

Verify After Redaction

Automated redaction is highly accurate but not perfect. For sensitive documents:

  1. Open the redacted PDF
  2. Spot-check key sections
  3. Verify black boxes appear over sensitive areas
  4. Try to select or copy text in redacted areas (should fail)
  5. Check document metadata for residual information
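
Parts of that checklist can be automated. The sketch below assumes you already have the redacted page as rows of grayscale pixels and whatever text can still be extracted from the output (from a text layer, or from running OCR again on the redacted file). Both function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # rows of grayscale values, 0 = black


def region_is_blank(image: Image, box: Box, fill: int = 0) -> bool:
    """True when every pixel in the box equals the redaction fill value."""
    left, top, right, bottom = box
    return all(image[y][x] == fill for y in range(top, bottom) for x in range(left, right))


def leaked_terms(extracted_text: str, terms: List[str]) -> List[str]:
    """Return every sensitive term that still appears in the output text."""
    haystack = extracted_text.lower()
    return [term for term in terms if term.lower() in haystack]
```

A run passes when region_is_blank is true for every box in the manifest and leaked_terms comes back empty. Anything else goes to a human.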

Keep Your Originals

Store unredacted originals in a secure location:

  • Redaction is irreversible
  • You may need originals for legal or business purposes
  • Audit requirements may mandate original retention

Never redact your only copy.

Document Everything

For compliance, maintain records of:

  • What documents were redacted
  • What detection criteria were applied
  • What PII was found and removed
  • When processing occurred
  • Who initiated the redaction

Redaction manifests from tools like PaperVeil provide this automatically.

Test Before Production

Before processing sensitive documents:

  1. Run test documents through your workflow
  2. Verify OCR accuracy on your document types
  3. Confirm PII detection catches relevant patterns
  4. Check that redaction is truly permanent
  5. Validate output format meets downstream requirements

Scanned PDF Redaction for AI Workflows

A common use case for scanned PDF redaction is preparing documents for AI processing:

Workflow Pattern

Scanned documents arrive (email, upload, integration)
       ↓
[Document Analysis]
Detect native vs. image content
       ↓
[OCR Processing] (for image content)
Extract text with coordinates
       ↓
[PII Detection]
Find sensitive data in extracted text
       ↓
[Image Modification]
Burn redactions into source images
       ↓
[Sanitized PDF Output]
Ready for AI processing
       ↓
[LLM Analysis]
Summarize, extract, classify

Automation Example

Trigger: New email with PDF attachment
       ↓
Action: Send to redaction API (handles OCR automatically)
       ↓
Action: Send redacted PDF to Claude API
       ↓
Action: Deliver AI analysis to user
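
In code, the important property of that automation is ordering: the model only ever receives what the redaction step returned. A minimal sketch, with redact and analyze as stand-ins for whatever redaction API and LLM client you actually call:

```python
from typing import Callable


def handle_attachment(
    pdf_bytes: bytes,
    redact: Callable[[bytes], bytes],
    analyze: Callable[[bytes], str],
) -> str:
    """Redact first, then analyze. The original bytes never reach the model."""
    redacted = redact(pdf_bytes)  # if this raises, analyze is never called
    return analyze(redacted)
```

Passing the two steps in as functions keeps the handler easy to test with fakes, and a failure in redaction stops the pipeline instead of falling back to the unredacted file.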

The redaction layer handles all document types. No branching logic for scanned versus native.

The Bottom Line

Scanned PDFs are a significant portion of business documents. They require special handling for redaction because standard tools designed for native PDFs don't work on images.

Effective scanned PDF redaction requires OCR to extract text, PII detection on the extracted text with coordinates, image modification to permanently remove sensitive content, and mixed content handling for real-world documents.

The technology exists to handle this automatically. Tools like PaperVeil combine OCR, detection, and redaction in a single workflow, processing scanned, native, and mixed PDFs without separate steps for each type.

For organizations preparing documents for AI processing, automated scanned PDF redaction eliminates a significant barrier. Documents that couldn't be processed before (because they were "just images") become accessible while sensitive data stays protected.

The quality of your source scans determines the ceiling on redaction quality. But modern OCR handles most business documents effectively, and fuzzy, pattern-aware PII detection can still catch much of the sensitive data when OCR isn't perfect.

Don't let scanned PDFs block your document workflows. The right tools make image-based content as processable as native text.


PaperVeil lets you redact all your sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. The redaction layer that makes AI document processing actually safe.