How to Redact a Scanned PDF: OCR and Image-Based Redaction

Scanned PDFs present a unique redaction challenge. When you scan a document, the resulting PDF is essentially a picture of each page. There's no text layer for redaction tools to target. You can't select individual words. Search doesn't work. Standard redaction approaches that work on digital PDFs fail completely on scanned documents.

This matters because many of the most sensitive documents organizations handle are scans: medical records, contracts, historical files, legal documents from discovery, government records from archives. These documents often contain exactly the information that most needs redaction, stored in a format that resists standard redaction methods.

The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains how to redact scanned PDFs by hand, using OCR-based and image-based methods.

Why Standard Redaction Doesn't Work on Scans

When you create a PDF digitally (saving from Word, exporting from Excel), the resulting file contains actual text data. Each character is stored as text. Redaction tools can identify and remove this text.

When you scan a paper document, the scanner captures an image. The PDF stores this image. What looks like text is actually pixels in a picture. The PDF has no idea that those pixels spell "John Smith" or "123-45-6789." It just sees dark marks on a light background.

Standard redaction tools work by finding and removing text. With no text in the file, there's nothing for them to find or remove.
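
To make the difference concrete, here is a minimal Python sketch. It is a toy model, not a PDF parser: the Page class and its fields are hypothetical stand-ins for a digital page (which carries text) and a scanned page (which carries only pixels).

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical toy model; a real PDF page is far more complex.
    text_runs: list[str] = field(default_factory=list)      # empty for a raw scan
    pixels: list[list[int]] = field(default_factory=list)   # grayscale image rows


def find_text(page: Page, term: str) -> list[int]:
    """Return indexes of text runs containing the term.

    This is all a text-based redaction tool can see.
    """
    return [i for i, run in enumerate(page.text_runs) if term in run]


digital = Page(text_runs=["Patient: John Smith", "SSN: 123-45-6789"])
scanned = Page(pixels=[[255, 0, 0, 255], [255, 0, 255, 255]])  # dark marks only

print(find_text(digital, "John Smith"))  # [0]
print(find_text(scanned, "John Smith"))  # []
```

The scanned page may show the same name to a human reader, but a text search has nothing to match against.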

Two Approaches to Scanned PDF Redaction

Approach 1: OCR First, Then Redact

OCR (Optical Character Recognition) converts image-based text into actual text data. After OCR, the PDF contains both the original image and a searchable text layer. Redaction tools can then work with the text layer.

Advantages:

Creates searchable documents (useful beyond redaction)
Standard redaction workflows apply after OCR
Can use pattern-based search to find all instances of sensitive data
Text layer makes verification easier

Disadvantages:

OCR isn't perfect; some text may not be recognized
Handwritten content rarely converts well
Processing takes time on large documents
OCR errors could miss sensitive content
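
Pattern-based search and OCR errors interact: a strict pattern will miss a number that OCR misread. As a hedged illustration, the sketch below builds an SSN-shaped regular expression that also accepts a few characters OCR commonly confuses with digits. The look-alike set and the function name are assumptions for this example, not a complete or standard list, and the pattern is deliberately over-inclusive, so every hit still needs human review.

```python
import re

# Characters OCR commonly confuses with digits (illustrative, incomplete set).
DIGIT = r"[0-9OoIlSB]"
SEP = r"[-\s.]?"

# SSN-shaped pattern: 3 + 2 + 4 digit-like characters, optional separators,
# not embedded inside a longer alphanumeric token.
SSN_LIKE = re.compile(
    rf"(?<![0-9A-Za-z]){DIGIT}{{3}}{SEP}{DIGIT}{{2}}{SEP}{DIGIT}{{4}}(?![0-9A-Za-z])"
)


def find_ssn_candidates(ocr_text: str) -> list[str]:
    """Return substrings that look like SSNs, including likely OCR misreads."""
    return [m.group(0) for m in SSN_LIKE.finditer(ocr_text)]


sample = "SSN 123-45-6789 on file; prior SSN l23-4S-67B9 (faded copy)."
print(find_ssn_candidates(sample))  # ['123-45-6789', 'l23-4S-67B9']
```

A strict \d{3}-\d{2}-\d{4} pattern would find only the first number in this sample.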

Approach 2: Image-Based Redaction

Instead of converting to text, edit the image directly. Draw permanent boxes that become part of the image pixels. No text layer needed because you're modifying the picture itself.

Advantages:

Works regardless of image quality
No OCR errors to worry about
Direct and immediate
Works on handwritten content

Disadvantages:

Manual process (no pattern search)
Must visually identify all sensitive content
No automatic detection of names, SSNs, etc.
Harder to verify completeness
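
What "modifying the picture itself" means can be sketched in a few lines of Python. The page here is a small grid of grayscale values and redact_region is a hypothetical helper written for this example. Real tools operate on full image files, but the principle is the same: the original pixel values are overwritten, so nothing remains underneath the box.

```python
def redact_region(pixels: list[list[int]], x0: int, y0: int,
                  x1: int, y1: int, fill: int = 0) -> None:
    """Overwrite pixels in the half-open box [x0, x1) x [y0, y1) in place."""
    for y in range(max(y0, 0), min(y1, len(pixels))):
        row = pixels[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            row[x] = fill


# A tiny grayscale "page": 255 is white paper, lower values are ink.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  30,  40,  35, 255, 255],
    [255,  25, 255,  45, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

redact_region(page, x0=1, y0=1, x1=4, y1=3)
for row in page:
    print(row)
```

After the call, rows 1 and 2 hold 0 in columns 1 through 3. The ink values that were there are gone from the data, not hidden behind another object.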

Method 1: OCR Then Redact in Adobe Acrobat

Adobe Acrobat Pro combines OCR and redaction in one application.

Step 1: Run OCR

Open the scanned PDF in Acrobat Pro
Go to Tools > Scan & OCR
Click Recognize Text > In This File
Choose settings:
- Output: Searchable Image (keeps original appearance)
- Language: Select document language
- Downsample: Leave default unless file size is a concern
Click Recognize Text
Wait for processing (can take minutes for large documents)

Step 2: Verify OCR Quality

Before redacting, verify OCR worked correctly:

Press Cmd+F (Mac) or Ctrl+F (Windows)
Search for terms you know are in the document
If search finds them, OCR recognized at least those terms
If search fails, OCR settings may need adjustment or the scan quality may be too poor
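
If you can export the recognized text (for example, by saving the PDF as plain text), the same check can be scripted for a longer list of terms. The sketch below assumes the OCR output is already available as a string; normalize and missing_terms are hypothetical helper names.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not hide matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_terms(ocr_text: str, known_terms: list[str]) -> list[str]:
    """Return the known terms that cannot be found in the OCR text."""
    haystack = normalize(ocr_text)
    return [t for t in known_terms if normalize(t) not in haystack]


ocr_text = "Patient Name: John\nSmith   Date of Birth: 01/02/1960"
print(missing_terms(ocr_text, ["John Smith", "Date of Birth", "Diagnosis"]))
# ['Diagnosis']
```

An empty result only shows that the listed terms were recognized. It says nothing about text you did not think to check.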

Step 3: Redact Using Standard Tools

Go to Tools > Redact
Use Find Text to search for sensitive patterns:
- Social Security numbers
- Names you need to redact
- Account numbers
- Dates
Mark found items for redaction
Manually review for items OCR might have missed
Click Apply to permanently remove content
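
Under the hood, marking found items amounts to mapping matched words to rectangles on the page. The following is a simplified sketch of that idea, not Acrobat's implementation: it assumes an OCR engine has already produced words with bounding boxes (the Word tuple and its coordinates are made up for illustration) and it handles single-word matches only.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical OCR output: text plus a bounding box in page units.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
}


def mark_for_redaction(words: list[Word], names: tuple[str, ...] = ()):
    """Return (label, box) pairs for words matching a pattern or a listed name."""
    wanted = {n.lower() for n in names}
    marks = []
    for w in words:
        token = w.text.strip(".,;:()")  # ignore surrounding punctuation
        label = next((k for k, p in PATTERNS.items() if p.match(token)), None)
        if label is None and token.lower() in wanted:
            label = "name"
        if label:
            marks.append((label, (w.x0, w.y0, w.x1, w.y1)))
    return marks


words = [
    Word("SSN:", 72, 100, 100, 112),
    Word("123-45-6789,", 104, 100, 180, 112),
    Word("Smith", 72, 120, 105, 132),
    Word("seen", 110, 120, 135, 132),
    Word("01/02/2020.", 140, 120, 200, 132),
]
for label, box in mark_for_redaction(words, names=("Smith",)):
    print(label, box)
```

Marks are only proposals at this stage. Nothing is removed until the redaction is applied, which is why the manual review step comes before Apply.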

Step 4: Remove Hidden Information

Click Remove Hidden Information in the Redact toolbar
This removes:
- The OCR text layer (important!)
- Metadata
- Hidden content

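Removing the OCR text layer is critical. If any sensitive content was covered only in the image (with a drawn box rather than an applied redaction), the hidden text layer still contains the original text. The text layer must be removed or the document flattened before release.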

Step 5: Save and Verify

Save with a new filename
Verify redaction by:
- Trying to select "redacted" areas
- Searching for redacted terms
- Checking that the text layer is gone

Method 2: Image-Based Redaction

For documents where OCR won't work well (handwritten, poor quality, mixed content), edit the image directly.

Option A: Using Adobe Acrobat

Open the scanned PDF
Go to Tools > Edit PDF
This allows editing the page as an image
Use drawing tools to place permanent boxes over sensitive content
Flatten annotations: File > Print to PDF, or use Flatten tool
Verify the original content is gone (not just covered)

Note: This method in Acrobat still requires care. Make sure you're editing the image layer, not adding annotations on top.

Option B: Using Image Editor

For complete control:

Export PDF pages as images:
- In Acrobat: File > Export To > Image
- Choose PNG or TIFF for quality
Open images in Photoshop, GIMP, or similar
Use brush or shape tools to paint over sensitive content
Save the edited images
Create new PDF from edited images:
- In Acrobat: File > Create > PDF from File
- Select all edited images

This method destroys the original content rather than covering it, provided you paint with a fully opaque brush and save a flattened image (no separate layers).
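
One way to catch a soft or semi-transparent brush is to confirm that every pixel inside the painted area has exactly the fill value. The sketch below works on a plain grid of grayscale values; region_is_solid is a hypothetical helper, and it assumes the box lies within the image bounds.

```python
def region_is_solid(pixels: list[list[int]], x0: int, y0: int,
                    x1: int, y1: int, fill: int = 0) -> bool:
    """True if every pixel in the half-open box [x0, x1) x [y0, y1) equals fill."""
    return all(pixels[y][x] == fill
               for y in range(y0, y1)
               for x in range(x0, x1))


# One pixel (value 12) was only partly covered, e.g. by a feathered brush edge.
painted = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,  12, 255],
]
print(region_is_solid(painted, 1, 1, 3, 3))  # False

painted[2][2] = 0  # repaint with a fully opaque fill
print(region_is_solid(painted, 1, 1, 3, 3))  # True
```

Any value other than the fill inside the box means some trace of the original image may have survived the edit.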

Option C: Using Preview (Mac) - Limited Security

For low-sensitivity documents:

Open PDF in Preview
Use annotation tools to cover content
Export as JPEG or PNG (File > Export)
This flattens annotations into the image
Create new PDF from the image

This provides some protection but isn't forensically secure. Use only when proper tools aren't available.

Special Considerations for Scanned Documents

Quality Issues

Poor quality scans create multiple problems:

OCR accuracy drops dramatically
Faded text may be missed entirely
Skewed pages complicate processing

Before redacting poor quality scans:

Try enhancing the image (contrast, sharpening)
De-skew rotated pages
Consider re-scanning if possible
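
Contrast enhancement is the simplest of these steps to illustrate. The sketch below applies a linear contrast stretch to a small grid of grayscale values. It is one basic technique among many, shown here as an assumption about what "enhancing" could mean, and it does not cover sharpening or de-skewing.

```python
def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
    """Linearly rescale values so the darkest becomes 0 and the lightest 255."""
    flat = [v for row in pixels for v in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in pixels]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in pixels]


# Faded scan: "ink" at 150-180 on "paper" at 200.
faded = [[200, 180, 200], [200, 150, 200]]
print(stretch_contrast(faded))  # [[255, 153, 255], [255, 0, 255]]
```

After stretching, the faintest ink is much further from the paper value, which gives OCR more to work with.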

Mixed Content

Some scanned documents contain:

Typed text (converts well with OCR)
Handwritten notes (converts poorly)
Stamps and signatures (usually ignored by OCR)

For mixed content:

Run OCR for typed content
Manually review for handwritten information
Use image-based redaction for content OCR missed
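
Many OCR engines report a confidence score per word, which offers one way to decide what goes to manual review. The sketch below assumes such scores are available on a 0 to 100 scale; the OcrWord tuple, the scale, and the threshold of 80 are all assumptions for illustration, and a high score does not prove the word was read correctly.

```python
from typing import NamedTuple


class OcrWord(NamedTuple):
    text: str
    confidence: float  # assumed 0-100 scale; engines differ


def split_by_confidence(words: list[OcrWord], threshold: float = 80.0):
    """Split words into (trusted, needs_review) around a confidence threshold."""
    trusted = [w for w in words if w.confidence >= threshold]
    review = [w for w in words if w.confidence < threshold]
    return trusted, review


words = [
    OcrWord("Invoice", 96.0),
    OcrWord("Smith", 91.5),
    OcrWord("J0hn", 42.0),   # likely handwriting or a smudge
    OcrWord("~~", 10.0),     # likely a stamp or signature fragment
]
trusted, review = split_by_confidence(words)
print([w.text for w in review])  # ['J0hn', '~~']
```

Regions that produce low-confidence words are good candidates for image-based redaction after a human has looked at them.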

Multi-Page Documents

Large scanned documents (hundreds of pages) require efficient workflows:

Run OCR on entire document at once
Use pattern-based search to find all instances
Create a checklist of sensitive terms to search
Process in batches if system performance suffers
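
Splitting a large document into page ranges is simple arithmetic, sketched below. The function name and the batch size of 200 are arbitrary choices for this example; pick a size that suits your hardware.

```python
from collections.abc import Iterator


def page_batches(page_count: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive 1-based (first, last) page ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


print(list(page_batches(450, 200)))  # [(1, 200), (201, 400), (401, 450)]
```

Run the same checklist of search terms against every batch so that no range is processed with a shorter list than the others.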

Color vs. Black and White

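Some scanners produce black and white output. This can:

Improve OCR accuracy (higher contrast)
Reduce file size

One drawback: black redaction boxes are harder to tell apart from dark areas of the page, so redaction marks can be less visible during verification.

For redaction purposes, high-contrast black and white scans often work best, provided you check the redaction marks carefully.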

Verification Is Critical

Scanned PDF redaction requires extra verification because:

OCR errors may have missed sensitive content
Image editing may not have fully destroyed content
Multiple layers might contain copies of original content

Visual Review

Scroll through every page looking for:

Sensitive content that wasn't redacted
Redaction marks that appear different (suggesting they're annotations, not edits)
Handwritten content that automated tools missed

Layer Check

In PDF software, check for layers:

Look at Layer panel
Toggle layers on/off
Hidden layers might contain original content

Text Extraction Test

Try to extract text from the final document:

Use Edit > Copy on "redacted" areas
Search for redacted terms
Export to plain text and search

If redacted text can still be found, redaction wasn't complete.
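
Once the final document has been exported to plain text, the search can be automated. The sketch below assumes the exported text is available as a string. The leaked_terms helper is hypothetical, and its digits-only comparison is intentionally loose: it may flag a coincidental run of digits, which is the safer direction for a leak check.

```python
import re


def digits_only(s: str) -> str:
    return re.sub(r"\D", "", s)


def leaked_terms(extracted_text: str, redacted_terms: list[str]) -> list[str]:
    """Return redacted terms still present in the extracted text.

    Matches case-insensitively, and also compares numeric terms by their
    digits so that '123 45 6789' is caught when '123-45-6789' was redacted.
    """
    text_lower = extracted_text.lower()
    text_digits = digits_only(extracted_text)
    leaks = []
    for term in redacted_terms:
        term_digits = digits_only(term)
        if term.lower() in text_lower:
            leaks.append(term)
        elif len(term_digits) >= 6 and term_digits in text_digits:
            leaks.append(term)
    return leaks


extracted = "Claim filed by [REDACTED]. SSN 123 45 6789 on record."
print(leaked_terms(extracted, ["John Smith", "123-45-6789"]))  # ['123-45-6789']
```

Any non-empty result means the document should go back for another redaction pass before release.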

Raw File Analysis

For critical documents:

Open the PDF in a text editor
Search for strings that should be redacted
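Sensitive data in the raw file means redaction failed. A clean search is not proof of success on its own, because PDF content streams are usually compressed and will not appear as readable text in an editor.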
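
A script can look one level deeper than a text editor by inflating compressed streams before searching. The sketch below is a rough heuristic using only Python's standard library, not a PDF parser. It assumes ASCII search terms and Flate-compressed streams, and it will miss text in encrypted files, in object streams with other filters, or stored with custom font encodings. The raw_pdf_hits name and the fake file contents are made up for the demonstration.

```python
import re
import zlib

STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def raw_pdf_hits(pdf_bytes: bytes, terms: list[str]) -> dict[str, bool]:
    """Report whether each term appears in the raw bytes or an inflated stream."""
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before 'endstream'
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (for example a JPEG image stream); skip it
    blob = b"\n".join(chunks)
    return {t: t.encode("utf-8") in blob for t in terms}


# Minimal fake file fragment with one compressed stream, for demonstration only.
fake = (
    b"%PDF-1.7\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"BT (John Smith) Tj ET")
    + b"\nendstream\nendobj\n"
)
print(raw_pdf_hits(fake, ["John Smith", "Jane Doe"]))
# {'John Smith': True, 'Jane Doe': False}
```

In a real file you would read the bytes with open(path, "rb").read(). A hit is conclusive; a miss only narrows the possibilities.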

Industry Applications

Medical Records

Healthcare organizations frequently redact scanned medical records for:

Releasing records to third parties
Research with de-identified data
Legal discovery
Insurance audits

For medical records:

For de-identification under the HIPAA Safe Harbor method, all 18 identifier categories must be found and removed
Handwritten physician notes need special attention
Historical records (poor quality scans) require extra care
Document the redaction process for compliance

Legal Discovery

Law firms process thousands of scanned pages for discovery:

Privilege review before production
Personal information removal
Redacting irrelevant content

For legal discovery:

Establish consistent procedures across the document set
Use Bates numbering before redaction
Maintain privilege logs corresponding to redactions
Consider review platforms with built-in redaction
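
Bates numbers are sequential identifiers, typically a prefix followed by a zero-padded counter. As a small illustration of generating them before redaction begins, the sketch below uses a made-up prefix and a six-digit width; actual formats are set by the parties or the court.

```python
def bates_numbers(prefix: str, start: int, count: int, width: int = 6) -> list[str]:
    """Return sequential Bates labels such as 'ABC000001'."""
    if start < 0 or count < 0:
        raise ValueError("start and count must be non-negative")
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


labels = bates_numbers("ABC", 1, 3)
print(labels)  # ['ABC000001', 'ABC000002', 'ABC000003']
```

Assigning the labels first means each later redaction and privilege log entry can refer to a stable page identifier.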

Government FOIA

Government agencies process scanned historical documents:

Declassification of archived materials
FOIA request fulfillment
Records release programs

For government use:

Apply exemption codes consistently
Document exemption basis for each redaction
Maintain segregability analysis
Consider specialized government redaction tools
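
Documenting the basis for each redaction is easier when every redaction is recorded in a structured log. The sketch below shows one possible shape for such a log; the field names are assumptions rather than an agency standard, and the exemption code "(b)(6)" appears only as an illustrative value. The source file's SHA-256 hash ties the log to the exact file that was redacted.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RedactionEntry:
    page: int
    box: tuple[float, float, float, float]  # x0, y0, x1, y1 on the page
    exemption: str                          # e.g. "(b)(6)"
    basis: str                              # short justification


def build_log(source_bytes: bytes, entries: list[RedactionEntry]) -> str:
    """Return a JSON log tying each redaction to the hashed source file."""
    for e in entries:
        if not e.exemption or not e.basis:
            raise ValueError(f"page {e.page}: exemption and basis are required")
    record = {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "redactions": [asdict(e) for e in entries],
    }
    return json.dumps(record, indent=2)


log_json = build_log(
    b"demo file contents",
    [RedactionEntry(3, (72.0, 100.0, 180.0, 112.0), "(b)(6)",
                    "Home address of a private individual")],
)
print(log_json)
```

Rejecting entries without an exemption or basis enforces consistency at the time of redaction rather than during a later audit.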

Tools Comparison for Scanned PDFs

Feature	Acrobat Pro	ABBYY FineReader	PDFpen Pro
OCR Quality	Excellent	Excellent	Good
Pattern Search	Yes	Yes	Limited
Image Redaction	Yes	Limited	Yes
Batch Processing	Yes	Yes	No
Price	Subscription	One-time + subscription	One-time

ABBYY FineReader deserves special mention for OCR quality. For documents where Adobe's OCR struggles, ABBYY often produces better results.

Summary

Scanned PDFs require special handling because standard redaction tools target text that doesn't exist in image-based documents.

Best approach for quality scans:

Run OCR to create text layer
Use text-based redaction tools
Remove OCR layer after redaction
Verify thoroughly

Best approach for poor quality or handwritten:

Edit the image directly in an image editor
Paint over sensitive content permanently
Create new PDF from edited images
Original content is physically destroyed

Regardless of method:

Verify redaction worked (selection test, search test)
Check for hidden layers containing originals
Keep unredacted copies secure
Document your process

Scanned document redaction takes more time than digital PDF redaction. The extra effort is necessary because the alternative is releasing documents with sensitive information that appears redacted but remains accessible in the file.

PaperVeil lets you redact all your sensitive information from PDFs in a simple drag and drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. The redaction layer that makes AI document processing actually safe.