Last month a paralegal called me in a mild panic. She had 200 pages of discovery documents that needed redaction before going to opposing counsel. Standard process. She opened the PDF in Acrobat, tried to select the first name she wanted to redact, and nothing happened.
The document was scanned. Not a digital PDF. Just pictures of paper. Two hundred pages of pictures.
"It won't let me select anything," she said.
I asked how the documents were created. Turned out a client had photographed contracts on their phone, emailed the photos, and someone converted them to PDF. Every character was pixels. Not text. Pixels.
This is the scanned PDF problem, and if you work with documents, you've probably hit it. Tax returns from a scanner. Contracts signed and photographed. Medical records that came through a fax machine. Historical documents digitized years ago. Anything that passed through a copier, scanner, fax, or camera.
Standard redaction tools don't work on these documents. The PDF redaction feature is looking for text objects, and there are no text objects. There's one big image per page with a bunch of colored pixels that happen to form letters.
Redacting scanned PDFs requires OCR (Optical Character Recognition) to find the text first, then redaction that actually burns the changes into the image. Let me show you how this works.
The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains why scanned PDFs break standard redaction and what a working pipeline looks like.
Why Standard Redaction Fails on Scanned PDFs
The confusion comes from what a PDF actually contains.
Native PDFs store actual text data. Each character is a text object with a font, position, and encoding. When you drag your cursor to select text, you're selecting those objects. Redaction tools remove the objects and draw black boxes where they used to be.
Image PDFs store pictures of text. The whole page is a raster image, like a JPEG. What looks like the letter "A" is just a pattern of dark and light pixels. There's nothing for the selection cursor to grab.
Mixed PDFs have both. Some pages are native text. Others are scanned images. Some pages have native text with embedded images that also contain text (like a screenshot or photo pasted into a Word document that got exported to PDF).
When you try to redact a scanned PDF the way you would a native one, for example by searching for a name in Adobe Acrobat and marking the hits:
- The tool looks for text objects in the PDF structure
- It finds nothing. The PDF contains one image per page.
- You can draw a black box on top of the image with a shape or annotation tool
- But you're adding a layer, not removing data
The original pixels sit right underneath. Someone can delete the overlay, extract the image, or examine the PDF structure. The data isn't gone.
This is why so many "redacted" documents have failed. People draw black boxes thinking they've removed information. They've hidden it. That's different.
True scanned PDF redaction requires:
- Converting images to text (OCR)
- Finding the sensitive text and its location
- Modifying the actual image (not just adding a layer on top)
- Burning the redaction into the image permanently
How Many Documents Are Actually Scanned?
More than you'd think.
A 2022 survey of enterprise document processing found that 40 to 60 percent of incoming documents are image-based or mixed format. In some industries it's higher:
Legal: Discovery documents, signed contracts, court filings, faxed correspondence
Healthcare: Patient intake forms, insurance cards, referral letters, signed consents
Finance: Tax documents, signed agreements, bank statements, loan applications
HR: Employment applications, identification documents, signed policies
Real Estate: Signed contracts, property records, title documents, inspections
If your redaction workflow only handles native PDFs, you're missing a huge chunk of your document flow. And if you're trying to prepare documents for AI processing, those scanned documents are exactly the ones most likely to contain sensitive information (because they're signed forms and official records).
How OCR-Based Redaction Actually Works
Effective scanned PDF redaction follows a specific sequence:
Step 1: Figure Out What You're Dealing With
Before processing, analyze the document:
Document Analysis Result:
- Total pages: 12
- Native text pages: 7
- Image-only pages: 4
- Mixed pages: 1
- Embedded images with text: 3
- Image resolution: 300 DPI average
- Languages detected: English
This tells you what processing each page needs:
- Native text pages: Standard redaction
- Image pages: OCR plus image modification
- Mixed pages: Both approaches
- Embedded images: Extract, process, re-embed
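The routing decision above can be sketched as a small classifier. This is a minimal sketch, not a reference implementation: `PageStats`, `classify_page`, and the `min_chars` threshold are hypothetical names and values, and the two measurements would come from whatever PDF library you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Per-page measurements (hypothetical; extraction depends on your PDF library)."""
    text_chars: int        # characters found in native text objects
    image_coverage: float  # fraction of the page area covered by raster images


def classify_page(stats: PageStats, min_chars: int = 20) -> str:
    """Map a page to the processing path it needs."""
    has_text = stats.text_chars >= min_chars
    has_image = stats.image_coverage > 0.0
    if has_text and has_image:
        return "mixed"    # standard redaction plus OCR on the images
    if has_text:
        return "native"   # standard redaction
    if has_image:
        return "image"    # OCR plus image modification
    return "empty"        # nothing to process


pages = [PageStats(1800, 0.0), PageStats(0, 1.0), PageStats(950, 0.3)]
print([classify_page(p) for p in pages])
```

A scan that already carries an invisible OCR text layer would land in "mixed" under this rule, which is the safe side: both the text layer and the image get processed.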
Step 2: OCR Processing
Optical Character Recognition turns image pixels into text.
How OCR works:
- Preprocessing: Straighten the image, reduce noise, improve contrast
- Text detection: Find regions containing text
- Character recognition: Convert pixel patterns to letters
- Post-processing: Spell check, reconstruct layout
What OCR gives you:
- Extracted text (the actual characters)
- Bounding boxes (the X,Y coordinates where each word appears on the page)
- Confidence scores (how sure the engine is about each recognition)
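Those three outputs fit naturally into one record per word. The sketch below assumes a hypothetical `OcrWord` type and a 0.0 to 1.0 confidence scale; real engines report confidence on different scales, so normalize before applying a threshold like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels
    confidence: float                # assumed normalized to 0.0-1.0


def split_by_confidence(words, threshold=0.80):
    """Separate words the engine is sure about from ones that need a second look."""
    trusted = [w for w in words if w.confidence >= threshold]
    uncertain = [w for w in words if w.confidence < threshold]
    return trusted, uncertain


words = [
    OcrWord("John", (120, 340, 175, 365), 0.97),
    OcrWord("Smith", (185, 340, 250, 365), 0.62),
]
trusted, uncertain = split_by_confidence(words)
print([w.text for w in trusted], [w.text for w in uncertain])
```

Uncertain words should not simply be dropped. They are the ones most likely to hide a misread name or number, so they are candidates for conservative redaction or human review.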
Modern OCR engines like Tesseract, Google Cloud Vision, and AWS Textract can reach 95 to 99 percent accuracy on clean, printed scans. Accuracy drops with poor scans, low resolution, weird fonts, handwriting, and multi-column layouts, so measure it on your own document types rather than assuming the best case.
Step 3: Find the Sensitive Data
Once you have text with coordinates, standard PII detection applies:
Named Entity Recognition:
- "John Smith" found at coordinates (120, 340, 250, 365)
- "Acme Corporation" found at (80, 420, 280, 445)
- "123 Main Street" found at (300, 500, 500, 525)
Pattern Matching:
- SSN pattern found at (150, 600, 280, 625)
- Phone pattern found at (320, 600, 450, 625)
- Email pattern found at (100, 680, 350, 705)
The key difference from native PDF detection: you need coordinates, not just the text itself. Those coordinates tell you where to burn the redaction into the image.
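Here is one way to carry coordinates through pattern matching: join the OCR words into a string, remember each word's character offsets, and map every regex match back to the boxes of the words it touches. This is a sketch under simple assumptions (one line of text at a time, two illustrative patterns); `OcrWord`, `PATTERNS`, and `find_pii` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels


# Illustrative patterns only; production rules need more variants.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii(words):
    """Return (label, bbox) pairs for pattern matches in one line of OCR words."""
    spans = []  # (start, end, word) character offsets in the joined text
    pos = 0
    for w in words:
        spans.append((pos, pos + len(w.text), w))
        pos += len(w.text) + 1  # +1 for the joining space
    text = " ".join(w.text for w in words)

    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Collect the boxes of every word the match overlaps, then union them.
            boxes = [w.bbox for start, end, w in spans
                     if start < m.end() and end > m.start()]
            hits.append((label, (
                min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes),
            )))
    return hits


line = [
    OcrWord("SSN:", (100, 600, 140, 625)),
    OcrWord("123-45-6789", (150, 600, 280, 625)),
]
print(find_pii(line))
```

Processing line by line keeps the union box tight. If a match were allowed to span two lines, the union would cover everything between them, which over-redacts but never under-redacts.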
Step 4: Modify the Image
This is where scanned PDF redaction differs fundamentally from native PDF redaction.
For native PDFs, you delete text objects. For scanned PDFs, you modify the image:
Pixel burn-in:
- Load the page image into memory
- Draw filled black rectangles at the bounding box coordinates
- Re-encode the modified image (the rectangles overwrite the pixel values themselves, so the saved image no longer contains the original content)
- Replace the page in the PDF
Image reconstruction:
- Extract all text via OCR
- Rebuild the image without sensitive regions
- Replace the original image
Flattening:
- Add redaction overlays to the image
- Flatten all layers into a single image
- Re-encode and replace
All approaches share one requirement: the original pixels must be destroyed, not just covered. When you're done, there should be no way to recover what was underneath.
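To make "destroyed, not covered" concrete, the sketch below performs the burn-in step on a bare grayscale raster, represented as a list of `bytearray` rows so it runs without any imaging library. `burn_in` is a hypothetical helper; in practice you would do the same overwrite with your imaging library of choice and then re-encode the page image.

```python
def burn_in(pixels, boxes, fill=0):
    """Overwrite pixel values inside each (x0, y0, x1, y1) box, in place.

    pixels: list of bytearray rows, one byte per pixel (8-bit grayscale).
    x1 and y1 are exclusive, matching Python slice conventions.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the image so bad coordinates cannot raise or wrap.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        if x0 >= x1 or y0 >= y1:
            continue
        for y in range(y0, y1):
            pixels[y][x0:x1] = bytes([fill]) * (x1 - x0)
    return pixels


# A tiny 10x6 "page" of light gray pixels with one redaction box.
page = [bytearray([200] * 10) for _ in range(6)]
burn_in(page, [(2, 1, 5, 3)])
for row in page:
    print(list(row))
```

After this runs there is no second layer to peel away: the old values inside the box are gone from the raster. When writing the PDF back out, also make sure the original image object is not left behind in the file, for example through an incremental save.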
Step 5: Generate Output
The final redacted PDF contains:
- Modified images with burned-in redactions
- Optionally, an invisible text layer for searchability
- Metadata stripped of sensitive information
- An audit manifest showing what was redacted
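An audit manifest can be as simple as a JSON document. The sketch below is one possible shape, not PaperVeil's actual format: the field names and the `build_manifest` helper are assumptions. The one design choice worth copying is that the manifest records the type and location of each redaction but never the redacted value itself.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(source_bytes, redactions, operator):
    """Summarize a redaction run without storing any redacted text."""
    return {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "redaction_count": len(redactions),
        "redactions": [
            # Copy only non-sensitive fields; any "text" key is deliberately dropped.
            {"page": r["page"], "type": r["type"], "bbox": list(r["bbox"])}
            for r in redactions
        ],
    }


found = [{"page": 3, "type": "ssn", "bbox": (150, 600, 280, 625), "text": "123-45-6789"}]
manifest = build_manifest(b"%PDF-1.7 example bytes", found, operator="jdoe")
print(json.dumps(manifest, indent=2))
```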
The Hard Parts (And How to Handle Them)
OCR Misreads
Problem: OCR makes mistakes, especially on poor quality scans.
"John Smith" might OCR as "J0hn 5mith" (zero instead of O, five instead of S).
If your PII detection uses exact pattern matching, it misses the mangled version.
Solutions:
- Fuzzy matching for names (allow close matches, not just exact)
- Multiple pattern variants for structured data
- Lower the confidence threshold (catch more, review more)
- Improve image quality before OCR
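Fuzzy matching for a known name can be sketched with the standard library's `difflib`. The helper name `fuzzy_find` and the 0.75 threshold are assumptions to tune on your own documents; a lower threshold catches more misreads and produces more false positives to review.

```python
from difflib import SequenceMatcher


def fuzzy_find(term, tokens, threshold=0.75):
    """Slide a window the size of `term` over OCR tokens and score each window.

    Returns (start_index, end_index, score) for windows at or above the threshold.
    """
    n = len(term.split())
    target = term.lower()
    hits = []
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n]).lower()
        score = SequenceMatcher(None, target, window).ratio()
        if score >= threshold:
            hits.append((i, i + n, score))
    return hits


tokens = "Signed by J0hn 5mith on Friday".split()
hits = fuzzy_find("John Smith", tokens)
print(hits)
```

Exact matching would miss "J0hn 5mith" entirely. The similarity score tolerates the two swapped characters while still rejecting unrelated word pairs, and the returned token indices let you look up the bounding boxes to redact.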
Handwritten Content
Problem: Handwriting is much harder to OCR than print.
A form might have printed labels (OCR'd correctly) and handwritten values (partially recognized or missed entirely).
Solutions:
- Use OCR engines with handwriting support
- Redact entire form fields, not just specific text
- Take a conservative approach: redact areas where handwriting appears
- Flag documents with handwriting for human review
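The "redact entire form fields" approach can be sketched as a rectangle overlap test: if any low-confidence OCR box falls inside a known field, redact the whole field rather than trusting the recognized text. The field layout here is a made-up template, and `overlaps` and `zones_to_redact` are hypothetical helpers.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def zones_to_redact(field_zones, uncertain_boxes):
    """Return (name, box) for every form field touched by an uncertain OCR box."""
    return [
        (name, zone)
        for name, zone in field_zones.items()
        if any(overlaps(zone, box) for box in uncertain_boxes)
    ]


# Hypothetical template for one form type: field name -> value area on the page.
field_zones = {
    "name": (100, 300, 500, 340),
    "ssn": (100, 360, 500, 400),
    "date": (100, 420, 300, 460),
}
uncertain_boxes = [(120, 365, 260, 395)]  # e.g. a low-confidence handwritten entry
print(zones_to_redact(field_zones, uncertain_boxes))
```

This trades precision for safety. You lose some non-sensitive content in the field, but a half-recognized handwritten SSN cannot slip through because the detector never had to read it.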
Multi-Column Layouts
Problem: Complex layouts confuse OCR engines.
Text from different columns gets merged or sequenced incorrectly. Pattern detection fails because the SSN digits ended up interleaved with text from the adjacent column.
Solutions:
- Layout analysis before character recognition
- Column detection with separate processing for each
- Document-type-specific configurations
- Region-based redaction (redact areas rather than specific text)
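Column detection can be sketched as a sweep over word boxes sorted by their left edge, starting a new column whenever a wide horizontal gap appears. This is a deliberately simple heuristic that assumes a clear gutter between columns; `group_columns` and the `min_gap` value are assumptions, and real layout analysis handles far messier pages.

```python
def group_columns(words, min_gap=40):
    """Group (text, (x0, y0, x1, y1)) words into columns, each in reading order."""
    columns = []
    right_edge = None
    for word in sorted(words, key=lambda w: w[1][0]):
        x0, _, x1, _ = word[1]
        if right_edge is None or x0 - right_edge > min_gap:
            columns.append([])  # gutter found: start a new column
            right_edge = x1
        else:
            right_edge = max(right_edge, x1)
        columns[-1].append(word)
    # Within a column, read top to bottom, then left to right.
    return [
        [text for text, _ in sorted(col, key=lambda w: (w[1][1], w[1][0]))]
        for col in columns
    ]


words = [
    ("SSN:", (50, 100, 100, 120)),
    ("123-45-6789", (110, 100, 230, 120)),
    ("Notes", (400, 100, 460, 120)),
    ("Name:", (50, 140, 110, 160)),
    ("follow", (400, 140, 460, 160)),
]
print(group_columns(words))
```

Read row by row, the first line of this page would come out as "SSN: 123-45-6789 Notes". Splitting into columns first keeps each column's text contiguous, so pattern detection sees values in the order a person would read them.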
Mixed Native and Image Content
Problem: A single document has native text on some pages and scanned images on others. Or a single page has both.
Solutions:
- Analyze each page individually
- Detect layers on mixed pages
- Use a unified pipeline that handles both types
- Extract and separately process embedded images
Images Inside Images
Problem: A native PDF contains an embedded photo that itself shows text. Like a screenshot pasted into a document, or a photo of a whiteboard in meeting notes.
Standard redaction handles the native text. The text in the embedded photo gets missed.
Solutions:
- Recursively extract embedded images
- Process extracted images through the OCR pipeline
- Replace embedded images with redacted versions
- Flag documents with embedded images for review
Tool Options
Adobe Acrobat Pro DC
Acrobat can do this, but it takes multiple steps.
Process:
- Open scanned PDF
- Tools → Scan & OCR → Recognize Text
- Wait for OCR
- Tools → Redact → Mark for Redaction
- Search for patterns or manually select
- Apply redaction
Limitations:
- OCR and redaction are completely separate
- No automated PII detection (you search for patterns manually)
- One document at a time
- Limited handling of mixed content
Google Cloud Document AI
Enterprise-grade OCR with entity extraction.
Process:
- Upload document to Document AI processor
- Get structured data with entity annotations
- Build redaction logic on the extracted entities
- Apply redaction to source images
Limitations:
- Cloud-only (your documents leave your environment)
- Requires development work to build the redaction pipeline
- Per-page cost for high volumes
- No turnkey redacted PDF output
AWS Textract + Comprehend
Textract does OCR. Comprehend does entity detection.
Process:
- Send document to Textract
- Send extracted text to Comprehend for PII detection
- Map PII back to Textract coordinates
- Build redacted PDF using coordinates
Limitations:
- Multi-service architecture requires integration
- Cloud processing only
- Building the image modification layer is your job
- Complex for non-developers
PaperVeil
Unified OCR plus PII detection plus redaction.
Process:
- Upload scanned PDF
- Select PII types to detect (names, SSN, email, phone, address, DOB, credit cards)
- Add custom patterns or terms
- Execute redaction
- Download redacted PDF with audit manifest
How it handles scanned PDFs:
- Automatic detection of image vs. native content
- Built-in OCR for image pages
- PII detection on extracted text
- Burned-in redaction on source images
- Mixed content handling without separate steps
Best Practices
Scan Quality Matters
Better scans mean better OCR. Better OCR means more accurate redaction.
Optimal settings:
- Resolution: 300 DPI minimum (600 DPI for small text)
- Color: Grayscale or black-and-white for text documents
- Format: PDF or TIFF (avoid JPEG compression artifacts)
- Orientation: Straight, not skewed
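You can check resolution on documents you did not scan yourself. PDF page sizes are expressed in points (by default 1/72 inch), so dividing an embedded image's pixel dimensions by the page size in inches gives its effective DPI. The function names below are illustrative, and the calculation assumes the image fills the page.

```python
POINTS_PER_INCH = 72  # default PDF user space unit


def effective_dpi(pixel_width, pixel_height, page_width_pt, page_height_pt):
    """Effective resolution of a full-page image, limited by its weaker axis."""
    dpi_x = pixel_width / (page_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (page_height_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)


def meets_minimum(dpi, minimum=300):
    return dpi >= minimum


# US Letter is 612 x 792 points (8.5 x 11 inches).
good = effective_dpi(2550, 3300, 612, 792)
poor = effective_dpi(1275, 1650, 612, 792)
print(good, meets_minimum(good))
print(poor, meets_minimum(poor))
```

Pages that fall below the minimum are good candidates for rescanning where possible, or for routing to human review where it is not.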
If you control the scanning process, invest in quality upfront.
Verify After Redaction
Automated redaction is highly accurate but not perfect. For sensitive documents:
- Open the redacted PDF
- Spot-check key sections
- Verify black boxes appear over sensitive areas
- Try to select or copy text in redacted areas (should fail)
- Check document metadata for residual information
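Part of this checklist can be automated: extract text from the redacted output (including a fresh OCR pass over the page images) and confirm that neither the known redacted values nor the detection patterns still match. The sketch below covers only the comparison step; `residual_findings` is a hypothetical helper and the extraction itself is left to your tooling.

```python
import re


def residual_findings(extracted_text, redacted_values, patterns):
    """Return anything sensitive that is still readable in the redacted output."""
    findings = []
    lowered = extracted_text.lower()
    for value in redacted_values:
        if value.lower() in lowered:
            findings.append(("value", value))
    for label, pattern in patterns.items():
        for match in pattern.finditer(extracted_text):
            findings.append((label, match.group()))
    return findings


patterns = {"phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b")}
output_text = "Name: Phone: 555-867-5309"  # text re-extracted from the redacted PDF
print(residual_findings(output_text, ["John Smith"], patterns))
```

An empty result is necessary but not sufficient. It shows nothing readable survived in the text the check could extract; it does not replace the manual spot-check or the metadata review.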
Keep Your Originals
Store unredacted originals in a secure location:
- Redaction is irreversible
- You may need originals for legal or business purposes
- Audit requirements may mandate original retention
Never redact your only copy.
Document Everything
For compliance, maintain records of:
- What documents were redacted
- What detection criteria were applied
- What PII was found and removed
- When processing occurred
- Who initiated the redaction
Redaction manifests from tools like PaperVeil provide this automatically.
Test Before Production
Before processing sensitive documents:
- Run test documents through your workflow
- Verify OCR accuracy on your document types
- Confirm PII detection catches relevant patterns
- Check that redaction is truly permanent
- Validate output format meets downstream requirements
Scanned PDF Redaction for AI Workflows
A common use case for scanned PDF redaction is preparing documents for AI processing:
Workflow Pattern
Scanned documents arrive (email, upload, integration)
↓
[Document Analysis]
Detect native vs. image content
↓
[OCR Processing] (for image content)
Extract text with coordinates
↓
[PII Detection]
Find sensitive data in extracted text
↓
[Image Modification]
Burn redactions into source images
↓
[Sanitized PDF Output]
Ready for AI processing
↓
[LLM Analysis]
Summarize, extract, classify
Automation Example
Trigger: New email with PDF attachment
↓
Action: Send to redaction API (handles OCR automatically)
↓
Action: Send redacted PDF to Claude API
↓
Action: Deliver AI analysis to user
The redaction layer handles all document types. No branching logic for scanned versus native.
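In code, the important property of this flow is ordering: the model only ever receives what the redaction step returned, and if redaction fails nothing is forwarded. The sketch below uses injected callables so it stays independent of any vendor. `redact` and `analyze` are hypothetical stand-ins for a redaction API client and an LLM client, not real library calls.

```python
from typing import Callable


def process_attachment(
    pdf_bytes: bytes,
    redact: Callable[[bytes], bytes],
    analyze: Callable[[bytes], str],
) -> str:
    """Redact first, then analyze. If redact() raises, analyze() is never called."""
    redacted = redact(pdf_bytes)
    return analyze(redacted)


# Stand-in implementations for demonstration only.
seen_by_model = []


def fake_redact(data: bytes) -> bytes:
    return data.replace(b"123-45-6789", b"[REDACTED]")


def fake_analyze(data: bytes) -> str:
    seen_by_model.append(data)
    return f"analyzed {len(data)} bytes"


result = process_attachment(b"SSN 123-45-6789", fake_redact, fake_analyze)
print(result, seen_by_model)
```

Keeping the unredacted bytes out of the `analyze` call's reach by construction is easier to audit than a pipeline where both versions are passed around and the right one is chosen by convention.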
The Bottom Line
Scanned PDFs are a significant portion of business documents. They require special handling for redaction because standard tools designed for native PDFs don't work on images.
Effective scanned PDF redaction requires OCR to extract text, PII detection on the extracted text with coordinates, image modification to permanently remove sensitive content, and mixed content handling for real-world documents.
The technology exists to handle this automatically. Tools like PaperVeil combine OCR, detection, and redaction in a single workflow, processing scanned, native, and mixed PDFs without separate steps for each type.
For organizations preparing documents for AI processing, automated scanned PDF redaction eliminates a significant barrier. Documents that couldn't be processed before (because they were "just images") become accessible while sensitive data stays protected.
The quality of your source scans determines the ceiling on redaction quality. But modern OCR handles most business documents effectively, and PII detection that uses fuzzy matching and pattern variants can still catch sensitive data when OCR isn't perfect.
Don't let scanned PDFs block your document workflows. The right tools make image-based content as processable as native text.
PaperVeil lets you redact sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. The redaction layer that makes AI document processing actually safe.