In 2009, the Transportation Security Administration published a "redacted" airport security manual. They'd drawn black boxes over the sensitive parts using what looked like the highlighting tool in their PDF editor.
Someone copied and pasted the text. All of it. The full content was sitting right there under the black boxes. Sensitive security procedures for every US airport became public because someone thought "black box over text" equals "redacted."
In 2014, a law firm filed a court document with "redacted" financial figures. Same technique: black highlighting. Opposing counsel extracted the hidden numbers in about thirty seconds.
This happens constantly because most people don't understand what redaction actually means. They think they've removed sensitive information. They've actually just covered it with a digital sticker that anyone can peel off.
If you're about to upload a document to ChatGPT or Claude, and you think you've "redacted" it by drawing black boxes, you might want to keep reading.
The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains how real redaction works and where it fits in an AI document workflow.
What Redaction Actually Means
Let me be very clear about something: there's a difference between hiding content and removing it.
Redaction is NOT:
- Drawing black boxes over text with annotation tools
- Changing text color to match the background
- Covering text with images or shapes
- Using the highlighter tool in black
All of these methods leave the original text sitting in the PDF file. The visual layer hides it from your eyes. The data layer keeps it perfectly intact. Anyone with basic PDF knowledge can recover it.
Redaction IS:
- Permanently removing text and image content from the PDF
- Replacing removed content with solid fills (the black boxes you see)
- Eliminating the underlying data from the file structure
- Irreversible: the original content cannot be recovered
True redaction modifies the PDF itself, not just how it looks. When done correctly, the sensitive data is gone. Not hidden. Gone.
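To make the difference concrete, here is a minimal sketch in Python. It assumes a toy, uncompressed page content stream (real PDFs usually compress their streams and encode text in more complicated ways), and the names `FAKE_REDACTED`, `TRULY_REDACTED`, and `extract_shown_text` are made up for illustration. The only point is that a black rectangle drawn after the text does nothing to the text operator itself.

```python
import re

# Toy content stream: one line of text, then a filled black rectangle drawn
# over the same area. This is what "fake redaction" looks like in the file.
FAKE_REDACTED = (
    b"BT /F1 12 Tf 72 720 Td (SSN: 123-45-6789) Tj ET\n"
    b"0 g 70 715 160 16 re f\n"
)

# The same page after true redaction: the text-showing operator is gone and
# only the black rectangle remains.
TRULY_REDACTED = b"0 g 70 715 160 16 re f\n"

# Matches literal strings shown with the Tj operator, e.g. "(hello) Tj".
TEXT_SHOW = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def extract_shown_text(content_stream: bytes) -> list[str]:
    """Return every literal string passed to Tj in a content stream."""
    return [m.group(1).decode("latin-1") for m in TEXT_SHOW.finditer(content_stream)]


print(extract_shown_text(FAKE_REDACTED))
print(extract_shown_text(TRULY_REDACTED))
```

On the covered version the function still returns the SSN. On the truly redacted version there is nothing left to return.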
Why This Matters for AI
When you upload a document to ChatGPT, Claude, or any LLM:
- The document leaves your network
- It's transmitted to the AI provider's servers
- It may be stored (at least temporarily)
- It might be used for training (depending on settings)
- You can't take it back
For documents containing Social Security numbers, client names, medical information, or confidential business data, this creates real problems:
- HIPAA violations for health data
- GDPR violations for EU personal data
- Potential privilege waiver for legal content
- Breach of confidentiality for client information
- Competitive exposure for proprietary content
If you think you've redacted a document but you actually just drew boxes on it, you've sent all that sensitive data to an external server with a false sense of security.
Method 1: Adobe Acrobat Pro (The Way Most People Do It)
Adobe Acrobat Pro DC includes dedicated redaction tools. Here's how to use them correctly.
Step 1: Open the Redaction Tool
- Open your PDF in Acrobat Pro DC
- Go to Tools → Redact
- The redaction toolbar appears at the top
Step 2: Mark Content for Redaction
For text:
- Click "Mark for Redaction"
- Click and drag to select text
- Selected content gets a red overlay (not yet redacted)
For images:
- Click "Mark for Redaction"
- Draw a rectangle around image areas
- Entire regions within the rectangle will be removed
For patterns like SSN or phone numbers:
- Click "Mark for Redaction" → "Find Text"
- Use "Patterns" to find SSNs, phone numbers, email addresses
- Select all matches and mark for redaction
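If you're curious what a pattern search like this boils down to, here is a rough Python sketch. These are not Acrobat's actual patterns. They are simplified, US-style regular expressions I'm assuming for illustration, and `PATTERNS` and `find_matches` are hypothetical names.

```python
import re

# Simplified patterns for US-style formats. Real documents need broader
# coverage (international numbers, unusual separators, OCR noise).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def find_matches(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = "Call (555) 123-4567 or 555-987-6543, email jane.doe@example.com, SSN 123-45-6789."
for name, hits in find_matches(sample).items():
    print(name, hits)
```

Patterns this simple will miss unusual formats, which is one reason to review the matches before you apply anything.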
Step 3: Apply Redaction
- Click "Apply" in the toolbar
- Acrobat warns that this action is permanent
- Confirm to permanently remove marked content
- Save the file with a new name (keep your original)
Step 4: Don't Skip This Part
Remove Hidden Information. This is where most people fail.
- Go to Tools → Redact → "Remove Hidden Information"
- This strips metadata, hidden layers, comments, and embedded data
- Without this step, sensitive info might still be lurking in the file structure
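A quick way to see what "lurking in the file structure" means is to look for document-information entries in the raw bytes. The sketch below is an assumption-heavy check, not a replacement for Acrobat's cleanup: it only catches uncompressed entries written as literal strings, and it will miss XMP metadata, hex-encoded strings, and anything stored in compressed object streams. `find_info_entries` is a hypothetical helper name.

```python
import re

# Matches simple Info-dictionary entries such as "/Author (Jane Doe)".
INFO_ENTRY = re.compile(
    rb"/(Author|Creator|Producer|Title|Subject|Keywords)\s*\(((?:[^()\\]|\\.)*)\)"
)


def find_info_entries(pdf_bytes: bytes) -> dict[str, str]:
    """Return literal-string metadata entries found in raw PDF bytes."""
    return {
        m.group(1).decode("ascii"): m.group(2).decode("latin-1")
        for m in INFO_ENTRY.finditer(pdf_bytes)
    }


# A made-up fragment of a PDF file for demonstration.
sample = b"1 0 obj\n<< /Author (Jane Doe) /Producer (Acme Scanner 2.1) >>\nendobj\n"
print(find_info_entries(sample))
```

If a check like this finds an author or organization name after you've "finished" redacting, the cleanup step was skipped.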
The Problems with Acrobat
While Acrobat Pro is the industry standard, it has real usability issues:
It's painfully slow. Each file must be opened, marked, applied, cleaned, and saved individually. Processing 20 documents takes an hour of clicking.
Pattern detection is limited. Built-in patterns cover SSN, phone, and email. Custom patterns require regex knowledge that most people don't have.
No batch processing. You can't select a folder of PDFs and redact them automatically. Every document needs manual attention.
Scanned PDFs need OCR first. For image-based PDFs, you must run text recognition, then redact, hoping the OCR caught everything.
Cost. Acrobat Pro DC runs $22.99/month. For occasional use, that's expensive.
The interface is confusing. First-time users regularly use annotation tools instead of redaction tools and end up with fake redaction.
Method 2: Free Online Tools (Please Don't)
Several online services offer free PDF redaction: Smallpdf, PDF24, Sejda.
Here's the problem: you're uploading sensitive documents to a third party in order to protect them from third parties.
When you upload a PDF to Smallpdf for redaction, that document travels to their servers. You're trusting their data handling before you've even removed the sensitive information.
For documents sensitive enough to need redaction before AI processing, uploading the un-redacted version to a different cloud service defeats the entire purpose.
Also:
- Quality varies wildly. Some don't do true redaction, just visual overlay.
- File size limits on free tiers
- No pattern detection. Manual selection only.
- No way to verify the redaction actually worked
If your document truly contains sensitive information, free online tools aren't the answer.
Method 3: Desktop Applications
Several desktop apps handle redaction without cloud uploads:
PDF-XChange Editor: Windows, one-time purchase ($56), includes redaction tools.
Foxit PDF Editor: Cross-platform, $149/year subscription, enterprise features.
LibreOffice Draw: Free and open source. Warning: NOT true redaction. The underlying text remains.
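Preview (Mac): Built into macOS. Warning: covering text with shapes or markup in Preview is NOT true redaction. Newer macOS versions add a separate Redact tool, but verify its output before you trust it.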
The desktop approach keeps files local but typically has the same usability problems as Acrobat: manual selection, no batch processing, limited pattern detection.
Method 4: Automated Redaction (The Modern Approach)
The limitations of traditional redaction become unworkable when you're processing documents regularly for AI workflows.
Automated redaction tools take a different approach:
- Upload the document (single file or batch)
- Select what to detect (PII types, patterns, custom terms)
- Execute redaction (automatic detection and removal)
- Download clean file (ready for AI processing)
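The batch part of that flow can be sketched in a few lines of Python. The `redact` argument is a stand-in for whatever tool actually does the redaction (I'm assuming it takes PDF bytes and returns redacted PDF bytes), and `redact_folder` is a hypothetical name, not any product's API.

```python
from pathlib import Path
from typing import Callable


def redact_folder(src: Path, dst: Path, redact: Callable[[bytes], bytes]) -> list[Path]:
    """Run `redact` on every PDF in `src` and write the results to `dst`.

    Originals are never modified. Keep `dst` separate from `src` so that
    already-redacted files are not picked up again on the next run.
    """
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / f"{pdf.stem}.redacted.pdf"
        out.write_bytes(redact(pdf.read_bytes()))
        written.append(out)
    return written
```

Writing to a separate folder with a new file name keeps the originals intact, which matters because true redaction cannot be undone.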
What Automated Detection Covers
PII Categories:
- Person names (detected through machine learning models)
- Email addresses
- Phone numbers (multiple formats)
- Social Security Numbers
- Credit card numbers
- Street addresses
- Dates of birth
Pattern Matching:
- Custom regex patterns (your account number formats, internal IDs)
- Company names and logos
- Specific terms or phrases you define
Mixed Content:
- Text layers in native PDFs
- OCR for scanned documents
- Text embedded in images
- Handwritten content (with limitations)
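Credit card numbers are a good example of why detection is more than a regex. A run of 16 digits could be an order number. One common way to cut false positives is the Luhn checksum, sketched below in Python. I'm not claiming this is how any particular tool works, and the function names are my own.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit runs that look like card numbers and pass Luhn."""
    hits = []
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):
            hits.append(m.group())
    return hits


print(find_card_numbers("Card 4111 1111 1111 1111 on file, ref 1234 5678 9012 3456."))
```

The first number is a widely used test card number that passes the checksum. The second is just digits, so it gets left alone.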
PaperVeil: Built for This
PaperVeil is a redaction tool designed specifically for preparing documents for LLM processing.
The interface is straightforward:
- Upload your PDF. Drag and drop or click to select.
- Choose what to redact:
  - Toggle PII types: Person Name, Email Address, Phone Number, Social Security Number, Credit Card Number, Street Address, Date of Birth
  - Add custom text or regex patterns
  - Specify logo text to remove (company names in headers/footers)
- Execute Redaction. Click the button.
- Review the Output Manifest. See exactly what was detected and removed.
- Download the clean PDF. Ready for ChatGPT, Claude, or any LLM.
Why This Works Better
Speed: What takes 30 minutes manually takes 30 seconds automatically. Upload, click, done.
Consistency: Every document processed the same way. No "I forgot to check that section" errors.
Coverage: Automated detection catches things humans miss. The SSN in the footer. The email in the signature block. The phone number in the scanned letterhead.
Auditability: The output manifest shows exactly what was found and removed. You have a record for compliance.
Mixed media: Scanned documents, image-based PDFs, and documents with embedded graphics all get processed correctly.
Choosing the Right Method
| Situation | Recommended Approach |
|---|---|
| One-off document, minimal sensitive data | Adobe Acrobat Pro (if you have it) |
| Occasional use, budget-conscious | Desktop app like PDF-XChange |
| Regular AI workflow, multiple documents | Automated tool (PaperVeil) |
| Highly sensitive documents | Automated + manual review |
| Documents with scanned content | Automated with OCR capability |
For most people preparing documents for AI analysis, automated redaction provides the best balance of security, speed, and reliability.
The Complete Workflow: Redact, Then AI
Let me walk through the complete process:
Step 1: Assess Your Document
Before redacting, identify what needs to go:
- Personal information: Names, contact details, IDs
- Financial data: Account numbers, amounts (if sensitive)
- Company identifiers: Names of parties in contracts
- Dates: If they identify specific individuals
- Custom data: Industry-specific identifiers
Step 2: Choose Detection Settings
For a typical contract going to AI for summarization:
Enable:
- Person Name
- Email Address
- Phone Number
- Street Address
Custom patterns:
- Company names in the agreement
- Case numbers or reference IDs
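Written down as configuration, those choices might look something like the Python sketch below. Everything here is hypothetical: the `DetectionSettings` class, the PII type names, and the `CASE-2023-00417` style of case number are assumptions for illustration, not any tool's real settings format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DetectionSettings:
    pii_types: set[str] = field(default_factory=set)
    custom_terms: list[str] = field(default_factory=list)    # matched literally
    custom_regexes: list[str] = field(default_factory=list)  # matched as regex

    def compiled_patterns(self) -> list[re.Pattern]:
        """Compile literal terms (case-insensitive) and raw regexes."""
        patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in self.custom_terms]
        patterns += [re.compile(rx) for rx in self.custom_regexes]
        return patterns


contract_settings = DetectionSettings(
    pii_types={"person_name", "email_address", "phone_number", "street_address"},
    custom_terms=["Acme Corporation"],
    custom_regexes=[r"\bCASE-\d{4}-\d{5}\b"],  # hypothetical case-number format
)
```

Escaping the literal terms means a company name containing a period or parenthesis won't be misread as regex syntax.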
Step 3: Process the Document
Upload to your redaction tool and run detection. Review what was found:
Detection Results:
- 12 person names found
- 3 email addresses found
- 2 phone numbers found
- 4 street addresses found
- "Acme Corporation" found 8 times
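A summary like this is easy to build from a list of detections. The Python sketch below assumes a hypothetical `Detection` record and `build_manifest` helper. It is not PaperVeil's actual manifest format. Note that the summary keeps counts and page numbers only, so the manifest doesn't become a second copy of the sensitive data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    kind: str   # e.g. "person_name"
    page: int   # 1-based page number
    text: str   # the matched value (kept out of the manifest)


def build_manifest(detections: list[Detection]) -> dict[str, dict]:
    """Summarize detections as a count and sorted page list per kind."""
    grouped: dict[str, dict] = {}
    for d in detections:
        entry = grouped.setdefault(d.kind, {"count": 0, "pages": set()})
        entry["count"] += 1
        entry["pages"].add(d.page)
    return {
        kind: {"count": e["count"], "pages": sorted(e["pages"])}
        for kind, e in grouped.items()
    }


found = [
    Detection("person_name", 1, "Jane Doe"),
    Detection("person_name", 3, "John Roe"),
    Detection("email_address", 1, "jane@example.com"),
]
print(build_manifest(found))
```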
Step 4: Verify the Output
Open the redacted PDF and spot-check:
- Are black boxes where expected?
- Does the document still make sense?
- Is context preserved for AI analysis?
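Spot-checking by eye is worth backing up with a text check. The Python sketch below assumes you've already extracted the text of the redacted PDF with some other tool (copy and paste works for a quick test) and that you have a list of values that should be gone. `find_leaks` is a hypothetical helper name.

```python
def find_leaks(extracted_text: str, sensitive_values: list[str]) -> list[str]:
    """Return the sensitive values that still appear in the extracted text.

    Whitespace and case are normalized so that a name broken across a line
    ("Jane\\nDoe") still counts as a leak.
    """
    haystack = " ".join(extracted_text.split()).casefold()
    leaks = []
    for value in sensitive_values:
        needle = " ".join(value.split()).casefold()
        if needle and needle in haystack:
            leaks.append(value)
    return leaks


text_after_redaction = "Agreement between [COMPANY] and Jane\nDoe, dated [DATE]."
print(find_leaks(text_after_redaction, ["Acme Corporation", "Jane Doe"]))
```

An empty result is necessary but not sufficient. This check only sees extractable text, so scanned pages and images still need a visual review.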
Step 5: Upload to AI
Your document is now safe for LLM processing. The AI receives something like:
"Agreement between [COMPANY] and [PERSON], dated [DATE], for provision of consulting services at [ADDRESS]. Payment terms: Net 30. Total contract value: $50,000..."
The LLM can summarize the agreement, extract key terms, answer questions about content, and compare against other documents.
But it can't identify the parties, link to real individuals, or expose confidential relationships.
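Typed placeholders like `[COMPANY]` and `[PERSON]` are what keep the document readable for the model. Here is a minimal Python sketch of that substitution on already-extracted text, assuming a detector has produced non-overlapping character spans. The `Span` record and `apply_placeholders` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first character to replace
    end: int    # index one past the last character to replace
    label: str  # e.g. "PERSON"


def apply_placeholders(text: str, spans: list[Span]) -> str:
    """Replace each span with "[LABEL]". Spans must not overlap."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + f"[{s.label}]" + out[s.end:]
    return out


text = "Agreement between Acme Corporation and Jane Doe."
spans = [
    Span(text.index("Acme Corporation"), text.index("Acme Corporation") + 16, "COMPANY"),
    Span(text.index("Jane Doe"), text.index("Jane Doe") + 8, "PERSON"),
]
print(apply_placeholders(text, spans))
```

Replacing from the end of the string backward is the design choice that matters here: each substitution changes the length of the text, and working in reverse means the remaining offsets don't shift.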
Mistakes to Avoid
Using Annotation Instead of Redaction
Drawing black boxes with comment tools doesn't remove underlying text. Always use dedicated redaction features and verify by trying to select text under the boxes.
Forgetting Metadata
PDFs contain author names, organization info, creation dates, and revision history. Run "Remove Hidden Information" after redacting visible content.
Missing Headers and Footers
Letterhead, page footers, and running headers often contain company names and contact info. Easy to overlook when focusing on body text.
Ignoring Image-Based Content
Scanned documents and embedded images require OCR-based redaction. Standard text redaction won't touch them.
Over-Redacting
Removing too much makes documents useless for analysis. Redact identifying information, but keep context the AI needs.
Not Keeping Originals
Always save the original unredacted document securely. Redaction is irreversible. You may need originals for legal or business purposes.
Integration with AI Workflows
For organizations processing documents regularly, redaction should be part of the pipeline:
Document Received (email, upload, API)
↓
[Automated Redaction]
↓
Sanitized Document
↓
[LLM Processing]
- Summarization
- Extraction
- Classification
↓
Results Delivered
This can be automated with tools like n8n or Zapier:
- Trigger: New email with PDF attachment
- Action 1: Send PDF to PaperVeil API for redaction
- Action 2: Send redacted PDF to OpenAI/Claude for analysis
- Action 3: Deliver results to user or downstream system
The manual step of opening each PDF and redacting disappears. Documents flow through the pipeline automatically.
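In code, the shape of that pipeline is simple. The Python sketch below uses placeholder callables rather than any real API: `redact`, `analyze`, and `deliver` are stand-ins you would wire to your own redaction service, LLM client, and delivery step. The one design choice worth copying is that the pipeline fails closed. If redaction raises an error, nothing is sent to the LLM.

```python
from typing import Callable


def process_document(
    pdf_bytes: bytes,
    redact: Callable[[bytes], bytes],
    analyze: Callable[[bytes], str],
    deliver: Callable[[str], None],
) -> str:
    """Redact first, then analyze, then deliver the result.

    The original bytes are never passed to `analyze`. Any exception from
    `redact` propagates before the analysis step runs.
    """
    sanitized = redact(pdf_bytes)
    result = analyze(sanitized)
    deliver(result)
    return result


# Stub steps for demonstration only.
summary = process_document(
    b"secret plan",
    redact=lambda data: data.replace(b"secret", b"[X]"),
    analyze=lambda data: f"summary of {len(data)} bytes",
    deliver=print,
)
```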
The Bottom Line
PDF redaction isn't complicated once you understand the fundamentals:
- True redaction removes data, not just hides it
- Traditional tools work but require manual effort for each document
- Automated solutions handle detection, batch processing, and mixed content
- The workflow matters because redaction should be a step in your pipeline, not a separate task
For occasional use, Adobe Acrobat Pro gets the job done. For regular AI document processing, automated redaction saves hours while reducing the risk of missed sensitive data.
Your documents have information AI can help with. With proper redaction, that information flows to the AI while the sensitive details stay protected.
PaperVeil lets you redact sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. It's the redaction layer that makes AI document processing actually safe.