A few months ago, I helped a small healthcare company audit their document workflows. They'd been uploading patient intake forms to a cloud service for processing. The forms were PDFs, mostly scanned paper.
"We check them manually," the office manager told me. "We look through each one before uploading."
I asked her to show me the last ten documents they'd processed. On page 15 of document four, buried in a scanned table that looked like boilerplate, there was a Social Security Number. In document seven, the header on every page contained the patient's full name and date of birth. She'd been checking the main content of each form. The headers, footers, and scanned fine print weren't even registering.
This is the problem with manual sensitive data detection: humans aren't built for it. We focus on what looks important. We skim tables. We miss the SSN in paragraph 47. We don't recognize that the string of numbers on the scanned attachment is a credit card number. We're pattern recognition machines, but we're pattern recognition machines tuned for threats and faces, not nine-digit numbers.
Sensitive data detection is the automated identification of PII, confidential information, and regulated content in documents. It's the foundation of document security. You can't protect what you can't find.
Let me show you how this actually works.
The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains where it fits in the broader governance architecture.
What Counts as Sensitive Data
Sensitive data falls into several categories, each requiring different detection approaches.
Personally Identifiable Information (PII)
Information that identifies specific individuals:
Direct identifiers (highest risk):
- Full legal names
- Social Security numbers
- Driver's license numbers
- Passport numbers
- National identification numbers
- Biometric data
Contact information:
- Email addresses
- Phone numbers
- Physical and mailing addresses
- Social media identifiers
Financial identifiers:
- Bank account numbers
- Credit and debit card numbers
- Financial account identifiers
- Tax identification numbers
Demographic information:
- Date of birth and age
- Gender
- Race and ethnicity
- Medical information
- Religious affiliation
Regulated Data
Data governed by specific regulations:
HIPAA (US Healthcare): 18 specific identifiers including name, dates, contact info, SSN, medical record numbers, health plan numbers, and photos.
GDPR (EU): Any data relating to identified or identifiable persons. Special categories include health, biometric, genetic, racial/ethnic, political, religious, and sexual orientation data.
PCI DSS (Payment Cards): Primary account numbers (PAN), cardholder name when combined with PAN, service codes, expiration dates, and sensitive authentication data.
CCPA (California): Identifiers, commercial information, internet activity, geolocation, professional and employment information, education information.
Business Confidential Data
Organizational information requiring protection:
- Trade secrets and intellectual property
- Financial results and projections
- Strategic plans and analyses
- Customer lists and pricing
- Employee compensation data
- M&A activity
- Legal matters and privileged communications
How Automated Detection Actually Works
Modern sensitive data detection combines multiple techniques. No single approach catches everything.
Named Entity Recognition (NER)
Machine learning models trained to identify entity types in text.
How it works:
- Text is tokenized into words and phrases
- Model analyzes context around each token
- Tokens are classified by entity type
- Confidence scores indicate how sure the model is
Entity types detected:
- PERSON: "John Smith," "Dr. Maria Garcia"
- ORGANIZATION: "Acme Corporation," "Stanford University"
- LOCATION: "123 Main Street," "New York, NY"
- DATE: "March 15, 2024," "last Tuesday"
- MONEY: "$50,000," "fifteen thousand dollars"
Example:
Input: "Contract between Acme Corp and John Smith dated January 15, 2024"
Output:
- "Acme Corp" → ORGANIZATION
- "John Smith" → PERSON
- "January 15, 2024" → DATE
NER models are trained on large volumes of text and can recognize entities even in unusual contexts or phrasings. A well-trained model will typically catch "Smith, John," "J. Smith," and "JOHN SMITH" as PERSON entities, though accuracy varies with the model and the document type.
Pattern Matching (Regular Expressions)
Regex patterns detect structured data formats:
Social Security Numbers:
Pattern: \b\d{3}-\d{2}-\d{4}\b
Matches: 123-45-6789
Also: \b\d{9}\b (matches 123456789 without dashes, but also any other nine-digit number, so pair it with validation or context)
Credit Card Numbers:
Pattern: \b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
Matches: 4532-0123-4567-8901, 4532 0123 4567 8901
Email Addresses:
Pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Matches: addresses such as user@example.com
Phone Numbers:
Pattern: (?<!\w)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b
Matches: (555) 123-4567, 555-123-4567, 555.123.4567
Pattern matching catches structured data regardless of context. It doesn't care what the document is about. It just finds things that look like SSNs, credit cards, and phone numbers.
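Here is a minimal sketch of the pattern pass using Python's standard `re` module. The patterns follow the ones listed above; the phone pattern starts with a lookbehind rather than `\b` so that an opening parenthesis is included in the match. The `PATTERNS` dictionary and `find_patterns` function are names I made up for illustration.

```python
import re

# Patterns follow the article; names and structure are illustrative only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(r"(?<!\w)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}


def find_patterns(text):
    """Return (type, matched_text, start, end) tuples sorted by position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])


sample = "SSN: 123-45-6789, call (555) 123-4567 or write to jane@example.com"
for hit in find_patterns(sample):
    print(hit)
```

Keeping the start and end offsets matters later: they are what lets you map a match back to a position on the page.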
Checksum Validation
Some identifiers carry check digits or structural rules that let you rule out impossible values:
Credit cards (Luhn algorithm):
4532-0123-4567-8901
Starting from the rightmost digit, double every second digit, subtract 9 from any result above 9, then sum all the digits
If the sum mod 10 = 0, the number is a structurally valid card number
Social Security Numbers (no check digit, but some ranges are never issued):
- Area numbers 000, 666, and 900-999 are invalid
- Group number 00 is invalid
- Serial number 0000 is invalid
Checksum and structural validation reduce false positives by confirming that a detected pattern is at least a well-formed identifier, not just a random sequence of digits that happens to match the format. Passing the check does not prove the number was ever issued.
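Both checks fit in a few lines of Python. This is a sketch, not a complete validator: the function names are mine, and the 13-digit minimum length for card numbers is an assumption added to avoid testing short digit runs.

```python
import re


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # assumption: shorter runs are not card numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(ssn: str) -> bool:
    """Apply the structural SSN rules: area, group, and serial ranges."""
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4532-0123-4567-8901"))
print(ssn_plausible("123-45-6789"))
```

The widely used test number 4111 1111 1111 1111 passes the Luhn check. The format example 4532-0123-4567-8901 matches the credit card regex but fails Luhn, which is exactly the kind of lookalike this step filters out.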
Contextual Analysis
Advanced detection analyzes surrounding text:
Without context: "123-45-6789" could be an SSN, a phone extension, or a case number.
With context:
- "SSN: 123-45-6789" → Confirmed Social Security Number
- "Ext. 123-45-6789" → Probably a phone extension
- "Case #123-45-6789" → Case number, not PII
Context clues include labels and headers ("Social Security Number:"), document type (tax form vs. phone directory), surrounding content patterns, and field names in structured documents.
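A simple way to picture label-based context is a function that looks at the text just before a candidate and nudges the confidence up or down. The label lists, the 30-character window, and the score adjustments below are all assumptions chosen for illustration; a production system would tune them or learn them from data.

```python
import re

# Hypothetical label lists; extend them for your own documents.
BOOST = re.compile(
    r"\b(ssn|social security(\s+(number|no\.?|#))?)\W*$", re.IGNORECASE
)
SUPPRESS = re.compile(r"\b(ext\.?|case|order|ref\.?)\W*$", re.IGNORECASE)


def score_ssn_candidate(text, start, base=0.6, window=30):
    """Adjust a base confidence using the label that precedes the match."""
    prefix = text[max(0, start - window):start]
    if BOOST.search(prefix):
        return min(1.0, base + 0.39)   # label says this is an SSN
    if SUPPRESS.search(prefix):
        return max(0.0, base - 0.5)    # label points to something else
    return base                        # no label: leave the score alone


for line in ["SSN: 123-45-6789", "Ext. 123-45-6789", "Case #123-45-6789"]:
    print(line, score_ssn_candidate(line, line.index("123")))
```

Only the immediately preceding label is considered here. Document type and field names, the other clues mentioned above, would be additional inputs to the same scoring step.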
OCR for Image Content
Documents often contain images with text: scanned pages, embedded photos, screenshots, charts and diagrams with labels.
Detection requires:
- OCR extraction: Convert image pixels to text
- Coordinate mapping: Track where text appears in the image
- Standard detection: Run NER and patterns on extracted text
- Location reference: Map findings back to image coordinates
Modern OCR engines like Tesseract, Google Cloud Vision, and Amazon Textract achieve high accuracy on clean scans. They struggle with low-resolution images, unusual fonts, handwritten content, and skewed or distorted text.
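The coordinate mapping step is easier to see in code. The sketch below assumes the OCR engine has already produced word-level boxes; the `Word` class and its fields are hypothetical stand-ins, since each engine exposes this data in its own format. The idea is to record each word's character offsets in the joined text, run detection on that text, then merge the boxes of the words a match touches.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word box: text plus pixel position and size."""
    text: str
    x: int
    y: int
    w: int
    h: int


SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def locate_matches(words, pattern):
    """Run a pattern over joined OCR text; return (match, bounding box) pairs."""
    spans, parts, pos = [], [], 0
    for word in words:
        spans.append((pos, pos + len(word.text), word))
        parts.append(word.text)
        pos += len(word.text) + 1  # +1 for the joining space
    text = " ".join(parts)

    results = []
    for m in pattern.finditer(text):
        # Words whose character span overlaps the match span
        hit = [w for s, e, w in spans if s < m.end() and e > m.start()]
        x0 = min(w.x for w in hit)
        y0 = min(w.y for w in hit)
        x1 = max(w.x + w.w for w in hit)
        y1 = max(w.y + w.h for w in hit)
        results.append((m.group(), {"x": x0, "y": y0, "w": x1 - x0, "h": y1 - y0}))
    return results


words = [Word("SSN:", 150, 560, 45, 20), Word("123-45-6789", 200, 560, 120, 20)]
print(locate_matches(words, SSN))
```

The returned box covers whole words, so a redaction built from it may cover slightly more than the matched characters. For redaction that is usually the safer direction to err in.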
Metadata Inspection
PDFs can contain hidden information beyond visible content:
Document properties:
- Author name and organization
- Creation and modification dates
- Software used
- Title and subject
Embedded content:
- Comments and annotations
- Previous versions
- Attached files
- Hidden layers
Detection must examine both visible content and document structure.
Building a Detection Pipeline
Here's how to implement sensitive data detection in practice.
Architecture
Document Input
↓
[Preprocessing]
├── PDF parsing
├── Text extraction
├── OCR for images
└── Metadata extraction
↓
[Detection Engine]
├── NER models
├── Pattern matching
├── Checksum validation
└── Contextual analysis
↓
[Output]
├── Detection results
├── Location coordinates
├── Confidence scores
└── Category classifications
Step 1: Document Ingestion
Accept documents from various sources: file uploads, email attachments, cloud storage like Google Drive or S3, document management systems.
Normalize to a common format for processing.
Step 2: Content Extraction
Extract all content from the document.
Native PDF text: Use a PDF parsing library to extract text objects. Preserve positioning for coordinate mapping.
Image content: Identify image regions in the document, run OCR to extract text, map text to image coordinates.
Metadata: Extract document properties, parse XML metadata streams, identify embedded objects.
Step 3: Detection Processing
Run detection algorithms on extracted content.
NER pass:
entities = ner_model.predict(text)
for entity in entities:
    # Keep only the entity types treated as potential PII
    if entity.type in ['PERSON', 'LOCATION', 'ORGANIZATION']:
        add_detection(entity, confidence=entity.score)
Pattern pass:
for pattern in [ssn_pattern, credit_card_pattern, email_pattern, phone_pattern]:
    for match in pattern.findall(text):
        # validate_checksum should pass types that have no validity rule (email, phone)
        if validate_checksum(match, type=pattern.type):
            add_detection(match, type=pattern.type)
Context enhancement:
for detection in detections:
    # Re-score and categorize each detection using the text around it
    context = get_surrounding_text(detection.location)
    detection.confidence = adjust_confidence(detection, context)
    detection.category = classify_context(detection, context)
Step 4: Result Aggregation
Compile detection results:
{
  "document": "contract.pdf",
  "pages": 5,
  "detections": [
    {
      "type": "PERSON",
      "value": "John Smith",
      "page": 1,
      "coordinates": {"x": 120, "y": 340, "w": 130, "h": 25},
      "confidence": 0.95,
      "context": "Contract between ABC Corp and John Smith"
    },
    {
      "type": "SSN",
      "value": "***-**-6789",
      "page": 2,
      "coordinates": {"x": 200, "y": 560, "w": 120, "h": 20},
      "confidence": 0.99,
      "context": "Social Security Number: ***-**-6789"
    }
  ],
  "summary": {
    "total_detections": 15,
    "by_type": {
      "PERSON": 5,
      "SSN": 1,
      "EMAIL": 3,
      "PHONE": 2,
      "ADDRESS": 4
    }
  }
}
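One detail in that output is easy to overlook: the SSN is masked in the report itself, so the detection log doesn't become a new copy of the sensitive data. A rough sketch of the aggregation step, with `mask_value` and `build_report` as illustrative names of my own:

```python
import json
from collections import Counter


def mask_value(kind, value):
    """Mask digits in all but the last four characters of high-risk values."""
    if kind in {"SSN", "CREDIT_CARD"}:
        head = "".join("*" if c.isdigit() else c for c in value[:-4])
        return head + value[-4:]
    return value


def build_report(document, pages, detections):
    """Assemble the report dict with masked values and per-type counts."""
    masked = [dict(d, value=mask_value(d["type"], d["value"])) for d in detections]
    by_type = Counter(d["type"] for d in detections)
    return {
        "document": document,
        "pages": pages,
        "detections": masked,
        "summary": {"total_detections": len(detections), "by_type": dict(by_type)},
    }


found = [
    {"type": "PERSON", "value": "John Smith", "page": 1, "confidence": 0.95},
    {"type": "SSN", "value": "123-45-6789", "page": 2, "confidence": 0.99},
]
print(json.dumps(build_report("contract.pdf", 5, found), indent=2))
```

If you also store a context string, as the example output does, it needs the same masking before it is written anywhere.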
Step 5: Action Routing
Based on detection results, route for appropriate handling:
IF high_risk_detections > 0:
    route_to_redaction_queue
ELIF moderate_risk_detections > 0:
    flag_for_review
ELSE:
    approve_for_processing
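In Python, the same routing rule might look like the sketch below. Which types count as high or moderate risk is a policy decision; the two sets and the `min_confidence` default here are assumptions for illustration.

```python
# Assumed risk tiers; adjust to your own policy.
HIGH_RISK = {"SSN", "CREDIT_CARD", "PASSPORT"}
MODERATE_RISK = {"PERSON", "EMAIL", "PHONE", "ADDRESS"}


def route(detections, min_confidence=0.5):
    """Return the queue a document belongs in, based on its detections."""
    kinds = {d["type"] for d in detections if d["confidence"] >= min_confidence}
    if kinds & HIGH_RISK:
        return "redaction_queue"
    if kinds & MODERATE_RISK:
        return "review"
    return "approved"


print(route([{"type": "SSN", "confidence": 0.99}]))
print(route([{"type": "PERSON", "confidence": 0.90}]))
print(route([]))
```

Note that a low-confidence detection falls through to "approved" in this version. For high-risk document types you may prefer to drop `min_confidence` to zero so that any possible SSN forces a redaction pass.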
Optimizing Detection Accuracy
Reducing False Positives
False positives flag non-sensitive data as sensitive, creating unnecessary work.
Common sources:
- Phone number patterns matching other number sequences
- Name patterns matching product names or titles
- SSN patterns matching case numbers or IDs
Mitigation strategies:
- Checksum validation for structured identifiers
- Context analysis to distinguish patterns
- Allowlists for known non-sensitive terms
- Confidence thresholds for borderline cases
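Two of these mitigations, allowlists and confidence thresholds, can be combined in a single filtering pass. The allowlist entries and threshold values below are invented for the example; the point is that thresholds can differ by type, with a lower bar for the identifiers you can least afford to miss.

```python
# Hypothetical allowlist of known non-sensitive terms (stored lowercase).
ALLOWLIST = {"acme corporation", "support@example.com"}

# Assumed per-type thresholds: stricter for names, looser for SSNs.
THRESHOLDS = {"PERSON": 0.80, "SSN": 0.50}
DEFAULT_THRESHOLD = 0.70


def filter_detections(detections):
    """Drop allowlisted values and detections below their type's threshold."""
    kept = []
    for d in detections:
        if d["value"].strip().lower() in ALLOWLIST:
            continue
        if d["confidence"] < THRESHOLDS.get(d["type"], DEFAULT_THRESHOLD):
            continue
        kept.append(d)
    return kept


candidates = [
    {"type": "ORGANIZATION", "value": "Acme Corporation", "confidence": 0.97},
    {"type": "PERSON", "value": "Main Street", "confidence": 0.55},
    {"type": "SSN", "value": "123-45-6789", "confidence": 0.60},
]
print(filter_detections(candidates))
```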
Reducing False Negatives
False negatives miss actual sensitive data. This is the more dangerous failure mode.
Common sources:
- Unusual formatting (SSN without dashes)
- OCR errors mangling patterns
- Uncommon name spellings
- Data split across lines or pages
Mitigation strategies:
- Multiple pattern variants for each data type
- Fuzzy matching for names
- Lower confidence thresholds for high-risk documents
- Human review for critical content
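To make "multiple pattern variants" concrete, here is a sketch of a looser SSN matcher that accepts dashes, spaces, dots, or no separator, and tolerates a few common OCR letter-for-digit confusions. The substitution table and the rule requiring at least six real digits are assumptions, not a standard. Looser matching raises the false positive rate, so its output should still go through validation and context scoring.

```python
import re

# Assumed, non-exhaustive table of OCR letter-for-digit confusions.
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

CH = "[0-9OolISB]"
SSN_LOOSE = re.compile(
    rf"(?<![\w-])({CH}{{3}})[-\s.]?({CH}{{2}})[-\s.]?({CH}{{4}})(?![\w-])"
)


def find_ssn_candidates(text):
    """Return (raw_match, normalized) pairs for loosely formatted SSNs."""
    out = []
    for m in SSN_LOOSE.finditer(text):
        raw = m.group()
        if sum(c.isdigit() for c in raw) < 6:  # mostly letters: likely a word
            continue
        digits = "".join(m.groups()).translate(OCR_FIXES)
        out.append((raw, f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"))
    return out


for sample in ["SSN 123 45 6789", "ID 123456789", "SSN: l23-4S-67B9"]:
    print(find_ssn_candidates(sample))
```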
Tuning for Your Documents
Different document types need different configurations.
Contracts:
- High sensitivity for names and organizations
- Custom patterns for party identifiers
- Lower priority for phone and email (often public contact info)
Medical records:
- All 18 HIPAA identifiers
- Medical record number patterns
- Conservative thresholds (miss nothing)
Financial documents:
- Account number patterns (custom to your formats)
- SSN and tax ID priority
- Amount thresholds for confidential figures
Create detection profiles for each document type.
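A detection profile can be as simple as a small configuration object per document type. The sketch below assumes a profile is made of enabled entity types, a default confidence threshold, and optional per-type overrides; the class, the type names, and the numbers are illustrative, not any particular product's schema.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionProfile:
    """Hypothetical per-document-type detection settings."""
    name: str
    enabled_types: frozenset
    default_threshold: float
    type_thresholds: dict = field(default_factory=dict)

    def accepts(self, kind, confidence):
        """True if this profile would keep a detection of this type and score."""
        if kind not in self.enabled_types:
            return False
        return confidence >= self.type_thresholds.get(kind, self.default_threshold)


PROFILES = {
    # Contracts: names matter most; phone and email need a higher score
    "contract": DetectionProfile(
        "contract",
        frozenset({"PERSON", "ORGANIZATION", "SSN", "EMAIL", "PHONE"}),
        default_threshold=0.7,
        type_thresholds={"EMAIL": 0.9, "PHONE": 0.9},
    ),
    # Medical records: conservative, keep almost everything
    "medical": DetectionProfile(
        "medical",
        frozenset({"PERSON", "SSN", "DATE", "PHONE", "EMAIL", "ADDRESS", "MRN"}),
        default_threshold=0.3,
    ),
}

print(PROFILES["medical"].accepts("PERSON", 0.4))
print(PROFILES["contract"].accepts("PHONE", 0.8))
```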
Using Detection for Redaction
Detection alone identifies sensitive data. The next step is acting on it.
Detection → Redaction Workflow
Document uploaded
↓
[Detection Engine]
Identifies all sensitive data instances
↓
[Review Interface] (optional)
User confirms or adjusts detections
↓
[Redaction Engine]
Removes confirmed sensitive data
↓
Sanitized document output
PaperVeil Implementation
PaperVeil combines detection and redaction in one tool.
Detection configuration:
- Toggle PII categories: Person Name, Email Address, Phone Number, SSN, Credit Card, Street Address, Date of Birth
- Add custom regex patterns
- Specify terms to detect (company names, logos)
Processing:
- Automatic text extraction and OCR
- Multi-technique detection (NER plus patterns plus context)
- Coordinate mapping for all detections
Output:
- Redacted PDF with detections removed
- Manifest showing what was found
- Audit trail for compliance
The detection layer powers the redaction capability. You can't redact what you haven't found.
Detection in Enterprise Workflows
Integration Points
Email gateway: Scan incoming attachments for sensitive data before delivery.
Document management: Classify documents by sensitivity at upload time.
Cloud storage: Monitor files for sensitive content, alert on policy violations.
AI preprocessing: Detect and redact before sending to LLM services.
Automation Patterns
Scan and alert:
Document uploaded to shared drive
↓
Detection service scans content
↓
IF sensitive_data_found:
    send_alert to security team
    apply_access_restrictions
Detect, redact, process:
Document arrives for AI processing
↓
Detection identifies PII
↓
Redaction removes detected items
↓
Clean document sent to LLM
Classification routing:
Document intake
↓
Detection determines sensitivity level
↓
ROUTE based on classification:
- Public → standard processing
- Internal → access logging
- Confidential → approval required
- Restricted → specialized handling
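Classification routing comes down to taking the highest sensitivity level among everything detected. In this sketch the four levels and their handling come from the list above, while the mapping from detection type to level is an assumption you would replace with your own policy. Unknown types default to Confidential so that the rule fails closed.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Assumed mapping from detection type to sensitivity level.
TYPE_LEVELS = {
    "EMAIL": Sensitivity.INTERNAL,
    "PHONE": Sensitivity.INTERNAL,
    "PERSON": Sensitivity.CONFIDENTIAL,
    "ADDRESS": Sensitivity.CONFIDENTIAL,
    "SSN": Sensitivity.RESTRICTED,
    "CREDIT_CARD": Sensitivity.RESTRICTED,
}

HANDLING = {
    Sensitivity.PUBLIC: "standard processing",
    Sensitivity.INTERNAL: "access logging",
    Sensitivity.CONFIDENTIAL: "approval required",
    Sensitivity.RESTRICTED: "specialized handling",
}


def classify(detected_types):
    """Return the highest sensitivity level among the detected types."""
    levels = [TYPE_LEVELS.get(t, Sensitivity.CONFIDENTIAL) for t in detected_types]
    return max(levels, default=Sensitivity.PUBLIC)


level = classify(["EMAIL", "SSN"])
print(level.name, "->", HANDLING[level])
```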
Measuring Detection Effectiveness
Track these metrics to ensure your detection is working:
Coverage Metrics
- Documents scanned vs. total document flow
- Detection categories enabled vs. required
- Document types covered vs. in scope
Accuracy Metrics
- False positive rate (from manual review sampling)
- False negative rate (from periodic deep audits)
- Confidence score distribution
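Given a manually reviewed sample, the two error rates can be computed by comparing what the detector flagged with what the reviewers found. This sketch assumes each finding is represented as a hashable tuple such as (document, page, type, value). "False positive share" here means the fraction of flagged items that were wrong, which is the figure a review sample can actually give you.

```python
def accuracy_metrics(predicted, actual):
    """Compare detector output with reviewer ground truth (both sets of tuples)."""
    tp = len(predicted & actual)   # flagged and truly sensitive
    fp = len(predicted - actual)   # flagged but not sensitive
    fn = len(actual - predicted)   # sensitive but missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "false_positive_share": fp / (tp + fp) if (tp + fp) else 0.0,
        "miss_rate": fn / (tp + fn) if (tp + fn) else 0.0,
    }


predicted = {("doc1", 1, "SSN"), ("doc1", 2, "PERSON"), ("doc2", 1, "EMAIL"), ("doc2", 3, "PHONE")}
actual = {("doc1", 1, "SSN"), ("doc1", 2, "PERSON"), ("doc2", 1, "EMAIL"),
          ("doc2", 4, "SSN"), ("doc3", 1, "PERSON")}
print(accuracy_metrics(predicted, actual))
```

The miss rate is the number to watch most closely, since missed data is the more dangerous failure mode and only shows up when someone audits documents the detector already passed.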
Performance Metrics
- Detection latency (time per document)
- Throughput (documents per hour)
- API availability and error rates
Business Metrics
- Sensitive documents identified before exposure
- Policy violations caught
- Compliance audit success rate
The Bottom Line
Sensitive data detection is the capability that makes data protection possible. You can't redact data you don't know about. You can't protect information you haven't identified. You can't comply with regulations when sensitive data flows through systems undetected.
Modern detection combines several techniques: Named Entity Recognition for names, organizations, and locations; pattern matching for structured identifiers like SSNs and credit card numbers; checksum validation to confirm identifier validity; contextual analysis to distinguish similar patterns; and OCR for text embedded in images.
Implemented well, automated detection catches what humans miss, processes at scale, and provides the foundation for redaction, classification, and access control.
For organizations preparing documents for AI processing, detection is the first step. Find the sensitive data, remove it, and only then send documents to external systems. This detect, redact, process workflow enables AI adoption without exposing sensitive data.
The technology exists. The question is whether you'll implement it proactively, or wait until an incident forces the conversation.
PaperVeil lets you redact sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. It's the redaction layer that makes AI document processing actually safe.