Implement automatic language detection for uploaded PDF documents and dynamically apply the appropriate locale settings to Adobe PDF Services API calls for optimal processing results across multiple languages.
🎯 Motivation
Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (en-US) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction capabilities for non-English documents, particularly for languages like Spanish, Catalan, French, German, and others that have specific linguistic rules and accessibility requirements.
✨ Features Implemented
1. Automatic Language Detection
AWS Comprehend Integration: Utilizes AWS Comprehend's DetectDominantLanguage API to analyze document content
Smart Text Sampling: Extracts text from the first 5 pages of the PDF for language analysis
Confidence Thresholding: Only applies detected language if confidence score ≥ 70%
Graceful Fallbacks: Defaults to English (en-US) for low-confidence detections or errors
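The sampling step could look roughly like the sketch below. It assumes the per-page text has already been pulled out of the PDF by whatever extraction library the pipeline uses; `sample_text` and `MAX_SAMPLE_PAGES` are hypothetical names, not taken from the codebase.

```python
from itertools import islice
from typing import Iterable, Optional

MAX_SAMPLE_PAGES = 5  # the issue samples the first 5 pages


def sample_text(page_texts: Iterable[Optional[str]],
                max_pages: int = MAX_SAMPLE_PAGES) -> str:
    """Join the text of the first `max_pages` pages for language analysis.

    Only the first `max_pages` pages are inspected; pages without text
    (for example scanned images) simply contribute nothing.
    """
    parts = []
    for text in islice(page_texts, max_pages):
        # Collapse runs of whitespace and tolerate pages that yielded None.
        cleaned = " ".join((text or "").split())
        if cleaned:
            parts.append(cleaned)
    return "\n".join(parts)
```

Because it takes any iterable of page texts, the sketch stays independent of the PDF library and does not read pages beyond the sample window.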
2. Comprehensive Language Support
Supports 30+ languages with proper locale mapping:
| Language | AWS Code | Adobe Locale | Region |
|---|---|---|---|
| English | en | en-US | United States |
| Spanish | es | es-ES | Spain |
| Catalan | ca | ca-ES | Spain |
| French | fr | fr-FR | France |
| German | de | de-DE | Germany |
| Italian | it | it-IT | Italy |
| Portuguese | pt | pt-BR | Brazil |
| Japanese | ja | ja-JP | Japan |
| Chinese | zh | zh-CN | China (Simplified) |

And 20+ more...
3. Integrated Processing Pipeline
Autotagging: Applies detected locale to AutotagPDFParams for language-aware accessibility tagging
Text Extraction: Uses detected locale in ExtractPDFParams for improved text and table extraction
PDF Metadata: Sets document language metadata consistently across the pipeline
4. Enhanced Error Handling & Logging
Comprehensive logging of detection process and confidence scores
Handles AWS Comprehend API limits (5000 bytes max text)
Manages insufficient text scenarios gracefully
Detailed error reporting for troubleshooting
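A minimal sketch of how the size limit and the insufficient-text case might be handled before calling Comprehend. The 5000-byte limit is the figure quoted in this issue; the 20-character minimum and the name `prepare_sample` are assumptions for illustration.

```python
from typing import Optional

MAX_TEXT_BYTES = 5000  # limit quoted in this issue
MIN_TEXT_CHARS = 20    # assumed minimum for a meaningful detection


def prepare_sample(text: str,
                   max_bytes: int = MAX_TEXT_BYTES,
                   min_chars: int = MIN_TEXT_CHARS) -> Optional[str]:
    """Return text that fits the byte limit, or None if there is too little."""
    text = text.strip()
    if len(text) < min_chars:
        return None  # caller should fall back to the default locale
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut on the byte limit; errors="ignore" drops a trailing partial
    # multi-byte character instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Truncating on bytes rather than characters matters for accented or CJK text, where one character can take two to four bytes in UTF-8.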
🔧 Technical Implementation
Core Components Added:
1. Language Detection Function
def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """
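The core of such a function might be sketched as follows. This is not the actual implementation: `detect_locale` is a hypothetical helper that takes already-extracted text and an injected client, and the mapping shown is only an excerpt. The client is expected to offer `detect_dominant_language(Text=...)`, as the boto3 Comprehend client does.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

DEFAULT_LOCALE = "en-US"
CONFIDENCE_THRESHOLD = 0.70  # the 70% threshold described in this issue
# Excerpt of the language-to-locale mapping, for illustration only.
LOCALES = {"en": "en-US", "es": "es-ES", "ca": "ca-ES", "fr": "fr-FR"}


def detect_locale(text: str, comprehend: Any) -> str:
    """Return an Adobe locale for `text`, falling back to en-US when unsure."""
    try:
        response = comprehend.detect_dominant_language(Text=text)
    except Exception as exc:  # broad on purpose: any failure means fallback
        logger.warning("Language detection failed: %s", exc)
        return DEFAULT_LOCALE

    languages = response.get("Languages", [])
    if not languages:
        return DEFAULT_LOCALE

    # Pick the highest-scoring candidate explicitly rather than
    # relying on the order of the response.
    best = max(languages, key=lambda lang: lang["Score"])
    logger.info("Detected language: %s (confidence: %s)",
                best["LanguageCode"], best["Score"])

    if best["Score"] < CONFIDENCE_THRESHOLD:
        return DEFAULT_LOCALE
    return LOCALES.get(best["LanguageCode"], DEFAULT_LOCALE)
```

In the pipeline the client would presumably be `boto3.client("comprehend")`; injecting it keeps the function testable with a stub and without AWS credentials.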
2. Updated API Functions
autotag_pdf_with_options() - Now accepts detected_locale parameter
extract_api() - Now accepts detected_locale parameter
set_language_comprehend() - Enhanced to use detected locale for PDF metadata
3. Language-to-Locale Mapping
Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.
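As an illustration, the mapping could be a plain dictionary with a defaulting lookup. Only the nine pairs listed in this issue are shown; the remaining entries are not reproduced here, and the names `COMPREHEND_TO_ADOBE_LOCALE` and `to_adobe_locale` are hypothetical.

```python
DEFAULT_LOCALE = "en-US"

# AWS Comprehend language code -> Adobe PDF Services locale.
# Only the pairs listed in this issue; the full mapping has more entries.
COMPREHEND_TO_ADOBE_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}


def to_adobe_locale(language_code: str) -> str:
    """Map a Comprehend language code to a locale, defaulting to en-US."""
    key = language_code.strip().lower()
    return COMPREHEND_TO_ADOBE_LOCALE.get(key, DEFAULT_LOCALE)
```

Unmapped codes fall back to the default rather than raising, which matches the graceful-fallback behaviour described above.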
Infrastructure Changes:
AWS CDK Updates (app.py):
IAM Permissions: Added comprehend:DetectDominantLanguage permission to ECS task role
Backward Compatibility: Maintains support for manual locale override via the PDF_LOCALE environment variable
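One way the manual override could take precedence over detection is sketched below. `PDF_LOCALE` is the variable named in this issue; `resolve_locale` and its `detect` callback are hypothetical, and the exact precedence in the real code is an assumption.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_locale(detect: Callable[[], str],
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Return the PDF_LOCALE override if set, otherwise the detected locale."""
    env = os.environ if env is None else env
    override = env.get("PDF_LOCALE", "").strip()
    if override:
        # Manual override wins; detection is skipped entirely.
        return override
    return detect()
```

Passing detection as a callback means no Comprehend call is made when the override is present, which also keeps the rollback path (`PDF_LOCALE=en-US`) independent of the new permission.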
📊 Processing Flow
graph TD
A[PDF Upload] --> B[Download from S3]
B --> C[Extract Text from First 5 Pages]
C --> D[AWS Comprehend Language Detection]
D --> E{Confidence ≥ 70%?}
E -->|Yes| F[Map to Adobe Locale]
E -->|No| G[Default to en-US]
F --> H[Apply Locale to Adobe APIs]
G --> H
H --> I[Autotagging with Locale]
H --> J[Text Extraction with Locale]
I --> K[Set PDF Language Metadata]
J --> K
K --> L[Upload Processed PDF]
🧪 Testing Scenarios
Test Cases to Validate:
Spanish Documents: Verify es-ES locale detection and application
Catalan Documents: Verify ca-ES locale detection and application
Mixed Language Documents: Test confidence thresholding
Scanned/Image PDFs: Handle insufficient text scenarios
Very Short Documents: Test minimum text requirements
Error Scenarios: AWS Comprehend API failures, network issues
Backward Compatibility: Manual locale override still works
Expected Improvements:
Better Accessibility Tagging: Language-specific heading detection and structure analysis
Improved Text Extraction: Better handling of language-specific characters and formatting
Enhanced Metadata: Proper language metadata in final PDF documents
Compliance: Better WCAG 2.1 compliance for non-English documents
📈 Benefits
For Users:
Automatic Processing: No manual language configuration required
For Developers:
🔍 Monitoring & Observability
Key Metrics to Track:
Log Messages Added:
Detected language: {code} (confidence: {score})
Using locale for autotagging: {locale}
Using locale for extraction: {locale}
Language set to {code} (from detected locale: {locale})
🚀 Deployment Notes
Prerequisites:
Rollback Plan:
PDF_LOCALE environment variable to force specific locale
PDF_LOCALE=en-US
🔮 Future Enhancements
Potential Improvements:
📝 Files Modified
Core Changes:
docker_autotag/autotag.py: Added language detection and locale parametrization
app.py: Updated IAM permissions and removed hardcoded locale
Key Functions Added/Modified:
detect_document_language() - New function for language detection
autotag_pdf_with_options() - Added locale parameter
extract_api() - Added locale parameter
set_language_comprehend() - Enhanced with locale support
main() - Integrated language detection workflow
🏷️ Labels
enhancement, language-support, aws-comprehend, adobe-pdf-services, accessibility, internationalization, i18n
🔗 Related Issues
Priority: High
Complexity: Medium
Impact: High - Significantly improves processing quality for non-English documents