🤖 Milestone 2: AI-Powered Data Extraction
System Overview
⚡ Key Architecture Principle
Single-Pass Extraction: The system extracts ALL available fields from documents in ONE pass - both financial AND carbon-relevant operational data. Carbon emissions are then CALCULATED using the extracted data as inputs to specialized APIs.
Data Flow Architecture
Any business document → Financial + Operational fields → Use extracted data in APIs → Complete data + emissions
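The flow above can be sketched as a small pipeline. This is a minimal illustration, not the actual implementation: `extractFields`, `calculateCarbon`, `processDocument`, and the `calculators` object are hypothetical names, and the extraction stage is stubbed with fixed sample values where a real system would call the AI model.

```javascript
// Hypothetical stage functions; names and return shapes are assumptions for illustration.

// Extraction: one pass returns financial and operational fields together.
function extractFields(documentText) {
  // Stub: a real system would send documentText to the AI model here.
  return {
    financial: { vendor: "Maersk Shipping", totalAmount: 25000 },
    operational: { weight: 2500, weightUnit: "kg", origin: "Shanghai", destination: "Rotterdam" },
  };
}

// Calculation: consumes extracted fields only and never re-reads the document.
function calculateCarbon(operational, calculators) {
  const distanceKm = calculators.distance(operational.origin, operational.destination);
  const emissionsKgCO2e = calculators.emissions(operational, distanceKm);
  return { distanceKm, emissionsKgCO2e };
}

// Output: merge extracted and calculated values into one record.
function processDocument(documentText, calculators) {
  const extracted = extractFields(documentText);
  const calculated = calculateCarbon(extracted.operational, calculators);
  return { ...extracted, calculated };
}
```

Passing the calculators in as an argument keeps the extraction pass independent of whichever routing or emissions service is used later.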
📄 Document Type Intelligence
Document Recognition & Field Extraction
The AI recognizes document type and extracts relevant fields accordingly. Different document types contain different carbon-relevant operational data that must be captured for accurate emissions calculation.
Invoice:
- Vendor, amount, date
- Line items, tax, currency
- Payment method
- Carbon data: usually none (basic purchase)
Shipping:
- Shipping cost, fees
- Weight: "2,500 kg"
- Route: "Shanghai to Rotterdam"
- Mode: "Maritime freight"
- Container: "40ft refrigerated"
- Distance: "19,550 km" (if stated)
Utility:
- Total charges, rate schedule
- Electricity: "15,000 kWh"
- Natural Gas: "500 therms"
- Period: "Jan 1-31, 2025"
- Peak/Off-peak: breakdown
- Renewable %: "35%" (if stated)
Manufacturing:
- Production costs, labor
- Units: "1,000 pieces"
- Materials: "Aluminum 500kg, Steel 300kg"
- Process: "CNC machining"
- Machine hours: "48 hours"
- Waste: "50kg scrap metal"
Fuel:
- Price per gallon, total cost
- Fuel type: "Diesel"
- Quantity: "150 gallons"
- Vehicle/Equipment: "Truck #45"
- Octane/Grade: "87"
Service:
- Service description, hours, rate
- Travel: "2 technicians, 50 miles"
- Carbon data: often minimal
Field Presence by Document Type
| Field | Invoice | Shipping | Utility | Manufacturing | Fuel |
|---|---|---|---|---|---|
| Vendor/Amount | ✓ | ✓ | ✓ | ✓ | ✓ |
| Weight/Quantity | - | ✓ | - | ✓ | ✓ |
| Transport Route | - | ✓ | - | ? | - |
| Energy Usage | - | ? | ✓ | ✓ | - |
| Materials | - | ? | - | ✓ | - |
| Fuel Type/Amount | - | ? | ? | ? | ✓ |
Legend: ✓ = typically present; ? = sometimes present; - = typically absent.
🎯 What We Extract vs What We Calculate
Critical Distinction
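We EXTRACT operational data that actually appears in documents, then CALCULATE emissions by passing that data to specialized carbon APIs. We never guess or estimate a value that is missing from a document: we extract what is there and calculate what is not. Reference values such as emission factors, marked ENRICHED in the tables, are looked up from external sources rather than inferred from the document.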
| Data Field | Source | Example from Document | Calculated Result |
|---|---|---|---|
| Shipping Weight | EXTRACTED | "Gross Weight: 2,500 kg" | Used as input for emissions |
| Transport Route | EXTRACTED | "Shanghai to Rotterdam" | Used to calculate distance |
| Distance | CALCULATED | Usually not in document | 19,550 km (via API) |
| Transport Mode | EXTRACTED | "Via: Maritime Freight" | Used for emission factor |
| CO2 Emissions | CALCULATED | Never in document | 125 kg CO2e (calculated) |
| Electricity Usage | EXTRACTED | "Total kWh: 15,000" | Input for Scope 2 |
| Grid Emission Factor | ENRICHED | Not in document | 0.5 kg CO2/kWh (EPA API) |
| Material Type | EXTRACTED | "Aluminum Alloy 6061" | Used for lifecycle query |
| Material Quantity | EXTRACTED | "500 kg aluminum" | Input for material emissions |
| Material Emissions | CALCULATED | Never in document | 4,500 kg CO2e (Ecoinvent) |
const extractedData = {
  // EXTRACTED from shipping document
  vendor: "Maersk Shipping",
  totalAmount: 25000,
  weight: 2500,              // Found in document: "2,500 kg"
  weightUnit: "kg",          // Found in document
  origin: "Shanghai",        // Found in document
  destination: "Rotterdam",  // Found in document
  transportMode: "maritime", // Found in document

  // NOT in document - will be CALCULATED
  distance: null,            // Calculate via routing API
  emissions: null,           // Calculate via emissions API
  emissionFactor: null       // Get from EPA/Ecoinvent
};

// Then call APIs with extracted data
const emissions = await calculateEmissions(extractedData);
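One possible shape for `calculateEmissions` is sketched below. It is an assumption, not the project's implementation: the second `apis` argument and its `getDistanceKm` and `getEmissionFactor` methods are hypothetical stand-ins for the routing and emission-factor services, and the factor is assumed to be expressed in kg CO2e per tonne-kilometre.

```javascript
// Activity-based transport formula: tonnes x kilometres x factor (kg CO2e per tonne-km).
function transportEmissionsKg(weightKg, distanceKm, factorKgPerTonneKm) {
  return (weightKg / 1000) * distanceKm * factorKgPerTonneKm;
}

// Sketch only: `apis.getDistanceKm` and `apis.getEmissionFactor` are hypothetical.
async function calculateEmissions(extractedData, apis) {
  const { weight, weightUnit, origin, destination, transportMode } = extractedData;

  // Refuse to calculate from missing inputs rather than guessing them.
  if (weight == null || !origin || !destination || !transportMode) {
    throw new Error("Missing required extracted field for transport emissions");
  }
  if (weightUnit !== "kg") {
    throw new Error(`Unsupported weight unit: ${weightUnit}`);
  }

  // Prefer a distance stated in the document; otherwise ask the routing service.
  const distance =
    extractedData.distance ?? (await apis.getDistanceKm(origin, destination, transportMode));
  const emissionFactor = await apis.getEmissionFactor(transportMode);
  const emissions = transportEmissionsKg(weight, distance, emissionFactor);

  // Return a new record so the extracted values stay untouched.
  return { ...extractedData, distance, emissionFactor, emissions };
}
```

Throwing on missing inputs keeps the "no guessing" rule enforceable in code: a document that lacks a weight or route is flagged for review instead of producing an emissions figure.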
📊 Complete Data Field Mapping
Fields We Extract from Documents
| Field Name | Data Type | Source | Document Types | Purpose |
|---|---|---|---|---|
| vendor | String | EXTRACTED | All | Vendor identification |
| totalAmount | Numeric | EXTRACTED | All | Financial tracking |
| weight | Numeric + Unit | EXTRACTED | Shipping, Manufacturing | Transport emissions input |
| origin/destination | String | EXTRACTED | Shipping | Route calculation input |
| transportMode | Enum | EXTRACTED | Shipping | Emission factor selection |
| electricityUsage | Numeric (kWh) | EXTRACTED | Utility | Scope 2 emissions input |
| gasUsage | Numeric (therms) | EXTRACTED | Utility | Scope 1 emissions input |
| materialType | Array | EXTRACTED | Manufacturing | Material lifecycle input |
| materialQuantity | Numeric + Unit | EXTRACTED | Manufacturing | Material emissions input |
| fuelType | Enum | EXTRACTED | Fuel | Combustion factor input |
| fuelQuantity | Numeric + Unit | EXTRACTED | Fuel | Direct emissions input |
| productUnits | Numeric | EXTRACTED | Manufacturing | Per-unit carbon calculation |
| wasteGenerated | Numeric + Type | EXTRACTED | Manufacturing | Waste emissions input |
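Several fields above are typed "Numeric + Unit". A minimal sketch of how a raw document string could be split into those two parts follows; `parseQuantity` is a hypothetical helper, and the pattern only covers simple "number unit" strings such as the examples in this document.

```javascript
// Hypothetical helper: splits strings like "2,500 kg" into a number and a unit.
// Returns null when the text does not look like a quantity, so nothing is guessed.
function parseQuantity(text) {
  if (typeof text !== "string") return null;
  const match = text.match(/^\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z%]+)\s*$/);
  if (!match) return null;
  // Strip thousands separators before converting to a number.
  const value = Number(match[1].replace(/,/g, ""));
  return { value, unit: match[2] };
}
```

Keeping the unit next to the value lets later calculation steps convert explicitly (for example, gallons to liters) instead of assuming a unit.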
Fields We Calculate Using APIs
| Field Name | Data Type | Source | API/Service | Inputs Required |
|---|---|---|---|---|
| distance | Numeric (km) | CALCULATED | Routing API | origin, destination |
| transportEmissions | Numeric (kg CO2e) | CALCULATED | Transport Calculator | weight, distance, mode |
| scope2Emissions | Numeric (kg CO2e) | CALCULATED | EPA Energy API | kWh, location, grid mix |
| scope1Emissions | Numeric (kg CO2e) | CALCULATED | EPA Factors | fuel type, quantity |
| materialEmissions | Numeric (kg CO2e) | CALCULATED | Ecoinvent | material type, quantity |
| emissionFactor | Float | ENRICHED | Multiple APIs | activity type, location |
| totalEmissions | Numeric (kg CO2e) | CALCULATED | Aggregation | All scope emissions |
| carbonIntensity | Float | CALCULATED | Calculation | emissions / units |
| blockchainHash | String | CALCULATED | Blockchain API | All carbon data |
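The aggregation rows can be illustrated with the example figures used earlier in this document (15,000 kWh at 0.5 kg CO2/kWh, 4,500 kg CO2e for materials, 125 kg CO2e for transport, 1,000 units). The function names are hypothetical, and combining figures from different example documents into one record is done purely for illustration.

```javascript
// Scope 2: electricity use multiplied by the grid emission factor.
function scope2EmissionsKg(kwh, gridFactorKgPerKwh) {
  return kwh * gridFactorKgPerKwh;
}

// Sum whichever components were calculated, then derive per-unit intensity.
function summarizeEmissions(components, productUnits) {
  // Components that were never calculated (null/undefined) are skipped, not treated as measured zeros.
  const known = Object.values(components).filter((v) => v != null);
  const totalEmissions = known.reduce((sum, v) => sum + v, 0);
  // Intensity is only defined when a positive unit count was extracted.
  const carbonIntensity = productUnits > 0 ? totalEmissions / productUnits : null;
  return { totalEmissions, carbonIntensity };
}

const summary = summarizeEmissions(
  {
    transportEmissions: 125,
    scope1Emissions: null, // no fuel or gas data in this illustration
    scope2Emissions: scope2EmissionsKg(15000, 0.5),
    materialEmissions: 4500,
  },
  1000
);
console.log(summary);
```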
📋 Enhanced Process Flow
Step 1: 50 ms
Step 2: 2-8 seconds
Step 3: 10 ms
Step 4: 20 ms
Step 5: 200-500 ms
Step 6: 100 ms
Step 7: 50 ms
Processing Pipeline Performance
🎯 Document-Aware Extraction Strategy
LangChain Prompt Engineering
The extraction prompt adapts based on document type, ensuring we capture all relevant operational data that will be needed for carbon calculations.
const extractionPrompts = {
  shipping: `
    Extract ALL of the following fields:
    FINANCIAL: vendor, amount, currency, date
    OPERATIONAL (for carbon calculation):
    - weight and unit (REQUIRED)
    - origin city/port (REQUIRED)
    - destination city/port (REQUIRED)
    - transport_mode (sea/air/road/rail)
    - container_type (if mentioned)
    - vessel/flight/truck number
    - distance (if stated)
    Return NULL for fields not found.`,

  utility: `
    Extract ALL of the following fields:
    FINANCIAL: vendor, amount, currency, billing_period
    OPERATIONAL (for carbon calculation):
    - electricity_kwh (REQUIRED if electric bill)
    - gas_therms or gas_cubic_meters (if gas bill)
    - water_gallons or water_liters (if water bill)
    - peak_usage_kwh (if mentioned)
    - off_peak_usage_kwh (if mentioned)
    - renewable_percentage (if mentioned)
    - service_address (for grid factors)
    Return NULL for fields not found.`,

  manufacturing: `
    Extract ALL of the following fields:
    FINANCIAL: vendor, amount, currency, date
    OPERATIONAL (for carbon calculation):
    - product_units (quantity produced)
    - materials_list (array of materials)
    - material_quantities (with units)
    - process_type (machining/molding/etc)
    - machine_hours (if mentioned)
    - energy_consumed_kwh (if mentioned)
    - waste_type and waste_quantity
    - scrap_percentage (if mentioned)
    Return NULL for fields not found.`,

  default: `
    Extract standard financial fields:
    vendor, amount, currency, date, line_items, tax, payment_method
    Also check for ANY operational data:
    quantities, weights, distances, energy usage, materials`
};

// Use in n8n-nodes-langchain
const documentType = detectDocumentType(extractedText);
const prompt = extractionPrompts[documentType] || extractionPrompts.default;
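The snippet above calls `detectDocumentType` without defining it. The sketch below uses a simple keyword count as a stand-in; this is a swapped-in heuristic for illustration, not the AI-based recognition the system describes, and the keyword lists are assumptions.

```javascript
// Assumed keyword lists; a production system would likely classify with the model itself.
const typeKeywords = {
  shipping: ["bill of lading", "gross weight", "port of loading", "consignee"],
  utility: ["kwh", "therms", "billing period", "meter"],
  manufacturing: ["work order", "machine hours", "units produced", "scrap"],
};

// Returns the type whose keywords appear most often, or "default" when none match.
function detectDocumentType(extractedText) {
  const text = extractedText.toLowerCase();
  let bestType = "default";
  let bestScore = 0;
  for (const [type, keywords] of Object.entries(typeKeywords)) {
    const score = keywords.filter((keyword) => text.includes(keyword)).length;
    if (score > bestScore) {
      bestType = type;
      bestScore = score;
    }
  }
  return bestType;
}
```

Returning `"default"` for unrecognized text means the prompt lookup falls through to the generic financial prompt rather than failing.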
✅ Key Implementation Points
- Always Extract Available Data: Never skip operational fields even if they seem unrelated
- Mark Required vs Optional: Some fields are critical for carbon calculation
- Return NULL for Missing: Don't guess or estimate missing fields
- Single Pass Only: Extract everything once, calculate later
- Preserve Units: Always capture units (kg, miles, kWh, etc.)
- Structured Output: Use JSON mode for consistent field extraction
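The points above can be enforced with a small check on the model's JSON output. This is a sketch under assumptions: `parseExtraction` and the `requiredFields` map are hypothetical, and the field names (`weight`, `weight_unit`, `origin`, `destination`) are one possible naming of the REQUIRED shipping fields in the prompt.

```javascript
// Assumed required-field names per document type; only shipping is filled in here.
const requiredFields = {
  shipping: ["weight", "weight_unit", "origin", "destination"],
};

// Parses the model's JSON reply and reports which required fields came back empty.
function parseExtraction(documentType, rawJson) {
  const fields = JSON.parse(rawJson);
  const required = requiredFields[documentType] || [];
  // A field counts as missing when it is null or absent; nothing is filled in by default.
  const missing = required.filter((name) => fields[name] === null || fields[name] === undefined);
  return { fields, missing, readyForCalculation: missing.length === 0 };
}
```

A record with a non-empty `missing` list can still be stored for financial tracking while being held back from the carbon calculation step.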
🔧 System Components
Core Extraction Components
Carbon Calculation Components (Post-Extraction)
📈 Performance Metrics
Extraction Accuracy by Document Type
📊 Why This Approach Works
- Data Exists: Shipping documents routinely state weights and routes
- Single Pass: More efficient than multiple extraction attempts
- Accurate Inputs: Real data produces accurate carbon calculations
- No Guessing: We extract what's there, calculate what's not
- API Ready: Extracted data directly feeds carbon APIs
- Audit Trail: Clear distinction between extracted vs calculated
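The audit-trail point can be made concrete by storing each value together with its source. The sketch below is illustrative only: the `field` and `auditTrail` helpers are hypothetical, and the three source labels are taken from the tables in this document.

```javascript
// Source labels used in the field-mapping tables.
const SOURCES = ["EXTRACTED", "CALCULATED", "ENRICHED"];

// Wraps a value with where it came from; rejects labels outside the known set.
function field(value, source, origin) {
  if (!SOURCES.includes(source)) {
    throw new Error(`Unknown source: ${source}`);
  }
  return { value, source, origin };
}

// Produces one human-readable line per field for review or export.
function auditTrail(record) {
  return Object.entries(record).map(
    ([name, f]) => `${name}: ${f.value} [${f.source}] ${f.origin}`
  );
}

const record = {
  weight: field(2500, "EXTRACTED", "document: Gross Weight"),
  distance: field(19550, "CALCULATED", "routing API"),
};
console.log(auditTrail(record).join("\n"));
```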
💡 Business Value
- Accurate Carbon Tracking: Based on real operational data, not estimates
- Compliance Ready: Figures trace back to source documents, which supports regulatory reporting
- Supply Chain Visibility: Track emissions at transaction level
- No Additional Data Entry: Extract from existing documents
- Immediate ROI: Accurate reporting can reduce carbon tax exposure caused by over-estimated emissions
- Competitive Advantage: Real carbon passports vs estimates