Automated FGDC CSDGM to ISO 19115 Conversion Pipeline
Converting FGDC CSDGM to ISO 19115 automatically requires deterministic schema translation rather than heuristic parsing. Legacy FGDC XML structures lack the strict cardinality, temporal precision, and controlled vocabulary constraints mandated by ISO 19115:2003 and ISO 19115-1:2014. Direct mapping pipelines routinely fail when they encounter unstructured <idinfo> blocks, ambiguous coordinate reference system declarations, or missing lineage sequencing. This guide provides a production-ready Python ETL configuration that enforces strict type coercion, resolves schema drift, and checks compliance through automated validation and fallback routing. For foundational context on spatial metadata interoperability, review the Geospatial Schema Architecture & Standards Mapping framework before deploying this pipeline.
Declarative Mapping Configuration
The conversion engine must operate on a declarative translation matrix to prevent ad-hoc XPath mutations during execution. Define a YAML-driven mapping layer that explicitly binds FGDC elements to ISO 19115 equivalents while enforcing mandatory field population. The following minimal reproducible configuration establishes the baseline translation rules:
```yaml
mapping_matrix:
  idinfo/citation/citeinfo/title: identificationInfo/MD_DataIdentification/citation/CI_Citation/title
  idinfo/citation/citeinfo/pubdate: identificationInfo/MD_DataIdentification/citation/CI_Citation/date/CI_Date/date
  # geoform is the FGDC presentation form, which corresponds to CI_Citation presentationForm
  idinfo/citation/citeinfo/geoform: identificationInfo/MD_DataIdentification/citation/CI_Citation/presentationForm/CI_PresentationFormCode
  # dateTime holds a DateTime value directly in ISO 19115:2003, with no CI_Date wrapper
  dataqual/lineage/procstep/procdate: dataQualityInfo/DQ_DataQuality/lineage/LI_Lineage/processStep/LI_ProcessStep/dateTime
  spref/horizsys/geodetic/horizdn: referenceSystemInfo/MD_ReferenceSystem/referenceSystemIdentifier/RS_Identifier/code
```
Implement this matrix using lxml.etree with a deterministic traversal order. Avoid recursive wildcard searches (//): they can match same-named elements at unintended depths, which makes node selection unpredictable across heterogeneous records and breaks batch processing consistency. The Python implementation below resolves each path explicitly and returns None when a source node is missing or empty:
```python
import yaml
from lxml import etree
from typing import Dict, Optional, Tuple

def load_mapping_config(config_path: str) -> Dict[str, str]:
    """Load the FGDC-to-ISO path mapping from the YAML configuration."""
    with open(config_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)["mapping_matrix"]

def translate_node(
    source_tree: etree._ElementTree,
    fgdc_path: str,
    iso_path: str
) -> Optional[Tuple[str, str]]:
    """Extracts FGDC value and returns ISO target path + value."""
    # find() resolves the path relative to the root element and returns the first match
    node = source_tree.find(fgdc_path)
    # Treat absent, empty, and whitespace-only nodes as missing
    if node is None or not node.text or not node.text.strip():
        return None
    return iso_path, node.text.strip()
```
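The `(iso_path, value)` pairs produced by the translation step still have to be written into an output document. The following is a minimal sketch of that step, not the pipeline's actual writer. It assumes a hypothetical helper named `set_by_path`, uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml` (the calls used here exist in both), and omits the `gmd:`/`gco:` namespaces and value wrappers that a real ISO 19139 document requires. The sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET

def set_by_path(root: ET.Element, path: str, value: str) -> ET.Element:
    """Create any missing elements along `path` below `root`, then set the leaf text."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = value
    return node

# Hypothetical sample record; real FGDC files carry many more elements.
fgdc = ET.fromstring(
    "<metadata><idinfo><citation><citeinfo>"
    "<title> Sample Hydrography </title>"
    "</citeinfo></citation></idinfo></metadata>"
)

mapping = {
    "idinfo/citation/citeinfo/title":
        "identificationInfo/MD_DataIdentification/citation/CI_Citation/title",
    # Absent from the sample record, so this entry is skipped.
    "spref/horizsys/geodetic/horizdn":
        "referenceSystemInfo/MD_ReferenceSystem/referenceSystemIdentifier/RS_Identifier/code",
}

iso_root = ET.Element("MD_Metadata")
for fgdc_path, iso_path in mapping.items():  # dicts preserve insertion order
    source = fgdc.find(fgdc_path)
    if source is None or not (source.text or "").strip():
        continue
    set_by_path(iso_root, iso_path, source.text.strip())

print(ET.tostring(iso_root, encoding="unicode"))
```

Because missing source nodes are skipped rather than written as empty elements, the output contains only branches that carry a value; mandatory-element handling is a separate concern.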
Precision Management and Temporal Coercion
Precision loss during coordinate and temporal conversion is a primary failure vector in automated pipelines. FGDC stores dates as YYYYMMDD, YYYYMM, or YYYY strings, while ISO 19115 XML encodings expect ISO 8601 extended formatting such as YYYY-MM-DD; this pipeline normalizes every date to a full UTC timestamp. Implement a regex-based temporal parser that raises an error on any value it cannot parse. For spatial bounding boxes, FGDC uses <westbc>, <eastbc>, <northbc>, and <southbc>. ISO 19115 requires EX_GeographicBoundingBox with westBoundLongitude, eastBoundLongitude, southBoundLatitude, and northBoundLatitude elements.
Apply the following thresholds and rules:
- Temporal Precision: If the source date is coarser than day-level (YYYY or YYYYMM), default the missing month and day to `01`, inject `00:00:00Z`, and append a `<gco:CharacterString>` provenance note indicating automated normalization.
- Coordinate Precision: Enforce a 6-decimal precision threshold during float coercion to prevent floating-point drift that corrupts downstream spatial indexing.
- Rounding Strategy: Values exceeding the threshold must be rounded using `decimal.ROUND_HALF_EVEN`, which matches the default round-half-to-even mode of IEEE 754.
- Null Handling: Missing bounding box coordinates trigger a fallback to the dataset’s declared CRS extent or raise a `ValueError` if no authoritative extent exists.
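```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation, ROUND_HALF_EVEN

# FGDC calendar dates: YYYY, YYYYMM, or YYYYMMDD
DATE_PATTERN = re.compile(r"^(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?$")

def coerce_date(raw_date: str) -> str:
    """Normalize an FGDC date string to an ISO 8601 UTC timestamp."""
    match = DATE_PATTERN.match(raw_date.strip())
    if not match:
        raise ValueError(f"Invalid FGDC date format: {raw_date}")

    year = match.group("year")
    # A missing month or day defaults to 01
    month = match.group("month") or "01"
    day = match.group("day") or "01"

    iso_date = f"{year}-{month}-{day}"
    try:
        # Reject impossible calendar dates such as 2021-02-30
        datetime.strptime(iso_date, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"Invalid calendar date: {iso_date}") from None

    return f"{iso_date}T00:00:00Z"

def coerce_coordinate(val: str) -> str:
    """Round a coordinate string to 6 decimal places using round-half-even."""
    try:
        d = Decimal(val.strip())
    except InvalidOperation:
        # Decimal raises InvalidOperation, not ValueError, on non-numeric input
        raise ValueError(f"Invalid coordinate value: {val!r}") from None
    return str(d.quantize(Decimal("0.000001"), rounding=ROUND_HALF_EVEN))
```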
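The bounding box conversion described in this section can be sketched along the same lines. The example below is illustrative rather than definitive: `convert_bounding` and `BOUNDS` are hypothetical names, the standard-library `xml.etree.ElementTree` stands in for `lxml`, namespaces and `gco:Decimal` wrappers are omitted, and the fallback to a declared CRS extent is not implemented, so a missing coordinate simply raises `ValueError`.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation, ROUND_HALF_EVEN

# FGDC tag -> (ISO element name, minimum, maximum), listed in ISO element order.
BOUNDS = {
    "westbc": ("westBoundLongitude", Decimal(-180), Decimal(180)),
    "eastbc": ("eastBoundLongitude", Decimal(-180), Decimal(180)),
    "southbc": ("southBoundLatitude", Decimal(-90), Decimal(90)),
    "northbc": ("northBoundLatitude", Decimal(-90), Decimal(90)),
}
SIX_PLACES = Decimal("0.000001")

def convert_bounding(bounding: ET.Element) -> ET.Element:
    """Build an EX_GeographicBoundingBox element from an FGDC <bounding> element."""
    box = ET.Element("EX_GeographicBoundingBox")
    for fgdc_tag, (iso_tag, low, high) in BOUNDS.items():
        raw = bounding.findtext(fgdc_tag)
        if raw is None or not raw.strip():
            raise ValueError(f"Missing bounding coordinate: {fgdc_tag}")
        try:
            value = Decimal(raw.strip())
        except InvalidOperation:
            raise ValueError(f"Non-numeric value in {fgdc_tag}: {raw!r}") from None
        # is_finite() screens out NaN and Infinity before the range comparison
        if not value.is_finite() or not low <= value <= high:
            raise ValueError(f"Out-of-range value in {fgdc_tag}: {raw!r}")
        rounded = value.quantize(SIX_PLACES, rounding=ROUND_HALF_EVEN)
        ET.SubElement(box, iso_tag).text = str(rounded)
    return box

# Invented sample values for illustration.
sample = ET.fromstring(
    "<bounding><westbc>-124.4096</westbc><eastbc>-114.13121249</eastbc>"
    "<northbc>42.0095</northbc><southbc>32.5343</southbc></bounding>"
)
print(ET.tostring(convert_bounding(sample), encoding="unicode"))
```

Working in `Decimal` from the source string onward avoids a round trip through binary floats, which is where drift in the trailing digits would otherwise appear.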
Schema Drift Debugging and Fallback Routing
Schema drift occurs when FGDC profiles omit mandatory elements or use deprecated vocabulary codes. Production pipelines must implement a quarantine-and-retry architecture rather than hard-failing. When FGDC Metadata Mapping profiles diverge from the baseline, route non-compliant records to a staging directory with structured error manifests.
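One way to picture the quarantine step is a helper that copies the failing record into the staging directory and writes a JSON manifest beside it. This is a sketch under stated assumptions: `quarantine_record`, the manifest field names, and the `.manifest.json` suffix are all hypothetical choices, not part of any standard or of the pipeline described here.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import List

def quarantine_record(source: Path, staging_dir: Path, errors: List[str]) -> Path:
    """Copy a non-compliant record into staging and write an error manifest next to it."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, staging_dir / source.name)

    manifest = {
        "source": str(source),
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
        "errors": errors,
        # Hypothetical counter that a retry job could increment on each attempt
        "retry_count": 0,
    }
    manifest_path = staging_dir / f"{source.stem}.manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

A retry job can then iterate over the manifests, re-run the conversion, and delete both files on success, which keeps the main batch moving while problem records wait for attention.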
Handle the following edge cases explicitly:
- Missing Fields: If a mandatory ISO 19115 element lacks a direct FGDC equivalent, add a `gco:nilReason` attribute to the empty property element with the value `missing` or `unknown`, as defined in the gco schema of the ISO/TS 19139 XML encoding.
- CRS Mismatches: FGDC often declares horizontal datums via free-text strings. Resolve these using an authoritative EPSG registry lookup. If resolution fails, default to `EPSG:4326` and log a `CRS_AMBIGUOUS` warning.
- CI Pipeline Failures: Integrate schema validation into your CI/CD workflow. Fail fast on XML well-formedness errors, but allow soft-failures on optional metadata blocks to prevent deployment bottlenecks.
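The CRS fallback rule can be illustrated with a small resolver. The sketch below assumes a hypothetical in-memory table named `DATUM_TO_EPSG` in place of a real EPSG registry query, and a hypothetical function named `resolve_datum`; the three codes listed are the geographic CRS codes for NAD83, NAD27, and WGS 84.

```python
import logging
import re
from typing import Tuple

logger = logging.getLogger("fgdc_to_iso")

# Hypothetical lookup table; a production pipeline would query an EPSG registry instead.
DATUM_TO_EPSG = {
    "north american datum of 1983": "EPSG:4269",
    "nad83": "EPSG:4269",
    "north american datum of 1927": "EPSG:4267",
    "nad27": "EPSG:4267",
    "world geodetic system 1984": "EPSG:4326",
    "wgs84": "EPSG:4326",
}
DEFAULT_CRS = "EPSG:4326"

def resolve_datum(horizdn: str) -> Tuple[str, bool]:
    """Return (epsg_code, resolved) for a free-text FGDC horizontal datum name."""
    # Normalize case and collapse whitespace/underscores so "WGS 84" matches "wgs84"
    key = re.sub(r"[\s_]+", " ", horizdn.strip().lower())
    code = DATUM_TO_EPSG.get(key) or DATUM_TO_EPSG.get(key.replace(" ", ""))
    if code is None:
        logger.warning(
            "CRS_AMBIGUOUS: could not resolve datum %r; defaulting to %s",
            horizdn, DEFAULT_CRS,
        )
        return DEFAULT_CRS, False
    return code, True
```

Returning the `resolved` flag alongside the code lets the caller record in the output metadata that the CRS was assumed rather than declared, instead of relying on the log alone.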
Automated Validation and CI Integration
Post-conversion validation must occur before data publication. Validate each output against the official ISO 19115 XML Schema Definition (XSD), for example with lxml or the xmllint command-line tool, and apply Schematron rules to enforce business logic constraints. Embed validation directly into your CI pipeline using pytest and a custom fixture that asserts zero critical violations. Refer to the official ISO 19115-1:2014 Specification for authoritative schema definitions.
```python
import subprocess
from pathlib import Path

def validate_iso19115(xml_path: Path, xsd_path: Path) -> bool:
    """Validates ISO 19115 XML against official XSD using xmllint."""
    result = subprocess.run(
        ["xmllint", "--noout", "--schema", str(xsd_path), str(xml_path)],
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Schema Validation Failed:\n{result.stderr}")
        return False
    return True
```
Configure your CI runner to:
- Run `validate_iso19115()` on every pull request touching the metadata directory.
- Block merges if validation returns `False` or if the error log contains `CRITICAL` severity tags.
- Archive failed XML payloads for manual review by GIS data stewards.
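The runner steps above could be wired together roughly as follows. This is a sketch, not a prescribed CI configuration: `run_gate` and `exit_code` are hypothetical names, the validator is passed in as a callable so that any check returning a boolean can be plugged in, and the archive directory is assumed to live outside the metadata directory so archived copies are not re-scanned.

```python
import shutil
from pathlib import Path
from typing import Callable, List

def run_gate(
    metadata_dir: Path,
    archive_dir: Path,
    validator: Callable[[Path], bool],
) -> List[Path]:
    """Validate every XML file under metadata_dir and copy failures to archive_dir."""
    failures: List[Path] = []
    for xml_path in sorted(metadata_dir.rglob("*.xml")):  # sorted for a stable order
        if validator(xml_path):
            continue
        archive_dir.mkdir(parents=True, exist_ok=True)
        # Files sharing a name in different subdirectories would overwrite each other here
        shutil.copy2(xml_path, archive_dir / xml_path.name)
        failures.append(xml_path)
    return failures

def exit_code(failures: List[Path]) -> int:
    """Non-zero when any record failed, which CI systems typically treat as blocking."""
    return 1 if failures else 0
```

In a CI job, the script entry point would call `sys.exit(exit_code(run_gate(...)))` with a validator that wraps the schema check, and the archive directory would be uploaded as a build artifact for review.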
This pipeline architecture is designed to produce deterministic output, preserve spatial and temporal precision, and align with federal geospatial interoperability mandates.