XML Schema (XSD) Generation Guidelines
Version: 1.0.0
Target: XML Schema 1.0
Purpose: Generate XSD files from entity specifications
⚠️ CRITICAL RULES FOR LLM AGENTS
These rules are MANDATORY and must NEVER be violated:
- 🚫 NEVER use xsd:any - All elements must be explicitly defined
- 🚫 NEVER create chameleon schemas - Always include targetNamespace
- 🚫 NEVER use inheritance to ADD fields - Only use xsd:restriction for stricter validation
- 🚫 NEVER use elementFormDefault="unqualified" - All schemas must use elementFormDefault="qualified" so that every element in an XML instance is namespace-qualified
- ✅ USE composition - Define complete types with all fields explicitly
- ✅ USE restriction - For narrowing validation rules only
See detailed explanations in the "Critical Rules" section below.
Alignment with Existing Schemas
IMPORTANT: The rut-schemas/ directory contains normative XSD definitions. Generated schemas must:
- Align with existing type definitions in rut-schemas/src/main/xsd/types/
- Import and reuse common types (strings, dates, enums, etc.)
- Follow established naming patterns and structure
- Maintain backward compatibility
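Type Mappings
Map data types from specs/_meta/data-types.md to XSD types. The xs: prefix in the mapping denotes the XML Schema namespace (http://www.w3.org/2001/XMLSchema); the templates in this document bind that namespace to xsd:, so emit built-in types with whichever prefix the schema file actually declares.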
type_mappings:
  # Primitive Types
  string: "xs:string"
  text: "xs:string"
  integer: "xs:integer"
  long: "xs:long"
  positive_integer: "xs:positiveInteger"
  decimal: "xs:decimal"
  boolean: "xs:boolean"
  datetime: "xs:dateTime"
  date: "xs:date"
  # Formatted Types
  uuid: "xs:string"
  uri: "xs:anyURI"
  email: "xs:string"
  language_code: "xs:language"
  version_string: "xs:string"
  # Restricted String Types (use existing types)
  short_string: "strings:ShortString"
  long_string: "strings:LongString"
  identifier_token: "strings:IdentifierToken"
  # Date Range Types (use existing types)
  date_range: "dates:DateRange"
  date_range_open_end: "dates:DateRangeOpenEnd"
  # Coordinate Types (use existing types)
  coordinates: "coordinates:Coordinates"
  # Metadata Types (use existing types)
  extraction_metadata: "meta:Meta"
  source_id_triple: "sourceId:SourceId3"
  # Enumeration Types (use existing types)
  dataset_collection_type: "enums:DatasetCollectionType"
  personal_data_status: "enums:PersonalDataStatus"
  statistical_design_type: "enums:StatisticalDesignType"
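As a rough illustration of how the mapping is applied, the fragment below declares a handful of elements inside a properties sequence. The property names (title, homepage, validPeriod, createdAt, collectionType) are hypothetical and do not come from any entity specification; only the type names are taken from the mapping above. Built-in types are written with the xsd: prefix because that is the prefix the file template binds.

```xml
<!-- Hypothetical properties, shown only to illustrate the type mapping -->
<xsd:sequence>
  <!-- short_string -> strings:ShortString -->
  <xsd:element name="title" type="strings:ShortString"/>
  <!-- uri -> xs:anyURI (a built-in type, not the xsd:any wildcard) -->
  <xsd:element name="homepage" type="xsd:anyURI" minOccurs="0"/>
  <!-- date_range -> dates:DateRange -->
  <xsd:element name="validPeriod" type="dates:DateRange" minOccurs="0"/>
  <!-- datetime -> xs:dateTime -->
  <xsd:element name="createdAt" type="xsd:dateTime" minOccurs="0"/>
  <!-- dataset_collection_type -> enums:DatasetCollectionType -->
  <xsd:element name="collectionType" type="enums:DatasetCollectionType" minOccurs="0"/>
</xsd:sequence>
```

Each non-built-in prefix used here (strings, dates, enums) needs a matching xmlns declaration and xsd:import in the schema file.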
Namespace Convention
namespaces:
  target_namespace: "https://schemas.rutdev.se/xsd/entities/{entity}-{version}.xsd"
  common_imports:
    strings: "https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
    dates: "https://schemas.rutdev.se/xsd/types/dates-1.0.xsd"
    enums: "https://schemas.rutdev.se/xsd/types/enums-1.0.xsd"
    coordinates: "https://schemas.rutdev.se/xsd/types/coordinates-1.0.xsd"
    meta: "https://schemas.rutdev.se/xsd/types/meta-1.0.xsd"
    sourceId: "https://schemas.rutdev.se/xsd/types/sourceId-1.0.xsd"
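When an entity needs more than the strings types, each extra namespace is declared on the schema element and imported separately. The sketch below adds dates and enums; the namespace URIs come from the list above, while the schemaLocation values for dates and enums are an assumption that simply mirrors the relative path this document uses for strings.

```xml
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="{target_namespace}"
    xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
    xmlns:dates="https://schemas.rutdev.se/xsd/types/dates-1.0.xsd"
    xmlns:enums="https://schemas.rutdev.se/xsd/types/enums-1.0.xsd"
    targetNamespace="{target_namespace}"
    elementFormDefault="qualified">

  <!-- One xsd:import per foreign namespace; locations below are assumed -->
  <xsd:import
      namespace="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
      schemaLocation="../../types/strings-1.1.xsd"/>
  <xsd:import
      namespace="https://schemas.rutdev.se/xsd/types/dates-1.0.xsd"
      schemaLocation="../../types/dates-1.0.xsd"/>
  <xsd:import
      namespace="https://schemas.rutdev.se/xsd/types/enums-1.0.xsd"
      schemaLocation="../../types/enums-1.0.xsd"/>

  <!-- Type and element declarations follow -->
</xsd:schema>
```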
XSD File Template
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:tns="{target_namespace}"
xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
targetNamespace="{target_namespace}"
elementFormDefault="qualified"
version="{entity_version}">
<!--
{EntityName} Schema
Description: {entity_description}
Version: {version}
Generated: {ISO_date}
Source: specs/entities/{level}/{version}/{entity}.md
-->
<!-- Import common type definitions -->
<xsd:import
namespace="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
schemaLocation="../../types/strings-1.1.xsd"
/>
<!-- Additional imports as needed (dates, enums, etc.) -->
<!-- Properties complex type -->
<xsd:complexType name="{EntityName}PropertiesType">
<xsd:sequence>
<!-- Property elements from entity specification -->
<xsd:element name="idAtOrigin" type="strings:IdentifierToken"/>
<!-- Additional properties... -->
</xsd:sequence>
</xsd:complexType>
<!-- Relations complex type (if entity has relationships) -->
<xsd:complexType name="{EntityName}RelationsType">
<xsd:sequence>
<!-- Each relationship from entity specification -->
<!-- Uses strings:EntityReference for all relationship elements -->
<xsd:element name="{relationshipName}" type="strings:EntityReference" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
<!-- Main entity complex type -->
<xsd:complexType name="{EntityName}Type">
<xsd:sequence>
<xsd:element name="properties" type="tns:{EntityName}PropertiesType" minOccurs="0"/>
<xsd:element name="relations" type="tns:{EntityName}RelationsType" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
<!-- Collection type -->
<xsd:complexType name="{EntityName}ListType">
<xsd:sequence>
<xsd:element name="{entityName}" type="tns:{EntityName}Type"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<!-- Root elements -->
<xsd:element name="{EntityName}" type="tns:{EntityName}Type"/>
<xsd:element name="{EntityName}List" type="tns:{EntityName}ListType"/>
</xsd:schema>
Type Naming Convention
| Component | Naming Pattern | Example |
|---|---|---|
| Main entity type | {EntityName}Type | ConceptType |
| Properties type | {EntityName}PropertiesType | ConceptPropertiesType |
| Relations type | {EntityName}RelationsType | ConceptRelationsType |
| Collection type | {EntityName}ListType | ConceptListType |
Critical Rules for LLM Agents
⚠️ MUST NEVER DO
- NEVER use xsd:any
  - xsd:any creates untyped extension points that break type safety
  - All elements must be explicitly defined in the schema
  - If extensibility is needed, define explicit extension elements
- NEVER create "chameleon schemas"
  - All schemas MUST have a targetNamespace attribute
  - Chameleon schemas (without targetNamespace) take on the namespace of whichever schema pulls them in via xsd:include
  - This creates ambiguity and maintenance problems
  - Reference: https://en.wiktionary.org/wiki/chameleon_schema
- NEVER use inheritance to ADD fields to a parent type
  - Inheritance via xsd:extension MUST NOT add new elements
  - Parent types define the complete set of fields
  - Child types can only restrict or constrain parent fields
✅ CORRECT Use of Inheritance
Inheritance via xsd:restriction MAY be used to:
- Enforce stricter validations on parent type fields
- Narrow value ranges (e.g., positive integer instead of integer)
- Restrict string patterns more specifically
- Make optional fields required
- Reduce maxOccurs values
Example - CORRECT:
<!-- Parent type -->
<xsd:simpleType name="IdentifierToken">
<xsd:restriction base="xsd:string">
<xsd:minLength value="1"/>
</xsd:restriction>
</xsd:simpleType>
<!-- Child type - stricter validation -->
<xsd:simpleType name="PopulationId">
<xsd:restriction base="strings:IdentifierToken">
<xsd:pattern value="pop-[a-f0-9-]+"/>
</xsd:restriction>
</xsd:simpleType>
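The simple-type example above covers value narrowing. For the last two bullets (making an optional field required and reducing maxOccurs), a complex-type restriction could look like the sketch below. The type and element names are hypothetical. In XML Schema 1.0 a complex-type restriction has to restate the content model it keeps, so the element is declared again with the tighter occurrence bounds.

```xml
<!-- Parent type: note is optional and repeatable (hypothetical names) -->
<xsd:complexType name="NoteHolderType">
  <xsd:sequence>
    <xsd:element name="note" type="xsd:string"
                 minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

<!-- Child type: exactly one note; no new elements are introduced -->
<xsd:complexType name="SingleNoteHolderType">
  <xsd:complexContent>
    <xsd:restriction base="tns:NoteHolderType">
      <xsd:sequence>
        <xsd:element name="note" type="xsd:string"
                     minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:restriction>
  </xsd:complexContent>
</xsd:complexType>
```

Every instance that is valid against SingleNoteHolderType is also valid against NoteHolderType, which is the property that distinguishes restriction from extension.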
Example - WRONG:
<!-- NEVER DO THIS -->
<xsd:complexType name="ExtendedEntity">
<xsd:complexContent>
<xsd:extension base="tns:BaseEntity">
<!-- DO NOT add new elements via extension -->
<xsd:sequence>
<xsd:element name="newField" type="xsd:string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
Alternative - CORRECT:
<!-- Define complete type with all fields -->
<xsd:complexType name="Entity">
<xsd:sequence>
<xsd:element name="id" type="tns:EntityId"/>
<xsd:element name="properties" type="tns:EntityProperties"/>
<xsd:element name="relations" type="tns:EntityRelations" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
Generation Rules
1. Entity Structure Overview
Each entity specification generates three complex types plus a collection type:
<!-- Properties type: contains all data fields -->
<xsd:complexType name="{EntityName}PropertiesType">
<xsd:sequence>
<!-- All properties from entity specification -->
</xsd:sequence>
</xsd:complexType>
<!-- Relations type: contains all relationship references -->
<xsd:complexType name="{EntityName}RelationsType">
<xsd:sequence>
<!-- All relationships from entity specification -->
<!-- Each uses strings:EntityReference type -->
</xsd:sequence>
</xsd:complexType>
<!-- Main entity type: wraps properties and relations -->
<xsd:complexType name="{EntityName}Type">
<xsd:sequence>
<xsd:element name="properties" type="tns:{EntityName}PropertiesType" minOccurs="0"/>
<xsd:element name="relations" type="tns:{EntityName}RelationsType" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
<!-- Collection type: for lists of entities -->
<xsd:complexType name="{EntityName}ListType">
<xsd:sequence>
<xsd:element name="{entityName}" type="tns:{EntityName}Type"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
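2. Element Naming
Convert snake_case property names from the specification to camelCase for XML elements, as every schema example in this document does. PascalCase is reserved for type names and root elements (see Type Naming Convention).
| Spec Property | XML Element |
|---|---|
| id_at_origin | idAtOrigin |
| event_period | eventPeriod |
| reference_period | referencePeriod |
| unit_type | unitType |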
3. Properties Type Generation
For each property in the properties: YAML block of the entity specification, add one element to the sequence:
<xsd:complexType name="{EntityName}PropertiesType">
<xsd:sequence>
<!-- Required property (required: true) -->
<xsd:element name="{propertyName}" type="{xsd_type}"/>
<!-- Optional property (required: false) -->
<xsd:element name="{propertyName}" type="{xsd_type}" minOccurs="0"/>
<!-- Multilingual property (use maxOccurs="unbounded") -->
<xsd:element name="{propertyName}" type="strings:MultilingualShortString"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
The id_at_origin property is always first and always required.
4. Relations Type Generation
For each relationship in the entity specification's Mermaid diagram, create an element using the shared strings:EntityReference type:
<xsd:complexType name="{EntityName}RelationsType">
<xsd:sequence>
<!-- Each relationship uses EntityReference -->
<xsd:element name="{relationshipName}" type="strings:EntityReference" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
The strings:EntityReference type contains:
- Zero or more ref elements
- Each ref holds the id_at_origin of a target entity
This structure supports both single references (one ref) and multiple references (many ref elements).
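An instance fragment may make the single versus multiple case concrete. The relationship name inDataset is borrowed from the InstanceVariable example in this document; the second identifier is hypothetical. The ref elements are written with a strings prefix on the assumption that strings-1.1.xsd also uses elementFormDefault="qualified", in which case elements declared inside its types belong to the strings namespace rather than to the entity namespace.

```xml
<relations xmlns="{target_namespace}"
           xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd">
  <!-- One relationship element holding several references -->
  <inDataset>
    <strings:ref>dataset-hbs-2024</strings:ref>
    <strings:ref>dataset-hbs-2025</strings:ref>
  </inDataset>
</relations>
```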
5. Relationship Semantics
Key principle: Relationships are always expressed on the source entity as pointers to target entities.
- The source entity's RelationsType contains an element for each outgoing relationship
- Each relationship element uses the strings:EntityReference type
- The ref elements contain id_at_origin values of target entities
- Target entities do not need to declare inverse relationships
Example: InstanceVariable has relationships to Population, RepresentedVariable, and Dataset:
<xsd:complexType name="InstanceVariableRelationsType">
<xsd:sequence>
<xsd:element name="isObservationOf" type="strings:EntityReference" minOccurs="0"/>
<xsd:element name="takesMeaningFrom" type="strings:EntityReference" minOccurs="0"/>
<xsd:element name="inDataset" type="strings:EntityReference" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
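XML Instance:
<!-- Assumes strings-1.1.xsd also uses elementFormDefault="qualified",
     so ref belongs to the strings namespace -->
<InstanceVariable
    xmlns="{target_namespace}"
    xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd">
  <relations>
    <isObservationOf>
      <strings:ref>pop-sweden-2024</strings:ref>
    </isObservationOf>
    <takesMeaningFrom>
      <strings:ref>repvar-income-eur</strings:ref>
    </takesMeaningFrom>
    <inDataset>
      <strings:ref>dataset-hbs-2024</strings:ref>
    </inDataset>
  </relations>
</InstanceVariable>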
6. Constraints (Optional)
Implement constraints using XSD features when needed:
<!-- Unique constraint on idAtOrigin within a list -->
<xsd:unique name="{entityName}IdAtOriginUnique">
<xsd:selector xpath="tns:{entityName}"/>
<xsd:field xpath="tns:properties/tns:idAtOrigin"/>
</xsd:unique>
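An identity constraint cannot stand alone in a schema: xsd:unique has to be a child of an element declaration, and its selector is evaluated relative to that element. A minimal sketch of the placement, using the ConceptualVariable names from the complete example in this document, attaches the constraint to the list root element:

```xml
<xsd:element name="ConceptualVariableList" type="tns:ConceptualVariableListType">
  <!-- Each conceptualVariable in the list must have a distinct idAtOrigin -->
  <xsd:unique name="conceptualVariableIdAtOriginUnique">
    <xsd:selector xpath="tns:conceptualVariable"/>
    <xsd:field xpath="tns:properties/tns:idAtOrigin"/>
  </xsd:unique>
</xsd:element>
```

The tns prefix inside the XPath expressions is resolved against the namespace declarations of the schema document, so it must be bound on the xsd:schema element. Entries without an idAtOrigin are ignored by xsd:unique rather than rejected.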
Complete Example
For the ConceptualVariable entity (which has both properties and relationships):
Entity Specification (specs/entities/level2/conceptual-variable-1.0.md):
- Properties: id_at_origin, name, description
- Relationships: concept → Concept, unit_type → UnitType
Generated XSD:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:tns="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
targetNamespace="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
elementFormDefault="qualified"
version="1.0">
<!--
ConceptualVariable Schema
Description: An abstract concept or characteristic that can be measured,
independent of any specific representation or measurement method.
Version: 1.0
Source: specs/entities/level2/conceptual-variable-1.0.md
-->
<!-- Import common type definitions -->
<xsd:import
namespace="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
schemaLocation="../../types/strings-1.1.xsd"
/>
<!-- Properties type: all data fields from entity specification -->
<xsd:complexType name="ConceptualVariablePropertiesType">
<xsd:sequence>
<xsd:element name="idAtOrigin" type="strings:IdentifierToken">
<xsd:annotation>
<xsd:documentation>
Identifier at the source system. Used during import to determine
if the entity exists (update) or is new (create).
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="name" type="strings:MultilingualShortString"
minOccurs="0" maxOccurs="unbounded">
<xsd:annotation>
<xsd:documentation>
The name of the conceptual variable (multilingual)
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="description" type="strings:MultilingualLongString"
minOccurs="0" maxOccurs="unbounded">
<xsd:annotation>
<xsd:documentation>
Detailed conceptual definition of the variable (multilingual)
</xsd:documentation>
</xsd:annotation>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<!-- Relations type: all relationships use EntityReference -->
<xsd:complexType name="ConceptualVariableRelationsType">
<xsd:sequence>
<xsd:element name="concept" type="strings:EntityReference" minOccurs="0">
<xsd:annotation>
<xsd:documentation>
Reference to the Concept this variable is based on
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="unitType" type="strings:EntityReference" minOccurs="0">
<xsd:annotation>
<xsd:documentation>
Reference to the UnitType this variable measures
</xsd:documentation>
</xsd:annotation>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<!-- Main entity type -->
<xsd:complexType name="ConceptualVariableType">
<xsd:sequence>
<xsd:element name="properties" type="tns:ConceptualVariablePropertiesType"
minOccurs="0"/>
<xsd:element name="relations" type="tns:ConceptualVariableRelationsType"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
<!-- Collection type -->
<xsd:complexType name="ConceptualVariableListType">
<xsd:sequence>
<xsd:element name="conceptualVariable" type="tns:ConceptualVariableType"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<!-- Root elements -->
<xsd:element name="ConceptualVariable" type="tns:ConceptualVariableType"/>
<xsd:element name="ConceptualVariableList" type="tns:ConceptualVariableListType"/>
</xsd:schema>
Example XML Instance:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumes strings-1.1.xsd also uses elementFormDefault="qualified",
     so ref belongs to the strings namespace -->
<ConceptualVariable
    xmlns="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
    xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd">
  <properties>
    <idAtOrigin>cv-annual-income</idAtOrigin>
    <name xml:lang="en">Annual Income</name>
    <name xml:lang="sv">Årsinkomst</name>
    <description xml:lang="en">Total income received during a calendar year</description>
  </properties>
  <relations>
    <concept>
      <strings:ref>concept-income</strings:ref>
    </concept>
    <unitType>
      <strings:ref>unit-person</strings:ref>
    </unitType>
  </relations>
</ConceptualVariable>
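The schema also declares a ConceptualVariableList root element. A collection instance could look like the following sketch; the second entry and its identifier are hypothetical, and relations are left out because both properties and relations are optional in ConceptualVariableType.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ConceptualVariableList
    xmlns="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd">
  <!-- Child elements use the lower camelCase name from the list type -->
  <conceptualVariable>
    <properties>
      <idAtOrigin>cv-annual-income</idAtOrigin>
      <name xml:lang="en">Annual Income</name>
    </properties>
  </conceptualVariable>
  <conceptualVariable>
    <properties>
      <!-- Hypothetical second entry; name and description are optional -->
      <idAtOrigin>cv-household-size</idAtOrigin>
    </properties>
  </conceptualVariable>
</ConceptualVariableList>
```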
Best Practices
- Reuse Common Types: Always import and use types from rut-schemas/src/main/xsd/types/
- Documentation: Include comprehensive annotations
- Versioning: Include version in namespace and filename
- Separation: Separate ID, Properties, and Relations into distinct types
- Collections: Provide collection types for list operations
- Root Elements: Define root elements for both single entities and collections
- Constraints: Use XSD constraints (unique, key, keyref) where applicable
- Validation: Validate generated XSD against the XML Schema specification
- Compatibility: Maintain backward compatibility with existing schemas
- Target Namespace: ALWAYS define targetNamespace (no chameleon schemas)
- Explicit Types: NEVER use xsd:any (define all elements explicitly)
- Composition Over Extension: Use separate types instead of extending base types with new fields
- Restriction Only: Use xsd:restriction for inheritance, never xsd:extension for adding fields
- Qualified Elements: ALWAYS use elementFormDefault="qualified", never "unqualified"
Output Location
rut-schemas/src/main/xsd/
├── types/ # Common type definitions (existing)
│ ├── strings-1.1.xsd
│ ├── dates-1.0.xsd
│ ├── enums-1.0.xsd
│ └── ...
└── entities/ # Entity schemas (flat structure)
    ├── population-1.0.xsd
    ├── universe-1.0.xsd
    ├── conceptual-variable-1.0.xsd
    ├── represented-variable-1.0.xsd
    ├── data-set-1.0.xsd
    ├── statistical-classification-1.0.xsd
    └── organization-1.0.xsd
Integration with Existing Schemas
When generating new entity schemas:
1. Check whether a similar entity already exists in rut-schemas/
2. Reuse existing patterns and structure
3. Import all common types from types/ directory
4. Follow established naming conventions
5. Maintain consistent structure across all entity schemas
6. Add new common types to types/ if needed
Validation Tools
After generation:
1. Validate XSD syntax with xmllint
2. Test with sample XML instances
3. Check for circular dependencies
4. Verify namespace consistency
5. Ensure all imports resolve correctly
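6. Verify no xsd:any wildcard elements (search for the <xsd:any element itself; a plain grep for "xsd:any" also matches legitimate names such as xsd:anyURI)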
7. Verify targetNamespace exists (check schema element)
8. Verify no field-adding extensions (check for xsd:extension with new elements)
Common Anti-Patterns to Avoid
❌ Anti-Pattern 1: Using xsd:any
<!-- WRONG - Do not use xsd:any -->
<xsd:complexType name="Entity">
<xsd:sequence>
<xsd:element name="id" type="xsd:string"/>
<xsd:any namespace="##any" processContents="lax"/>
</xsd:sequence>
</xsd:complexType>
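A corrected version replaces the wildcard with explicitly declared elements. The sketch below is one possible shape; the note and homepage fields are hypothetical stand-ins for whatever content the wildcard was meant to admit.

```xml
<!-- CORRECT - every permitted element is declared explicitly -->
<xsd:complexType name="Entity">
  <xsd:sequence>
    <xsd:element name="id" type="xsd:string"/>
    <!-- Hypothetical optional fields replacing the xsd:any wildcard -->
    <xsd:element name="note" type="xsd:string" minOccurs="0"/>
    <!-- xsd:anyURI is a built-in datatype and is not affected by this rule -->
    <xsd:element name="homepage" type="xsd:anyURI" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType>
```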
❌ Anti-Pattern 2: Chameleon Schema
<!-- WRONG - Missing targetNamespace -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<!-- This becomes a chameleon schema - DO NOT DO THIS -->
</xsd:schema>
Correct:
<!-- CORRECT - Always include targetNamespace -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
elementFormDefault="qualified">
<!-- Well-defined namespace -->
</xsd:schema>
❌ Anti-Pattern 3: Adding Fields via Extension
<!-- WRONG - Do not add new fields via extension -->
<xsd:complexType name="ExtendedEntity">
<xsd:complexContent>
<xsd:extension base="tns:BaseEntity">
<xsd:sequence>
<!-- DO NOT add new elements here -->
<xsd:element name="additionalField" type="xsd:string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
Correct Approach:
<!-- CORRECT - Define complete type or use composition -->
<xsd:complexType name="EntityProperties">
<xsd:sequence>
<xsd:element name="baseField1" type="xsd:string"/>
<xsd:element name="baseField2" type="xsd:string"/>
<xsd:element name="additionalField" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
❌ Anti-Pattern 4: Using elementFormDefault="unqualified"
<!-- WRONG - Do not use unqualified -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
elementFormDefault="unqualified">
<!-- Local elements will NOT be namespace-qualified — DO NOT DO THIS -->
</xsd:schema>
Correct:
<!-- CORRECT - Always use qualified -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
elementFormDefault="qualified">
<!-- All elements are namespace-qualified -->
</xsd:schema>
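The effect of the setting is easiest to see in an instance written with an explicit prefix. The sketch below reuses the ConceptualVariable namespace and element names from the complete example in this document; the cv prefix is arbitrary.

```xml
<!-- With elementFormDefault="qualified", nested elements carry the
     target namespace just like the root element does -->
<cv:ConceptualVariable
    xmlns:cv="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd">
  <cv:properties>
    <cv:idAtOrigin>cv-annual-income</cv:idAtOrigin>
  </cv:properties>
</cv:ConceptualVariable>
```

Under "unqualified", only the globally declared root element would be in the target namespace, and properties and idAtOrigin would have to appear without any namespace. A document that sets a default xmlns on the root would then fail validation, which is why the rule insists on "qualified".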
✅ Correct Pattern: Restriction for Validation
<!-- CORRECT - Use restriction for stricter validation -->
<xsd:simpleType name="PositiveQuantity">
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="1"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="SmallPositiveQuantity">
<xsd:restriction base="tns:PositiveQuantity">
<xsd:maxInclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>