XML Schema (XSD) Generation Guidelines

Version: 1.0.0
Target: XML Schema 1.0
Purpose: Generate XSD files from entity specifications

⚠️ CRITICAL RULES FOR LLM AGENTS

These rules are MANDATORY and must NEVER be violated:

  1. 🚫 NEVER use xsd:any - All elements must be explicitly defined
  2. 🚫 NEVER create chameleon schemas - Always include targetNamespace
  3. 🚫 NEVER use inheritance to ADD fields - Only use xsd:restriction for stricter validation
  4. 🚫 NEVER use elementFormDefault="unqualified" - All schemas must use elementFormDefault="qualified" so that every element in an XML instance is namespace-qualified
  5. ✅ USE composition - Define complete types with all fields explicitly
  6. ✅ USE restriction - For narrowing validation rules only

See detailed explanations in the "Critical Rules" section below.


Alignment with Existing Schemas

IMPORTANT: The rut-schemas/ directory contains normative XSD definitions. Generated schemas must:

  • Align with existing type definitions in rut-schemas/src/main/xsd/types/
  • Import and reuse common types (strings, dates, enums, etc.)
  • Follow established naming patterns and structure
  • Maintain backward compatibility

Type Mappings

Map data types from specs/_meta/data-types.md to XSD types:

type_mappings:
  # Primitive Types
  string: "xsd:string"
  text: "xsd:string"
  integer: "xsd:integer"
  long: "xsd:long"
  positive_integer: "xsd:positiveInteger"
  decimal: "xsd:decimal"
  boolean: "xsd:boolean"
  datetime: "xsd:dateTime"
  date: "xsd:date"

  # Formatted Types
  uuid: "xsd:string"
  uri: "xsd:anyURI"
  email: "xsd:string"
  language_code: "xsd:language"
  version_string: "xsd:string"

  # Restricted String Types (use existing types)
  short_string: "strings:ShortString"
  long_string: "strings:LongString"
  identifier_token: "strings:IdentifierToken"

  # Date Range Types (use existing types)
  date_range: "dates:DateRange"
  date_range_open_end: "dates:DateRangeOpenEnd"

  # Coordinate Types (use existing types)
  coordinates: "coordinates:Coordinates"

  # Metadata Types (use existing types)
  extraction_metadata: "meta:Meta"
  source_id_triple: "sourceId:SourceId3"

  # Enumeration Types (use existing types)
  dataset_collection_type: "enums:DatasetCollectionType"
  personal_data_status: "enums:PersonalDataStatus"
  statistical_design_type: "enums:StatisticalDesignType"
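
A minimal sketch of how a generator might apply this table, assuming the mapping has been loaded into a Python dict. Only a few entries are reproduced here, and the function name resolve_xsd_type is hypothetical. Unmapped spec types raise an error rather than falling back to a loose type, in line with the rule that everything must be explicitly defined.

```python
# Subset of the mapping table above: spec data type -> XSD type QName.
# The xsd: prefix is the one bound in the XSD File Template (xmlns:xsd).
TYPE_MAPPINGS = {
    "string": "xsd:string",
    "integer": "xsd:integer",
    "boolean": "xsd:boolean",
    "datetime": "xsd:dateTime",
    "uri": "xsd:anyURI",
    "short_string": "strings:ShortString",
    "identifier_token": "strings:IdentifierToken",
    "date_range": "dates:DateRange",
}


def resolve_xsd_type(spec_type: str) -> str:
    """Return the XSD type for a spec data type, or fail loudly if it is unmapped."""
    try:
        return TYPE_MAPPINGS[spec_type]
    except KeyError:
        raise ValueError(f"No XSD mapping for spec type {spec_type!r}") from None
```

For example, resolve_xsd_type("identifier_token") yields the existing strings:IdentifierToken type instead of a newly invented one.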

Namespace Convention

namespaces:
  target_namespace: "https://schemas.rutdev.se/xsd/entities/{entity}-{version}.xsd"

  common_imports:
    strings: "https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
    dates: "https://schemas.rutdev.se/xsd/types/dates-1.0.xsd"
    enums: "https://schemas.rutdev.se/xsd/types/enums-1.0.xsd"
    coordinates: "https://schemas.rutdev.se/xsd/types/coordinates-1.0.xsd"
    meta: "https://schemas.rutdev.se/xsd/types/meta-1.0.xsd"
    sourceId: "https://schemas.rutdev.se/xsd/types/sourceId-1.0.xsd"
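
The convention above can be sketched as two small helpers: one fills in the target namespace pattern, the other works out which common namespaces a schema needs to import from the type names it uses. The function names are hypothetical, and the sketch assumes that {entity} is the kebab-case entity name used in the file names.

```python
TARGET_NAMESPACE_PATTERN = "https://schemas.rutdev.se/xsd/entities/{entity}-{version}.xsd"

# Prefix -> namespace URI, copied from common_imports above.
COMMON_IMPORTS = {
    "strings": "https://schemas.rutdev.se/xsd/types/strings-1.1.xsd",
    "dates": "https://schemas.rutdev.se/xsd/types/dates-1.0.xsd",
    "enums": "https://schemas.rutdev.se/xsd/types/enums-1.0.xsd",
    "coordinates": "https://schemas.rutdev.se/xsd/types/coordinates-1.0.xsd",
    "meta": "https://schemas.rutdev.se/xsd/types/meta-1.0.xsd",
    "sourceId": "https://schemas.rutdev.se/xsd/types/sourceId-1.0.xsd",
}


def target_namespace(entity: str, version: str) -> str:
    """Fill in the target namespace pattern (entity assumed to be kebab-case)."""
    return TARGET_NAMESPACE_PATTERN.format(entity=entity, version=version)


def namespaces_to_import(xsd_types: list[str]) -> dict[str, str]:
    """Map each common-type prefix used in xsd_types to its namespace URI."""
    prefixes = sorted({t.split(":", 1)[0] for t in xsd_types if ":" in t})
    # Built-in xsd: types are skipped because they are not in COMMON_IMPORTS.
    return {p: COMMON_IMPORTS[p] for p in prefixes if p in COMMON_IMPORTS}
```

Each entry returned by namespaces_to_import corresponds to one xmlns declaration and one xsd:import in the generated schema.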

XSD File Template

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="{target_namespace}"
    xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
    targetNamespace="{target_namespace}"
    elementFormDefault="qualified"
    version="{entity_version}">

    <!--
      {EntityName} Schema

      Description: {entity_description}
      Version: {version}
      Generated: {ISO_date}
      Source: specs/entities/{level}/{version}/{entity}.md
    -->

    <!-- Import common type definitions -->
    <xsd:import
        namespace="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
        schemaLocation="../../types/strings-1.1.xsd"
    />
    <!-- Additional imports as needed (dates, enums, etc.) -->

    <!-- Properties complex type -->
    <xsd:complexType name="{EntityName}PropertiesType">
        <xsd:sequence>
            <!-- Property elements from entity specification -->
            <xsd:element name="idAtOrigin" type="strings:IdentifierToken"/>
            <!-- Additional properties... -->
        </xsd:sequence>
    </xsd:complexType>

    <!-- Relations complex type (if entity has relationships) -->
    <xsd:complexType name="{EntityName}RelationsType">
        <xsd:sequence>
            <!-- Each relationship from entity specification -->
            <!-- Uses strings:EntityReference for all relationship elements -->
            <xsd:element name="{relationshipName}" type="strings:EntityReference" minOccurs="0"/>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Main entity complex type -->
    <xsd:complexType name="{EntityName}Type">
        <xsd:sequence>
            <xsd:element name="properties" type="tns:{EntityName}PropertiesType" minOccurs="0"/>
            <xsd:element name="relations" type="tns:{EntityName}RelationsType" minOccurs="0"/>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Collection type -->
    <xsd:complexType name="{EntityName}ListType">
        <xsd:sequence>
            <xsd:element name="{entityName}" type="tns:{EntityName}Type" 
                         minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Root elements -->
    <xsd:element name="{EntityName}" type="tns:{EntityName}Type"/>
    <xsd:element name="{EntityName}List" type="tns:{EntityName}ListType"/>

</xsd:schema>

Type Naming Convention

Component          Naming Pattern                Example
Main entity type   {EntityName}Type              ConceptType
Properties type    {EntityName}PropertiesType    ConceptPropertiesType
Relations type     {EntityName}RelationsType     ConceptRelationsType
Collection type    {EntityName}ListType          ConceptListType

Critical Rules for LLM Agents

⚠️ MUST NEVER DO

  1. NEVER use xsd:any
     • xsd:any creates untyped extension points that break type safety
     • All elements must be explicitly defined in the schema
     • If extensibility is needed, define explicit extension elements

  2. NEVER create "chameleon schemas"
     • All schemas MUST have a targetNamespace attribute
     • Chameleon schemas (without targetNamespace) adopt the target namespace of whichever schema includes them via xsd:include
     • This creates ambiguity and maintenance problems
     • Reference: https://en.wiktionary.org/wiki/chameleon_schema

  3. NEVER use inheritance to ADD fields to a parent type
     • Inheritance via xsd:extension MUST NOT add new elements
     • Parent types define the complete set of fields
     • Child types can only restrict or constrain parent fields

✅ CORRECT Use of Inheritance

Inheritance via xsd:restriction MAY be used to:

  • Enforce stricter validations on parent type fields
  • Narrow value ranges (e.g., positive integer instead of integer)
  • Restrict string patterns more specifically
  • Make optional fields required
  • Reduce maxOccurs values

Example - CORRECT:

<!-- Parent type -->
<xsd:simpleType name="IdentifierToken">
    <xsd:restriction base="xsd:string">
        <xsd:minLength value="1"/>
    </xsd:restriction>
</xsd:simpleType>

<!-- Child type - stricter validation -->
<xsd:simpleType name="PopulationId">
    <xsd:restriction base="strings:IdentifierToken">
        <xsd:pattern value="pop-[a-f0-9-]+"/>
    </xsd:restriction>
</xsd:simpleType>

Example - WRONG:

<!-- NEVER DO THIS -->
<xsd:complexType name="ExtendedEntity">
    <xsd:complexContent>
        <xsd:extension base="tns:BaseEntity">
            <!-- DO NOT add new elements via extension -->
            <xsd:sequence>
                <xsd:element name="newField" type="xsd:string"/>
            </xsd:sequence>
        </xsd:extension>
    </xsd:complexContent>
</xsd:complexType>

Alternative - CORRECT:

<!-- Define complete type with all fields -->
<xsd:complexType name="Entity">
    <xsd:sequence>
        <xsd:element name="id" type="tns:EntityId"/>
        <xsd:element name="properties" type="tns:EntityProperties"/>
        <xsd:element name="relations" type="tns:EntityRelations" minOccurs="0"/>
    </xsd:sequence>
</xsd:complexType>

Generation Rules

1. Entity Structure Overview

Each entity specification generates three complex types plus a collection type:

<!-- Properties type: contains all data fields -->
<xsd:complexType name="{EntityName}PropertiesType">
    <xsd:sequence>
        <!-- All properties from entity specification -->
    </xsd:sequence>
</xsd:complexType>

<!-- Relations type: contains all relationship references -->
<xsd:complexType name="{EntityName}RelationsType">
    <xsd:sequence>
        <!-- All relationships from entity specification -->
        <!-- Each uses strings:EntityReference type -->
    </xsd:sequence>
</xsd:complexType>

<!-- Main entity type: wraps properties and relations -->
<xsd:complexType name="{EntityName}Type">
    <xsd:sequence>
        <xsd:element name="properties" type="tns:{EntityName}PropertiesType" minOccurs="0"/>
        <xsd:element name="relations" type="tns:{EntityName}RelationsType" minOccurs="0"/>
    </xsd:sequence>
</xsd:complexType>

<!-- Collection type: for lists of entities -->
<xsd:complexType name="{EntityName}ListType">
    <xsd:sequence>
        <xsd:element name="{entityName}" type="tns:{EntityName}Type" 
                     minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
</xsd:complexType>

2. Element Naming

Convert snake_case property names from the specification to camelCase for XML elements, as in the templates and examples throughout this document:

Spec Property       XML Element
id_at_origin        idAtOrigin
event_period        eventPeriod
reference_period    referencePeriod
unit_type           unitType
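
The templates and the complete example in this document declare elements such as idAtOrigin and unitType for the spec names id_at_origin and unit_type. A minimal sketch of that conversion follows. The function name to_element_name is hypothetical, and names that contain no underscore are passed through with only the first letter lowercased.

```python
def to_element_name(spec_name: str) -> str:
    """Convert a snake_case spec name (id_at_origin) to a camelCase element name."""
    head, *tail = spec_name.split("_")
    # Lowercase only the first letter of the first word so that an already
    # camelCase name such as idAtOrigin is left unchanged.
    first = head[:1].lower() + head[1:]
    # Uppercase the first letter of every following word.
    return first + "".join(part[:1].upper() + part[1:] for part in tail)
```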

3. Properties Type Generation

For each property listed in the properties: YAML block of the entity specification, emit one element in the sequence:

<xsd:complexType name="{EntityName}PropertiesType">
    <xsd:sequence>
        <!-- Required property (required: true) -->
        <xsd:element name="{propertyName}" type="{xsd_type}"/>

        <!-- Optional property (required: false) -->
        <xsd:element name="{propertyName}" type="{xsd_type}" minOccurs="0"/>

        <!-- Multilingual property (use maxOccurs="unbounded") -->
        <xsd:element name="{propertyName}" type="strings:MultilingualShortString" 
                     minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
</xsd:complexType>

The id_at_origin property is always first and always required.
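
The three cases above (required, optional, multilingual) can be sketched with the standard library's ElementTree. The shape of the property dicts (name, type, required, multilingual) is an assumption made for illustration and is not the format of the entity specification itself.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD_NS)  # serialize with the xsd: prefix


def build_properties_type(entity_name, properties):
    """Build the {EntityName}PropertiesType element from a list of property dicts.

    Each dict is assumed to carry: name, type, and optional boolean flags
    required and multilingual.
    """
    first = properties[0] if properties else {}
    if first.get("name") != "idAtOrigin" or not first.get("required"):
        raise ValueError("idAtOrigin must be the first property and must be required")

    complex_type = ET.Element(
        f"{{{XSD_NS}}}complexType", {"name": f"{entity_name}PropertiesType"}
    )
    sequence = ET.SubElement(complex_type, f"{{{XSD_NS}}}sequence")
    for prop in properties:
        attrs = {"name": prop["name"], "type": prop["type"]}
        if prop.get("multilingual"):
            # One element per language, so any number of occurrences is allowed.
            attrs["minOccurs"] = "0"
            attrs["maxOccurs"] = "unbounded"
        elif not prop.get("required"):
            # Optional property; required ones keep the default minOccurs of 1.
            attrs["minOccurs"] = "0"
        ET.SubElement(sequence, f"{{{XSD_NS}}}element", attrs)
    return complex_type
```

Calling ET.tostring(build_properties_type(...), encoding="unicode") gives a fragment to place inside the schema template. Prefixes such as strings: in the type attributes are plain strings here, so they still need to be declared on the enclosing xsd:schema element.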

4. Relations Type Generation

For each relationship in the entity specification's Mermaid diagram, create an element using the shared strings:EntityReference type:

<xsd:complexType name="{EntityName}RelationsType">
    <xsd:sequence>
        <!-- Each relationship uses EntityReference -->
        <xsd:element name="{relationshipName}" type="strings:EntityReference" minOccurs="0"/>
    </xsd:sequence>
</xsd:complexType>

The strings:EntityReference type contains:

  • Zero or more ref elements
  • Each ref holds the id_at_origin of a target entity

This structure supports both single references (one ref) and multiple references (many ref elements).

5. Relationship Semantics

Key principle: Relationships are always expressed on the source entity as pointers to target entities.

  • The source entity's RelationsType contains an element for each outgoing relationship
  • Each relationship element uses strings:EntityReference type
  • The ref elements contain id_at_origin values of target entities
  • Target entities do not need to declare inverse relationships

Example: InstanceVariable has relationships to Population, RepresentedVariable, and Dataset:

<xsd:complexType name="InstanceVariableRelationsType">
    <xsd:sequence>
        <xsd:element name="isObservationOf" type="strings:EntityReference" minOccurs="0"/>
        <xsd:element name="takesMeaningFrom" type="strings:EntityReference" minOccurs="0"/>
        <xsd:element name="inDataset" type="strings:EntityReference" minOccurs="0"/>
    </xsd:sequence>
</xsd:complexType>

XML Instance:

<InstanceVariable>
    <relations>
        <isObservationOf>
            <ref>pop-sweden-2024</ref>
        </isObservationOf>
        <takesMeaningFrom>
            <ref>repvar-income-eur</ref>
        </takesMeaningFrom>
        <inDataset>
            <ref>dataset-hbs-2024</ref>
        </inDataset>
    </relations>
</InstanceVariable>
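
A rough sketch of producing such a relations block from source-side data, using the same un-namespaced element names as the instance above. The function name build_relations and the input shape (relationship name mapped to a list of target id_at_origin values) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_relations(relations):
    """Build a <relations> element from {relationshipName: [target id_at_origin, ...]}.

    The dict must list relationships in the order declared in the schema's
    xsd:sequence, because element order matters for validation.
    """
    relations_el = ET.Element("relations")
    for name, target_ids in relations.items():
        if not target_ids:
            continue  # relationship elements are optional (minOccurs="0")
        relationship_el = ET.SubElement(relations_el, name)
        for target_id in target_ids:
            # One <ref> per target entity: a single or a multiple reference.
            ET.SubElement(relationship_el, "ref").text = target_id
    return relations_el
```

Only the source entity is touched. Nothing is written on the target entities, which matches the principle that inverse relationships are not declared.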

6. Constraints (Optional)

Implement constraints using XSD features when needed:

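<!-- Unique constraint on idAtOrigin within a list. -->
<!-- Identity constraints must sit inside an element declaration, here the list root element. -->
<xsd:element name="{EntityName}List" type="tns:{EntityName}ListType">
    <xsd:unique name="{entityName}IdAtOriginUnique">
        <xsd:selector xpath="tns:{entityName}"/>
        <xsd:field xpath="tns:properties/tns:idAtOrigin"/>
    </xsd:unique>
</xsd:element>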
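
What this constraint enforces can also be mirrored in a quick pre-validation check, which may give friendlier error messages than a schema validator. This is a sketch only: the function name duplicate_ids is hypothetical and it compares the raw text values as plain strings.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(list_root, entity_element, namespace):
    """Return idAtOrigin values that occur more than once under a list root element."""
    ns = {"tns": namespace}
    # Same path as the selector and field of the xsd:unique constraint.
    path = f"tns:{entity_element}/tns:properties/tns:idAtOrigin"
    counts = Counter(el.text for el in list_root.findall(path, ns) if el.text)
    return sorted(value for value, n in counts.items() if n > 1)
```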

Complete Example

For the ConceptualVariable entity (which has both properties and relationships):

Entity Specification (specs/entities/level2/conceptual-variable-1.0.md):

  • Properties: id_at_origin, name, description
  • Relationships: concept → Concept, unit_type → UnitType

Generated XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
    xmlns:strings="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
    targetNamespace="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
    elementFormDefault="qualified"
    version="1.0">

    <!--
      ConceptualVariable Schema

      Description: An abstract concept or characteristic that can be measured,
                   independent of any specific representation or measurement method.
      Version: 1.0
      Source: specs/entities/level2/conceptual-variable-1.0.md
    -->

    <!-- Import common type definitions -->
    <xsd:import
        namespace="https://schemas.rutdev.se/xsd/types/strings-1.1.xsd"
        schemaLocation="../../types/strings-1.1.xsd"
    />

    <!-- Properties type: all data fields from entity specification -->
    <xsd:complexType name="ConceptualVariablePropertiesType">
        <xsd:sequence>
            <xsd:element name="idAtOrigin" type="strings:IdentifierToken">
                <xsd:annotation>
                    <xsd:documentation>
                        Identifier at the source system. Used during import to determine 
                        if the entity exists (update) or is new (create).
                    </xsd:documentation>
                </xsd:annotation>
            </xsd:element>
            <xsd:element name="name" type="strings:MultilingualShortString" 
                         minOccurs="0" maxOccurs="unbounded">
                <xsd:annotation>
                    <xsd:documentation>
                        The name of the conceptual variable (multilingual)
                    </xsd:documentation>
                </xsd:annotation>
            </xsd:element>
            <xsd:element name="description" type="strings:MultilingualLongString" 
                         minOccurs="0" maxOccurs="unbounded">
                <xsd:annotation>
                    <xsd:documentation>
                        Detailed conceptual definition of the variable (multilingual)
                    </xsd:documentation>
                </xsd:annotation>
            </xsd:element>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Relations type: all relationships use EntityReference -->
    <xsd:complexType name="ConceptualVariableRelationsType">
        <xsd:sequence>
            <xsd:element name="concept" type="strings:EntityReference" minOccurs="0">
                <xsd:annotation>
                    <xsd:documentation>
                        Reference to the Concept this variable is based on
                    </xsd:documentation>
                </xsd:annotation>
            </xsd:element>
            <xsd:element name="unitType" type="strings:EntityReference" minOccurs="0">
                <xsd:annotation>
                    <xsd:documentation>
                        Reference to the UnitType this variable measures
                    </xsd:documentation>
                </xsd:annotation>
            </xsd:element>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Main entity type -->
    <xsd:complexType name="ConceptualVariableType">
        <xsd:sequence>
            <xsd:element name="properties" type="tns:ConceptualVariablePropertiesType" 
                         minOccurs="0"/>
            <xsd:element name="relations" type="tns:ConceptualVariableRelationsType" 
                         minOccurs="0"/>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Collection type -->
    <xsd:complexType name="ConceptualVariableListType">
        <xsd:sequence>
            <xsd:element name="conceptualVariable" type="tns:ConceptualVariableType" 
                         minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Root elements -->
    <xsd:element name="ConceptualVariable" type="tns:ConceptualVariableType"/>
    <xsd:element name="ConceptualVariableList" type="tns:ConceptualVariableListType"/>

</xsd:schema>

Example XML Instance:

<?xml version="1.0" encoding="UTF-8"?>
<ConceptualVariable xmlns="https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd">
    <properties>
        <idAtOrigin>cv-annual-income</idAtOrigin>
        <name xml:lang="en">Annual Income</name>
        <name xml:lang="sv">Årsinkomst</name>
        <description xml:lang="en">Total income received during a calendar year</description>
    </properties>
    <relations>
        <concept>
            <ref>concept-income</ref>
        </concept>
        <unitType>
            <ref>unit-person</ref>
        </unitType>
    </relations>
</ConceptualVariable>
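
Because the schema uses elementFormDefault="qualified", a consumer has to address every element through the target namespace. The sketch below reads the multilingual names out of an instance like the one above with ElementTree. The function name and the shortened SAMPLE document are illustrative only.

```python
import xml.etree.ElementTree as ET

TNS = "https://schemas.rutdev.se/xsd/entities/level2/conceptual-variable-1.0.xsd"
# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened copy of the instance above (illustrative only).
SAMPLE = f"""
<ConceptualVariable xmlns="{TNS}">
    <properties>
        <idAtOrigin>cv-annual-income</idAtOrigin>
        <name xml:lang="en">Annual Income</name>
        <name xml:lang="sv">Årsinkomst</name>
    </properties>
</ConceptualVariable>
"""


def names_by_language(instance_xml: str) -> dict:
    """Return {language code: name} for a ConceptualVariable instance."""
    root = ET.fromstring(instance_xml)
    ns = {"cv": TNS}
    # Every step of the path is namespace-qualified, including local elements.
    return {
        el.get(XML_LANG): el.text
        for el in root.findall("cv:properties/cv:name", ns)
    }
```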

Best Practices

  1. Reuse Common Types: Always import and use types from rut-schemas/src/main/xsd/types/
  2. Documentation: Include comprehensive annotations
  3. Versioning: Include version in namespace and filename
  4. Separation: Separate ID, Properties, and Relations into distinct types
  5. Collections: Provide collection types for list operations
  6. Root Elements: Define root elements for both single and collection
  7. Constraints: Use XSD constraints (unique, key, keyref) where applicable
  8. Validation: Validate generated XSD against XML Schema spec
  9. Compatibility: Maintain backward compatibility with existing schemas
  10. Target Namespace: ALWAYS define targetNamespace (no chameleon schemas)
  11. Explicit Types: NEVER use xsd:any (define all elements explicitly)
  12. Composition Over Extension: Use separate types instead of extending base types with new fields
  13. Restriction Only: Use xsd:restriction for inheritance, never xsd:extension for adding fields
  14. Qualified Elements: ALWAYS use elementFormDefault="qualified" — never "unqualified"

Output Location

rut-schemas/src/main/xsd/
├── types/                    # Common type definitions (existing)
│   ├── strings-1.1.xsd
│   ├── dates-1.0.xsd
│   ├── enums-1.0.xsd
│   └── ...
└── entities/                 # Entity schemas (flat structure)
    ├── population-1.0.xsd
    ├── universe-1.0.xsd
    ├── conceptual-variable-1.0.xsd
    ├── represented-variable-1.0.xsd
    ├── data-set-1.0.xsd
    ├── statistical-classification-1.0.xsd
    └── organization-1.0.xsd
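
A small sketch of deriving the output file path for an entity, assuming the generator starts from the PascalCase entity name and the file name is its kebab-case form, as in the tree above. The helper name output_path is hypothetical.

```python
import re
from pathlib import PurePosixPath

ENTITIES_DIR = PurePosixPath("rut-schemas/src/main/xsd/entities")


def output_path(entity_name: str, version: str) -> PurePosixPath:
    """Return the schema path for a PascalCase entity name such as ConceptualVariable."""
    # Insert a hyphen before every capital letter except the first, then lowercase.
    kebab = re.sub(r"(?<!^)(?=[A-Z])", "-", entity_name).lower()
    return ENTITIES_DIR / f"{kebab}-{version}.xsd"
```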

Integration with Existing Schemas

When generating new entity schemas:

  1. Check whether a similar entity already exists in rut-schemas/
  2. Reuse existing patterns and structure
  3. Import all common types from the types/ directory
  4. Follow established naming conventions
  5. Maintain consistent structure across all entity schemas
  6. Add new common types to types/ if needed

Validation Tools

After generation:

  1. Validate XSD syntax with xmllint
  2. Test with sample XML instances
  3. Check for circular dependencies
  4. Verify namespace consistency
  5. Ensure all imports resolve correctly
  6. Verify no xsd:any elements (search for the <xsd:any element itself; the bare string "xsd:any" also matches type names such as xsd:anyURI)
  7. Verify targetNamespace exists (check schema element)
  8. Verify no field-adding extensions (check for xsd:extension with new elements)
  9. Verify elementFormDefault="qualified" (check schema element)
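
The rule-specific checks in this list can be automated with a short script. The following is a sketch, not a replacement for full schema validation, and the function name check_schema is hypothetical. It matches the xsd:any element by its qualified tag, so legitimate uses of xsd:anyURI are not reported.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def check_schema(xsd_text: str) -> list:
    """Return a list of rule violations found in a schema document (empty if none)."""
    root = ET.fromstring(xsd_text)
    problems = []
    if not root.get("targetNamespace"):
        problems.append("missing targetNamespace (chameleon schema)")
    if root.get("elementFormDefault") != "qualified":
        problems.append('elementFormDefault is not "qualified"')
    # Look for the wildcard element itself, not the substring "xsd:any".
    if root.find(f".//{XSD}any") is not None:
        problems.append("xsd:any wildcard found")
    for extension in root.iter(f"{XSD}extension"):
        # An extension that declares elements is adding fields to its base type.
        if extension.find(f".//{XSD}element") is not None:
            problems.append(f"xsd:extension of {extension.get('base')} adds elements")
    return problems
```

An empty result means none of these checks fired. The schema should still be run through xmllint and tested against sample instances.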

Common Anti-Patterns to Avoid

❌ Anti-Pattern 1: Using xsd:any

<!-- WRONG - Do not use xsd:any -->
<xsd:complexType name="Entity">
    <xsd:sequence>
        <xsd:element name="id" type="xsd:string"/>
        <xsd:any namespace="##any" processContents="lax"/>
    </xsd:sequence>
</xsd:complexType>

❌ Anti-Pattern 2: Chameleon Schema

<!-- WRONG - Missing targetNamespace -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified">
    <!-- This becomes a chameleon schema - DO NOT DO THIS -->
</xsd:schema>

Correct:

<!-- CORRECT - Always include targetNamespace -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
            elementFormDefault="qualified">
    <!-- Well-defined namespace -->
</xsd:schema>

❌ Anti-Pattern 3: Adding Fields via Extension

<!-- WRONG - Do not add new fields via extension -->
<xsd:complexType name="ExtendedEntity">
    <xsd:complexContent>
        <xsd:extension base="tns:BaseEntity">
            <xsd:sequence>
                <!-- DO NOT add new elements here -->
                <xsd:element name="additionalField" type="xsd:string"/>
            </xsd:sequence>
        </xsd:extension>
    </xsd:complexContent>
</xsd:complexType>

Correct Approach:

<!-- CORRECT - Define complete type or use composition -->
<xsd:complexType name="EntityProperties">
    <xsd:sequence>
        <xsd:element name="baseField1" type="xsd:string"/>
        <xsd:element name="baseField2" type="xsd:string"/>
        <xsd:element name="additionalField" type="xsd:string"/>
    </xsd:sequence>
</xsd:complexType>

❌ Anti-Pattern 4: Using elementFormDefault="unqualified"

<!-- WRONG - Do not use unqualified -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
            elementFormDefault="unqualified">
    <!-- Local elements will NOT be namespace-qualified — DO NOT DO THIS -->
</xsd:schema>

Correct:

<!-- CORRECT - Always use qualified -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="https://schemas.rutdev.se/xsd/entities/entity-1.0.xsd"
            elementFormDefault="qualified">
    <!-- All elements are namespace-qualified -->
</xsd:schema>

✅ Correct Pattern: Restriction for Validation

<!-- CORRECT - Use restriction for stricter validation -->
<xsd:simpleType name="PositiveQuantity">
    <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="1"/>
    </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="SmallPositiveQuantity">
    <xsd:restriction base="tns:PositiveQuantity">
        <xsd:maxInclusive value="100"/>
    </xsd:restriction>
</xsd:simpleType>