Kjell Tore Guttormsen 6a7632146e feat(ms-ai-architect): add plugin to open marketplace (v1.5.0 baseline)

Initial addition of ms-ai-architect plugin to the open-source marketplace.
Private content excluded: orchestrator/ (Linear tooling), docs/utredning/
(client investigation), generated test reports and PDF export script.
skill-gen tooling moved from orchestrator/ to scripts/skill-gen/.

Security scan: WARNING (risk 20/100) — no secrets, no injection found.
False positive fixed: added gitleaks:allow to Python variable reference
in output-validation-grounding-verification.md line 109.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-07 17:17:17 +02:00


Cross-Cloud Data Integration

Last updated: 2026-02 Status: GA Category: Data Engineering for AI

Introduction

Many organizations operate in multi-cloud environments where data is spread across Azure, AWS, Google Cloud, and on-premises systems. AI solutions that need data from several of these sources depend on an effective cross-cloud data integration strategy. Microsoft Fabric's OneLake and its shortcut architecture make it possible to bring data from different cloud platforms together virtually, without physically copying it, which reduces both egress costs and complexity.

OneLake acts as a single virtual data lake for the whole organization. Shortcuts create references to data in Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2, and other storage sources. With shortcut caching, Fabric can reduce cross-cloud egress costs by keeping frequently read files in a workspace-level cache.

For the Norwegian public sector, where data sovereignty and storage of data in Norway/EEA are regulated, cross-cloud integration is particularly sensitive. Shortcuts and caching make it possible to integrate data from different sources without moving sensitive data out of approved storage locations.

Multi-Cloud Connector Strategies

OneLake Shortcuts som primaerstrategi

OneLake shortcuts er den foretrukne mekanismen for krysssky-dataintegrasjon i Fabric:

Kilde	Shortcut-type	Autentisering	Caching
Azure Data Lake Gen2	ADLS shortcut	Service principal / Account key	Nei (samme sky)
Amazon S3	S3 shortcut	IAM Access Key / Secret	Ja (1-28 dager)
Google Cloud Storage	GCS shortcut	Service Account JSON	Ja (1-28 dager)
S3-kompatibel	S3-compatible	Access Key / Secret	Ja (1-28 dager)
On-premises	Via OPDG	On-premises Data Gateway	Ja (1-28 dager)
Annen Fabric-tenant	OneLake shortcut	Data Sharing invitation	Nei

Opprette shortcuts til ulike skyplattformer

import requests

# Placeholders: supply a Microsoft Entra access token for the Fabric API
# and the IDs of the target workspace and lakehouse.
access_token = "<access-token>"
workspace_id = "<workspace-id>"
lakehouse_id = "<lakehouse-id>"

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

# --- AWS S3 shortcut ---
s3_shortcut = {
    "name": "aws_training_data",
    "path": "Files/external/aws",
    "target": {
        "amazonS3": {
            "location": "https://my-bucket.s3.eu-north-1.amazonaws.com",
            "subpath": "/ai-data/training/",
            "connectionId": "s3-connection-id"
        }
    }
}

# --- Google Cloud Storage shortcut ---
gcs_shortcut = {
    "name": "gcp_sensor_data",
    "path": "Files/external/gcp",
    "target": {
        "googleCloudStorage": {
            "location": "https://storage.googleapis.com/my-gcs-bucket",
            "subpath": "/sensor-readings/",
            "connectionId": "gcs-connection-id"
        }
    }
}

# --- On-premises S3-compatible storage, reached through the data gateway ---
onprem_shortcut = {
    "name": "onprem_legacy_data",
    "path": "Files/external/onprem",
    "target": {
        # S3-compatible sources use the "s3Compatible" target type, which
        # takes the bucket as a separate field. The bucket is a placeholder.
        "s3Compatible": {
            "location": "https://minio.internal.no:9000",
            "bucket": "<bucket-name>",
            "subpath": "/legacy-data/",
            "connectionId": "onprem-s3-connection-id"
        }
    }
}

# Create the shortcuts and report the outcome of each request
for shortcut in [s3_shortcut, gcs_shortcut, onprem_shortcut]:
    response = requests.post(
        f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts",
        headers=headers,
        json=shortcut,
        timeout=30
    )
    outcome = "Created" if response.ok else "Failed to create"
    print(f"{outcome} shortcut '{shortcut['name']}': {response.status_code}")
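
The payloads above differ only in the target type and, for S3-compatible storage, an extra bucket field. A small builder can make that difference explicit and catch a missing bucket before any request is sent. This is a minimal sketch: `build_shortcut_payload` and its parameters are hypothetical names, and the target keys are assumed to match the Fabric shortcut API, so check them against the API reference before relying on them.

```python
from typing import Optional

# Assumed mapping from a short source label to the target key in the request body.
TARGET_KEYS = {
    "s3": "amazonS3",
    "gcs": "googleCloudStorage",
    "s3_compatible": "s3Compatible",
}


def build_shortcut_payload(
    kind: str,
    name: str,
    path: str,
    location: str,
    subpath: str,
    connection_id: str,
    bucket: Optional[str] = None,
) -> dict:
    """Build a shortcut request body for one of the supported source kinds."""
    if kind not in TARGET_KEYS:
        raise ValueError(f"Unsupported shortcut kind: {kind}")

    target = {
        "location": location,
        "subpath": subpath,
        "connectionId": connection_id,
    }
    if kind == "s3_compatible":
        # S3-compatible endpoints do not encode the bucket in the host name.
        if not bucket:
            raise ValueError("s3_compatible shortcuts need a bucket")
        target["bucket"] = bucket

    return {"name": name, "path": path, "target": {TARGET_KEYS[kind]: target}}


payload = build_shortcut_payload(
    "s3",
    "aws_training_data",
    "Files/external/aws",
    "https://my-bucket.s3.eu-north-1.amazonaws.com",
    "/ai-data/training/",
    "s3-connection-id",
)
```

The returned dictionary can be passed as the `json` argument of the POST request shown earlier.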

Data Factory Connectors for ETL

For scenarios where shortcuts are not sufficient (transformation, filtering, format conversion), use a Data Factory copy activity:

{
    "name": "CopyFromAWSToFabric",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "AmazonS3ParquetSource",
            "type": "DatasetReference",
            "parameters": {
                "bucket": "ai-training-data",
                "prefix": "features/2026/02/"
            }
        }
    ],
    "outputs": [
        {
            "referenceName": "FabricLakehouseSink",
            "type": "DatasetReference",
            "parameters": {
                "tableName": "external_features"
            }
        }
    ],
    "typeProperties": {
        "source": {
            "type": "ParquetSource"
        },
        "sink": {
            "type": "LakehouseTableSink",
            "tableActionOption": "Append"
        }
    }
}

Connector overview for multi-cloud

Source/Target	Fabric Pipeline	Dataflow Gen2	Shortcut	Direct read (Spark)
AWS S3	Yes	Yes	Yes	Via shortcut
AWS Redshift	Yes	Yes	No	Via JDBC
Google BigQuery	Yes	Yes	No	Via JDBC
Google Cloud Storage	Yes	Yes	Yes	Via shortcut
Snowflake	Yes	Yes	No	Via JDBC/connector
Oracle	Yes (via OPDG)	Yes	No	Via JDBC
SAP HANA	Yes	Yes	No	Via JDBC
MongoDB Atlas	Yes	Yes	No	Via connector
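
The connector table can be encoded as a lookup so that the integration method is chosen consistently. The sketch below is illustrative only: `CONNECTOR_MATRIX` copies the shortcut column of the table, while `preferred_method` and its preference order (shortcut first, pipeline when a transformation is needed or no shortcut exists) are assumptions drawn from the guidance in this section.

```python
# Shortcut support per source, copied from the connector overview table.
CONNECTOR_MATRIX = {
    "AWS S3": True,
    "AWS Redshift": False,
    "Google BigQuery": False,
    "Google Cloud Storage": True,
    "Snowflake": False,
    "Oracle": False,
    "SAP HANA": False,
    "MongoDB Atlas": False,
}


def preferred_method(source: str, needs_transformation: bool = False) -> str:
    """Return 'shortcut' or 'pipeline' for a source listed in the table."""
    if source not in CONNECTOR_MATRIX:
        raise KeyError(f"Source not in connector overview: {source}")
    supports_shortcut = CONNECTOR_MATRIX[source]
    if supports_shortcut and not needs_transformation:
        return "shortcut"
    # Every source in the table is reachable from a Fabric pipeline.
    return "pipeline"


for name in ("AWS S3", "Snowflake"):
    print(name, "->", preferred_method(name))
```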

Data Egress Cost Optimization

Understanding egress costs

Cloud platform	Internal egress	Cross-region egress	Internet egress
Azure	Free (same region)	~$0.02/GB	~$0.087/GB
AWS	Free (same AZ)	~$0.01-0.02/GB	~$0.09/GB
GCP	Free (same region)	~$0.01/GB	~$0.08-0.12/GB

Cost optimization strategies

Strategy 1: SHORTCUT CACHING (recommended)
+------------------------------------------+
| OneLake caches files from S3/GCS locally |
| - First read: full egress cost           |
| - Later reads: no egress (cache hit)     |
| - Retention: 1-28 days, configurable     |
| - Max file size for caching: 1 GB        |
+------------------------------------------+

Strategy 2: PERIODIC COPY
+------------------------------------------+
| Copy data at fixed intervals             |
| - Daily/weekly batch copy                |
| - Compressed transfer (Parquet)          |
| - Incremental changes only               |
+------------------------------------------+

Strategy 3: FEDERATED QUERY
+------------------------------------------+
| Spark query against the external source  |
| - Predicate pushdown reduces volume      |
| - Partition pruning minimizes egress     |
| - Use for ad hoc work, not production    |
+------------------------------------------+
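
To compare the three strategies for a given workload, a rough estimate of the data volume that crosses the cloud boundary each month is often enough. The model below is a deliberately simplified sketch: the function name, its parameters, and the formulas (cache misses for caching, daily changed volume for periodic copy, a selectivity fraction for federated queries) are assumptions made for illustration, not measured behavior.

```python
def monthly_egress_gb(
    volume_gb: float,
    reads_per_month: int,
    cache_hit_ratio: float,
    daily_change_gb: float,
    query_selectivity: float,
) -> dict:
    """Estimate GB transferred out of the source cloud per month, per strategy."""
    full_reads_gb = volume_gb * reads_per_month
    return {
        # Only cache misses leave the source cloud.
        "shortcut_caching": full_reads_gb * (1 - cache_hit_ratio),
        # One incremental copy per day, 30 days per month.
        "periodic_copy": daily_change_gb * 30,
        # Pushdown and pruning leave a fraction of each read to transfer.
        "federated_query": full_reads_gb * query_selectivity,
    }


estimate = monthly_egress_gb(
    volume_gb=500,
    reads_per_month=30,
    cache_hit_ratio=0.85,
    daily_change_gb=5,
    query_selectivity=0.05,
)
cheapest = min(estimate, key=estimate.get)
print(estimate, cheapest)
```

Multiplying each figure by the per-GB rate from the egress table gives a comparable monthly cost.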

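Configuring shortcut caching

# Enable caching for the workspace via the REST API.
# NOTE: The endpoint and payload shape are kept as given in this document;
# verify them against the current Fabric REST API reference before use.
# The same setting is available in the portal under Workspace settings > OneLake.
cache_config = {
    "settings": {
        "oneLake": {
            "shortcutCaching": {
                "enabled": True,
                "retentionPeriodInDays": 7  # 1-28 days
            }
        }
    }
}

response = requests.patch(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/settings",
    headers=headers,
    json=cache_config
)
response.raise_for_status()  # fail loudly if the setting was not applied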
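
Two limits from this section determine whether caching helps at all: the retention period must be between 1 and 28 days, and files larger than 1 GB are not cached. The helpers below sketch those checks. The function names are hypothetical, the set of cacheable source types is taken from the shortcut table, and treating 1 GB as 1024^3 bytes is an assumption.

```python
# Assumption: the 1 GB limit is interpreted as 1024**3 bytes.
MAX_CACHED_FILE_BYTES = 1024 ** 3

# Source types marked "Yes" in the caching column of the shortcut table.
CACHEABLE_SOURCES = {"s3", "gcs", "s3_compatible", "on_premises"}


def validate_retention_days(days: int) -> int:
    """Return the retention period if it is within the supported 1-28 day range."""
    if not 1 <= days <= 28:
        raise ValueError(f"Retention must be 1-28 days, got {days}")
    return days


def is_cache_eligible(source_type: str, file_size_bytes: int) -> bool:
    """True if a file from this source type is small enough to be cached."""
    return (
        source_type in CACHEABLE_SOURCES
        and file_size_bytes <= MAX_CACHED_FILE_BYTES
    )


print(is_cache_eligible("s3", 500 * 1024 ** 2))   # 500 MB Parquet file
print(is_cache_eligible("gcs", 2 * 1024 ** 3))    # 2 GB file exceeds the limit
```

Datasets stored as a few very large files will see little benefit from caching; splitting them into files below the limit keeps them cacheable.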

Estimating egress costs

from typing import Optional

NOK_PER_USD = 11  # approximate exchange rate

def estimate_monthly_egress_cost(
    data_volume_gb: float,
    read_frequency_per_month: int,
    cache_hit_ratio: float,
    source_cloud: str,
    cost_per_gb: Optional[float] = None
) -> dict:
    """
    Estimate the monthly egress cost (reported in NOK) for cross-cloud data.
    """
    # Approximate internet egress list prices in USD per GB
    costs = {
        "aws_s3": 0.09,
        "gcp_gcs": 0.12,
        "azure_blob": 0.087
    }

    if cost_per_gb is None:
        cost_per_gb = costs.get(source_cloud, 0.10)

    # Without caching: every read transfers the full volume
    total_reads_gb = data_volume_gb * read_frequency_per_month
    cost_without_cache = total_reads_gb * cost_per_gb

    # With caching: only cache misses incur egress
    cache_misses = total_reads_gb * (1 - cache_hit_ratio)
    cost_with_cache = cache_misses * cost_per_gb

    savings = cost_without_cache - cost_with_cache  # in USD

    return {
        "total_data_read_gb": total_reads_gb,
        "cost_without_cache_nok": round(cost_without_cache * NOK_PER_USD, 2),
        "cost_with_cache_nok": round(cost_with_cache * NOK_PER_USD, 2),
        "monthly_savings_nok": round(savings * NOK_PER_USD, 2),
        "cache_hit_ratio": cache_hit_ratio,
        "recommendation": (
            "Enable caching" if savings > 100  # threshold in USD
            else "Caching yields little benefit"
        )
    }

# Example: 500 GB of data read 30 times per month from AWS
result = estimate_monthly_egress_cost(
    data_volume_gb=500,
    read_frequency_per_month=30,
    cache_hit_ratio=0.85,  # 85% cache hits with 7-day retention
    source_cloud="aws_s3"
)
# Savings: 15,000 GB x 0.85 x $0.09/GB = $1,147.50, about 12,600 NOK/month

Consistency and Synchronization Patterns

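Eventual consistency with shortcuts

Shortcuts provide eventual consistency: changes in the source system become visible on the next read. If the source holds a newer version of a file than the cache, the read is served from the source and the cache is updated.

Timeline:
T0: AWS S3 is updated with new files
T1: Fabric reads via the shortcut -> sees the new files
T2: Repeated reads of unchanged files are served from the cache (if caching is enabled)
T3: A file not read within the retention period is purged from the cache -> the next read goes to S3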
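
The timeline can be made concrete with a toy model of the cache. This is a sketch under stated assumptions, not Fabric's implementation: `ShortcutCacheModel` is a hypothetical class, versions are plain integers, time is counted in whole days, and the model assumes that each read resets the retention window and that a newer version at the source bypasses the cached copy.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    version: int
    last_access_day: int


class ShortcutCacheModel:
    """Toy model: tracks which reads hit the cache and which go to the source."""

    def __init__(self, retention_days: int):
        self.retention_days = retention_days
        self.entries = {}

    def read(self, name: str, source_version: int, day: int) -> str:
        entry = self.entries.get(name)
        expired = (
            entry is not None
            and day - entry.last_access_day > self.retention_days
        )
        stale = entry is not None and entry.version < source_version
        if entry is None or expired or stale:
            # Cache miss: fetch from the source (egress) and store the file.
            self.entries[name] = CachedFile(source_version, day)
            return "source"
        # Cache hit: no egress, and the retention window restarts.
        entry.last_access_day = day
        return "cache"


cache = ShortcutCacheModel(retention_days=7)
events = [
    cache.read("features.parquet", source_version=1, day=0),   # first read
    cache.read("features.parquet", source_version=1, day=3),   # unchanged file
    cache.read("features.parquet", source_version=2, day=4),   # source updated
    cache.read("features.parquet", source_version=2, day=20),  # retention passed
]
print(events)
```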

Change Data Capture (CDC) from multi-cloud

# Watermark-based incremental sync from an external database.
# Assumes a Fabric notebook where `spark` is the active SparkSession.
from pyspark.sql import functions as F

def incremental_sync_from_external(
    source_connection: str,
    source_table: str,
    target_table: str,
    watermark_column: str,
    watermark_table: str = "lakehouse.default.sync_watermarks"
):
    """
    Incremental synchronization from an external database to Fabric.

    Table and column names are interpolated into SQL, so pass trusted values only.
    """
    # 1. Fetch the last watermark; fall back to the epoch on the first run
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {watermark_table}
        (source_table STRING, watermark_value STRING) USING DELTA
    """)
    last_watermark = spark.sql(f"""
        SELECT MAX(watermark_value) as wm
        FROM {watermark_table}
        WHERE source_table = '{source_table}'
    """).collect()[0]["wm"] or "1970-01-01T00:00:00Z"

    # 2. Read incremental changes from the external source.
    #    The subquery needs an alias to be valid as a JDBC "dbtable".
    #    Add "user" and "password" options as required by the source.
    new_data = spark.read \
        .format("jdbc") \
        .option("url", source_connection) \
        .option("dbtable", f"""
            (SELECT * FROM {source_table}
             WHERE {watermark_column} > '{last_watermark}') AS incremental
        """) \
        .load() \
        .cache()  # avoid re-querying the source for each action below

    row_count = new_data.count()
    if row_count == 0:
        print(f"No new changes in {source_table}")
        return

    # 3. Append to the Fabric Lakehouse. Rows updated at the source arrive as
    #    new rows; use a keyed MERGE if the target must hold one row per key.
    new_data.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable(target_table)

    # 4. Update the watermark
    new_watermark = new_data.agg(F.max(watermark_column)).collect()[0][0]
    spark.sql(f"""
        MERGE INTO {watermark_table} AS t
        USING (SELECT '{source_table}' as source_table,
                      '{new_watermark}' as watermark_value) AS s
        ON t.source_table = s.source_table
        WHEN MATCHED THEN UPDATE SET watermark_value = s.watermark_value
        WHEN NOT MATCHED THEN INSERT (source_table, watermark_value)
             VALUES (s.source_table, s.watermark_value)
    """)

    print(f"Synchronized {row_count} rows from {source_table}")

# Synchronize from AWS RDS PostgreSQL
incremental_sync_from_external(
    source_connection="jdbc:postgresql://rds-instance.amazonaws.com:5432/aidata",
    source_table="public.sensor_readings",
    target_table="lakehouse.default.external_sensors",
    watermark_column="updated_at"
)
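
The watermark logic itself is independent of Spark and can be checked with plain Python before it is run against a real database. The sketch below uses in-memory dictionaries as rows; `select_incremental` is a hypothetical helper that mirrors steps 2 and 4 of the function above.

```python
from datetime import datetime


def select_incremental(rows, watermark_column, last_watermark):
    """Return rows newer than the watermark and the watermark for the next run."""
    new_rows = [r for r in rows if r[watermark_column] > last_watermark]
    # Keep the old watermark when nothing new arrived.
    new_watermark = max(
        (r[watermark_column] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark


rows = [
    {"id": 1, "updated_at": datetime(2026, 2, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2026, 2, 1, 9, 30)},
    {"id": 3, "updated_at": datetime(2026, 2, 2, 7, 15)},
]

first_batch, watermark = select_incremental(
    rows, "updated_at", datetime(2026, 2, 1, 8, 0)
)
second_batch, watermark_after = select_incremental(rows, "updated_at", watermark)
print([r["id"] for r in first_batch], watermark, len(second_batch))
```

Because the comparison is strict, a row committed later with a timestamp equal to the stored watermark would be skipped. A common mitigation is to re-read a short overlap window and deduplicate on the primary key.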

Conflict handling for bidirectional sync

Strategy	Description	Use case
Last-write-wins	The most recent change wins	Simple; some data loss is acceptable
Source-of-truth	One source takes priority	One system is the master
Merge	Combine changes intelligently	Complex, but complete
Event sourcing	All changes are stored as events	History is preserved
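
The first two strategies in the table are simple enough to express in a few lines. The sketch below is illustrative: the `Record` type and both function names are hypothetical, and it assumes that every record carries a trustworthy `updated_at` timestamp and the name of the system it came from.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    updated_at: datetime
    system: str


def last_write_wins(a: Record, b: Record) -> Record:
    """Keep the most recently updated record; ties go to the first argument."""
    return a if a.updated_at >= b.updated_at else b


def source_of_truth(a: Record, b: Record, master: str) -> Record:
    """Keep the record that comes from the designated master system."""
    for record in (a, b):
        if record.system == master:
            return record
    raise ValueError(f"Neither record comes from master system '{master}'")


fabric = Record("sensor-42", "21.5", datetime(2026, 2, 1, 10, 0), "fabric")
aws = Record("sensor-42", "22.1", datetime(2026, 2, 1, 10, 5), "aws")

print(last_write_wins(fabric, aws).value)            # newer AWS value
print(source_of_truth(fabric, aws, "fabric").value)  # master always wins
```

Last-write-wins depends on synchronized clocks across systems, which is one reason the table lists data loss as an accepted risk.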

Hybrid Cloud Fallback Mechanisms

On-premises Data Gateway

For access to data behind a firewall or in private networks:

Internet                     On-premises network
+---------+                  +-------------------+
| Fabric  | <--- HTTPS ----> | Data Gateway      |
| Service |    (outbound)    | (Windows agent)   |
+---------+                  |                   |
                             | --> S3-compatible |
                             | --> SQL Server    |
                             | --> File system   |
                             +-------------------+

Important: The gateway initiates outbound connections, so no inbound firewall rules are required.

Fallback architecture

class MultiCloudDataAccess:
    """
    Robust data access with automatic fallback between sources.
    Assumes a Fabric notebook where `spark` is the active SparkSession.
    """

    def __init__(self, primary_source: dict, fallback_sources: list):
        self.primary = primary_source
        self.fallbacks = fallback_sources

    def read_data(self, table_name: str) -> "DataFrame":
        """
        Try the primary source first and fall back to the alternatives on failure.
        Spark reads are lazy, so only errors raised while resolving the source
        (missing table or path, failed connection) trigger the fallback.
        """
        sources = [self.primary] + self.fallbacks
        last_error = None

        for i, source in enumerate(sources):
            try:
                df = self._read_from_source(source, table_name)
            except Exception as e:
                print(f"Error with source '{source['name']}': {e}")
                last_error = e
                continue
            if i > 0:
                print(f"WARNING: Used fallback source #{i}: {source['name']}")
            return df

        raise RuntimeError(f"All sources failed for {table_name}") from last_error

    def _read_from_source(self, source: dict, table_name: str) -> "DataFrame":
        if source["type"] == "lakehouse":
            return spark.table(f"{source['catalog']}.{table_name}")
        elif source["type"] == "s3_shortcut":
            return spark.read.parquet(f"{source['path']}/{table_name}")
        elif source["type"] == "jdbc":
            return spark.read.format("jdbc") \
                .option("url", source["connection"]) \
                .option("dbtable", table_name) \
                .load()
        # Fail explicitly instead of silently returning None
        raise ValueError(f"Unknown source type: {source['type']}")

# Configuration
data_access = MultiCloudDataAccess(
    primary_source={
        "name": "Fabric Lakehouse",
        "type": "lakehouse",
        "catalog": "lakehouse.default"
    },
    fallback_sources=[
        {
            "name": "AWS S3 via shortcut",
            "type": "s3_shortcut",
            # "workspace" and "lakehouse" are placeholder names; name-based
            # OneLake paths include the item type suffix (.Lakehouse)
            "path": "abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse.Lakehouse/Files/external/aws"
        },
        {
            "name": "On-premises SQL Server",
            "type": "jdbc",
            "connection": "jdbc:sqlserver://sql.internal.no:1433;database=AIDatalake"
        }
    ]
)

df = data_access.read_data("training_features")
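
The fallback order can be tested without a Spark session by passing each source as a plain callable. This is a minimal sketch: `read_with_fallback` is a hypothetical helper, and the readers below are stand-ins that simulate an outage rather than real data sources.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def read_with_fallback(
    readers: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Call each reader in order and return (source name, result) of the first success."""
    errors = []
    for name, reader in readers:
        try:
            return name, reader()
        except Exception as exc:
            # Remember why this source failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All sources failed: " + "; ".join(errors))


def unavailable_lakehouse():
    raise ConnectionError("capacity paused")


def s3_shortcut_reader():
    return ["row-1", "row-2"]


used_source, data = read_with_fallback(
    [
        ("Fabric Lakehouse", unavailable_lakehouse),
        ("AWS S3 via shortcut", s3_shortcut_reader),
    ]
)
print(used_source, data)
```

Returning the source name with the data lets callers log or alert when a fallback was used, instead of relying on printed warnings.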

Data Residency and Sovereignty Compliance

Norwegian and European requirements

Requirement	Regulation	Implication for cross-cloud
Data in Norway	Security Act (Sikkerhetsloven), NSM	Sensitive data cannot be stored outside Norway
Data in the EEA	GDPR, Schrems II	Personal data stays in the EEA/EU or is transferred with adequate protection
Transfer mechanisms	GDPR Art. 45-46	Adequacy decisions (Art. 45) or SCCs (Art. 46) for third countries
Sovereignty	National control	Keys and access controlled by Norwegian personnel

Data classification for cross-cloud

# Keys are Norwegian classification labels: HEMMELIG (secret),
# FORTROLIG (confidential), INTERN (internal), OFFENTLIG (public).
data_residency_rules = {
    "HEMMELIG": {
        "allowed_locations": ["Norway East"],
        "cross_cloud": False,
        "encryption": "Customer-managed keys (Norwegian HSM)"
    },
    "FORTROLIG": {
        "allowed_locations": ["Norway East", "Norway West"],
        "cross_cloud": False,
        "encryption": "Customer-managed keys"
    },
    "INTERN": {
        # Illustrative subset of EU/EEA Azure regions; extend as needed
        "allowed_locations": ["Norway East", "Norway West", "Sweden Central",
                              "North Europe", "West Europe"],
        "cross_cloud": True,  # EU/EEA regions only
        "encryption": "Platform-managed keys"
    },
    "OFFENTLIG": {
        "allowed_locations": ["*"],  # any region
        "cross_cloud": True,
        "encryption": "Platform-managed keys"
    }
}

def validate_data_residency(data_classification: str, target_region: str) -> bool:
    """Validate that a data transfer complies with the residency rules.

    target_region is expected as an Azure region display name, e.g. "Norway East".
    """
    rules = data_residency_rules.get(data_classification)
    if not rules:
        return False  # unknown classification: deny by default

    allowed = rules["allowed_locations"]
    return "*" in allowed or target_region in allowed
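
Residency checks also apply to the source side of a shortcut: an S3 bucket outside the EEA should not be attached to a workspace holding EEA-restricted data. The sketch below extracts the AWS region from a virtual-hosted S3 URL of the form used earlier in this document and checks it against an allow-list. The helper names are hypothetical and the region set is an illustrative subset, not a complete list.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Illustrative subset of AWS regions located in the EU/EEA.
# eu-west-2 (London) and eu-central-2 (Zurich) are deliberately absent.
EEA_AWS_REGIONS = {
    "eu-north-1", "eu-west-1", "eu-west-3", "eu-central-1", "eu-south-1",
}

# Matches hosts such as "my-bucket.s3.eu-north-1.amazonaws.com".
_S3_HOST = re.compile(r"\.s3[.-]([a-z]{2}-[a-z]+-\d)\.amazonaws\.com$")


def s3_region_from_location(location: str) -> Optional[str]:
    """Return the AWS region encoded in an S3 URL, or None if it has none."""
    host = urlparse(location).hostname or ""
    match = _S3_HOST.search(host)
    return match.group(1) if match else None


def s3_location_in_eea(location: str) -> bool:
    """True only when the URL names a region on the EEA allow-list."""
    return s3_region_from_location(location) in EEA_AWS_REGIONS


for url in (
    "https://my-bucket.s3.eu-north-1.amazonaws.com",
    "https://my-bucket.s3.us-east-1.amazonaws.com",
    "https://minio.internal.no:9000",
):
    print(url, s3_region_from_location(url), s3_location_in_eea(url))
```

URLs without a recognizable region, such as on-premises endpoints, are rejected by default and need a separate, explicit approval path.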

OneLake regions and data placement

# Verify that the Fabric workspace is in the right region
workspace_info = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}",
    headers=headers
).json()

capacity_region = workspace_info.get("capacityRegion") or ""
print(f"Workspace region: {capacity_region}")

# For Norwegian public sector: require Norway East.
# Normalize so that both "Norway East" and "norwayeast" are accepted.
if capacity_region.replace(" ", "").lower() != "norwayeast":
    raise RuntimeError(
        f"ERROR: Workspace is in '{capacity_region}'; sensitive data requires norwayeast"
    )

References

OneLake shortcuts -- Overview of shortcuts and supported sources
Create an Amazon S3 shortcut -- AWS S3 integration
Create an Amazon S3 compatible shortcut -- S3-compatible sources
Create shortcuts to on-premises data -- On-premises via data gateway
OneLake shortcut security -- Passthrough vs. delegated security
OneLake, the OneDrive for data -- OneLake architecture and one copy of data
Microsoft Fabric integration pathways for ISVs -- Multi-cloud connector overview
External data sharing overview -- Cross-tenant data sharing

For Cosmo

Use this reference when customers have data on several cloud platforms and need to integrate it for AI purposes without copying everything to Azure.
OneLake shortcuts are the primary strategy for cross-cloud data integration. They avoid data duplication, reduce egress costs through caching, and are easier to maintain than ETL pipelines.
Caching is essential for cost control: enable shortcut caching with a suitable retention period (7 days is a good default) to cut egress costs by an estimated 70-90% for data that is read repeatedly.
Data sovereignty first: for the Norwegian public sector, classify data before planning cross-cloud integration. HEMMELIG and FORTROLIG data must never leave the Norway regions.
On-premises data gateway for legacy systems: it uses outbound HTTPS only, so no inbound firewall rules are needed. It supports S3-compatible storage and other sources behind the firewall.
