# Cross-Cloud Data Integration **Last updated:** 2026-02 **Status:** GA **Category:** Data Engineering for AI --- ## Introduksjon Mange organisasjoner opererer i multi-cloud-miljoer der data er spredt mellom Azure, AWS, Google Cloud og on-premises systemer. For AI-losninger som krever data fra flere kilder er det kritisk a ha en effektiv strategi for krysssky-dataintegrasjon. Microsoft Fabric sin OneLake og shortcuts-arkitektur gjor det mulig a virtuelt samle data fra ulike skyplattformer uten fysisk kopiering, noe som reduserer bade egress-kostnader og kompleksitet. OneLake fungerer som et enkelt virtuelt datalake for hele organisasjonen, der shortcuts oppretter referanser til data i Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2 og andre lagringskilder. Med intelligent caching kan Fabric redusere krysssky-datautgifter ved a lagre hyppig brukte filer lokalt i workspacet. For norsk offentlig sektor, der datasuverenitet og datalagring i Norge/EOS er regulert, er krysssky-integrasjon spesielt sensitivt. Fabric sin fleksibilitet med shortcuts og caching gjor det mulig a integrere data fra ulike kilder uten a flytte sensitiv data ut av godkjente lagringsomrader. --- ## Multi-Cloud Connector Strategies ### OneLake Shortcuts som primaerstrategi OneLake shortcuts er den foretrukne mekanismen for krysssky-dataintegrasjon i Fabric: | Kilde | Shortcut-type | Autentisering | Caching | |-------|-------------|---------------|---------| | **Azure Data Lake Gen2** | ADLS shortcut | Service principal / Account key | Nei (samme sky) | | **Amazon S3** | S3 shortcut | IAM Access Key / Secret | Ja (1-28 dager) | | **Google Cloud Storage** | GCS shortcut | Service Account JSON | Ja (1-28 dager) | | **S3-kompatibel** | S3-compatible | Access Key / Secret | Ja (1-28 dager) | | **On-premises** | Via OPDG | On-premises Data Gateway | Ja (1-28 dager) | | **Annen Fabric-tenant** | OneLake shortcut | Data Sharing invitation | Nei | ### Opprette shortcuts til ulike skyplattformer ```python import requests headers = { "Authorization": f"Bearer {access_token}", "Content-Type": "application/json" } # --- AWS S3 Shortcut --- s3_shortcut = { "name": "aws_training_data", "path": "Files/external/aws", "target": { "amazonS3": { "location": "https://my-bucket.s3.eu-north-1.amazonaws.com", "subpath": "/ai-data/training/", "connectionId": "s3-connection-id" } } } # --- Google Cloud Storage Shortcut --- gcs_shortcut = { "name": "gcp_sensor_data", "path": "Files/external/gcp", "target": { "googleCloudStorage": { "location": "https://storage.googleapis.com/my-gcs-bucket", "subpath": "/sensor-readings/", "connectionId": "gcs-connection-id" } } } # --- On-premises via Data Gateway --- onprem_shortcut = { "name": "onprem_legacy_data", "path": "Files/external/onprem", "target": { "amazonS3": { # S3-kompatibel on-prem storage "location": "https://minio.internal.no:9000", "subpath": "/legacy-data/", "connectionId": "onprem-s3-connection-id" } } } # Opprett shortcuts for shortcut in [s3_shortcut, gcs_shortcut, onprem_shortcut]: response = requests.post( f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts", headers=headers, json=shortcut ) print(f"Opprettet shortcut '{shortcut['name']}': {response.status_code}") ``` ### Data Factory Connectors for ETL For scenarier der shortcuts ikke er tilstrekkelig (transformasjon, filtrering, format-konvertering): ```json { "name": "CopyFromAWSToFabric", "type": "Copy", "inputs": [ { "referenceName": "AmazonS3ParquetSource", "type": "DatasetReference", "parameters": { "bucket": "ai-training-data", "prefix": "features/2026/02/" } } ], "outputs": [ { "referenceName": "FabricLakehouseSink", "type": "DatasetReference", "parameters": { "tableName": "external_features" } } ], "typeProperties": { "source": { "type": "ParquetSource" }, "sink": { "type": "LakehouseTableSink", "tableActionOption": "Append" } } } ``` ### Connector-oversikt for multi-cloud | Kilde/Mal | Fabric Pipeline | Dataflow Gen2 | Shortcut | Direktelesing (Spark) | |-----------|----------------|---------------|----------|----------------------| | AWS S3 | Ja | Ja | Ja | Via shortcut | | AWS Redshift | Ja | Ja | Nei | Via JDBC | | Google BigQuery | Ja | Ja | Nei | Via JDBC | | Google Cloud Storage | Ja | Ja | Ja | Via shortcut | | Snowflake | Ja | Ja | Nei | Via JDBC/connector | | Oracle | Ja (via OPDG) | Ja | Nei | Via JDBC | | SAP HANA | Ja | Ja | Nei | Via JDBC | | MongoDB Atlas | Ja | Ja | Nei | Via connector | --- ## Data Egress Cost Optimization ### Forstaa egress-kostnader | Skyplattform | Intern egress | Kryssregion egress | Internet egress | |-------------|--------------|-------------------|----------------| | **Azure** | Gratis (samme region) | ~$0.02/GB | ~$0.087/GB | | **AWS** | Gratis (samme AZ) | ~$0.01-0.02/GB | ~$0.09/GB | | **GCP** | Gratis (samme region) | ~$0.01/GB | ~$0.08-0.12/GB | ### Kostnadsoptimaliseringsstrategier ``` Strategi 1: SHORTCUT CACHING (anbefalt) +------------------------------------------+ | OneLake cacher filer fra S3/GCS lokalt | | - Forste lesing: Full egress-kostnad | | - Paafolgende: Ingen egress (cache hit) | | - Retensjon: 1-28 dager konfigurerbar | | - Maks filstorrelse for cache: 1 GB | +------------------------------------------+ Strategi 2: PERIODISK KOPIERING +------------------------------------------+ | Kopier data pa faste intervaller | | - Daglig/ukentlig batch-kopi | | - Komprimert overfoering (Parquet) | | - Kun inkrementelle endringer | +------------------------------------------+ Strategi 3: FEDERATED QUERY +------------------------------------------+ | Spark foresporsel mot ekstern kilde | | - Pushdown-predikater reduserer volum | | - Partisjonspruning minimerer egress | | - Bruk for ad-hoc, ikke produksjon | +------------------------------------------+ ``` ### Konfigurere shortcut-caching ```python # Aktiver caching for workspace via REST API cache_config = { "settings": { "oneLake": { "shortcutCaching": { "enabled": True, "retentionPeriodInDays": 7 # 1-28 dager } } } } response = requests.patch( f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/settings", headers=headers, json=cache_config ) ``` ### Beregn egress-kostnader ```python def estimate_monthly_egress_cost( data_volume_gb: float, read_frequency_per_month: int, cache_hit_ratio: float, source_cloud: str, cost_per_gb: float = None ) -> dict: """ Estimer maanedlig egress-kostnad for krysssky-data. """ costs = { "aws_s3": 0.09, "gcp_gcs": 0.12, "azure_blob": 0.087 } if cost_per_gb is None: cost_per_gb = costs.get(source_cloud, 0.10) # Uten caching total_reads_gb = data_volume_gb * read_frequency_per_month cost_without_cache = total_reads_gb * cost_per_gb # Med caching cache_misses = total_reads_gb * (1 - cache_hit_ratio) cost_with_cache = cache_misses * cost_per_gb savings = cost_without_cache - cost_with_cache return { "total_data_read_gb": total_reads_gb, "cost_without_cache_nok": round(cost_without_cache * 11, 2), # ~11 NOK/USD "cost_with_cache_nok": round(cost_with_cache * 11, 2), "monthly_savings_nok": round(savings * 11, 2), "cache_hit_ratio": cache_hit_ratio, "recommendation": ( "Aktiver caching" if savings > 100 else "Caching gir liten gevinst" ) } # Eksempel: 500 GB data lest 30 ganger/maaned fra AWS result = estimate_monthly_egress_cost( data_volume_gb=500, read_frequency_per_month=30, cache_hit_ratio=0.85, # 85% cache hit med 7-dagers retensjon source_cloud="aws_s3" ) # Besparelse: ~12,000 NOK/mnd med caching ``` --- ## Consistency and Synchronization Patterns ### Eventual Consistency med Shortcuts Shortcuts gir eventual consistency -- endringer i kildesystemet reflekteres ved neste lesing: ``` Tidslinje: T0: AWS S3 oppdateres med nye filer T1: Fabric leser via shortcut -> ser nye filer T2: Cached versjon brukes (hvis caching er aktivert) T3: Cache utloper -> ny lesing fra S3 ``` ### Change Data Capture (CDC) fra multi-cloud ```python # CDC-moenster for synkronisering fra ekstern database from pyspark.sql import functions as F def incremental_sync_from_external( source_connection: str, source_table: str, target_table: str, watermark_column: str, watermark_table: str = "lakehouse.default.sync_watermarks" ): """ Inkrementell synkronisering fra ekstern database til Fabric. """ # 1. Hent siste watermark try: last_watermark = spark.sql(f""" SELECT MAX(watermark_value) as wm FROM {watermark_table} WHERE source_table = '{source_table}' """).collect()[0]["wm"] except Exception: last_watermark = "1970-01-01T00:00:00Z" # 2. Les inkrementelle endringer fra ekstern kilde new_data = spark.read \ .format("jdbc") \ .option("url", source_connection) \ .option("dbtable", f""" (SELECT * FROM {source_table} WHERE {watermark_column} > '{last_watermark}') """) \ .load() if new_data.count() == 0: print(f"Ingen nye endringer i {source_table}") return # 3. Skriv til Fabric Lakehouse new_data.write \ .format("delta") \ .mode("append") \ .saveAsTable(target_table) # 4. Oppdater watermark new_watermark = new_data.agg(F.max(watermark_column)).collect()[0][0] spark.sql(f""" MERGE INTO {watermark_table} AS t USING (SELECT '{source_table}' as source_table, '{new_watermark}' as watermark_value) AS s ON t.source_table = s.source_table WHEN MATCHED THEN UPDATE SET watermark_value = s.watermark_value WHEN NOT MATCHED THEN INSERT (source_table, watermark_value) VALUES (s.source_table, s.watermark_value) """) print(f"Synkronisert {new_data.count()} rader fra {source_table}") # Synkroniser fra AWS RDS PostgreSQL incremental_sync_from_external( source_connection="jdbc:postgresql://rds-instance.amazonaws.com:5432/aidata", source_table="public.sensor_readings", target_table="lakehouse.default.external_sensors", watermark_column="updated_at" ) ``` ### Konflikthondtering for bi-direksjonell synk | Strategi | Beskrivelse | Bruksomrade | |----------|-------------|-------------| | **Last-write-wins** | Siste endring vinner | Enkel, akseptabel tap | | **Source-of-truth** | En kilde har prioritet | Master i ett system | | **Merge** | Kombiner endringer intelligent | Komplekst, men komplett | | **Event sourcing** | Alle endringer er hendelser | Historikk bevart | --- ## Hybrid Cloud Fallback Mechanisms ### On-premises Data Gateway For tilgang til data bak brannmur eller i private nettverk: ``` Internet On-premises nettverk +--------+ +-------------------+ | Fabric | <-- HTTPS --> | Data Gateway | | Service| (utgoende) | (Windows-agent) | +--------+ | | | --> S3-kompatibel | | --> SQL Server | | --> Filsystem | +-------------------+ ``` **Viktig**: Gateway-en initierer utgaende tilkoblinger -- ingen inngoende regler kreves. ### Fallback-arkitektur ```python class MultiCloudDataAccess: """ Robust datatilgang med automatisk fallback mellom kilder. """ def __init__(self, primary_source: dict, fallback_sources: list): self.primary = primary_source self.fallbacks = fallback_sources def read_data(self, table_name: str) -> "DataFrame": """ Forsok a lese fra primaerkilde, fall tilbake til alternativer ved feil. """ sources = [self.primary] + self.fallbacks for i, source in enumerate(sources): try: df = self._read_from_source(source, table_name) if i > 0: print(f"ADVARSEL: Brukte fallback-kilde #{i}: {source['name']}") return df except Exception as e: print(f"Feil med kilde '{source['name']}': {e}") if i == len(sources) - 1: raise RuntimeError(f"Alle kilder feilet for {table_name}") def _read_from_source(self, source: dict, table_name: str) -> "DataFrame": if source["type"] == "lakehouse": return spark.table(f"{source['catalog']}.{table_name}") elif source["type"] == "s3_shortcut": return spark.read.parquet(f"{source['path']}/{table_name}") elif source["type"] == "jdbc": return spark.read.format("jdbc") \ .option("url", source["connection"]) \ .option("dbtable", table_name) \ .load() # Konfigurasjon data_access = MultiCloudDataAccess( primary_source={ "name": "Fabric Lakehouse", "type": "lakehouse", "catalog": "lakehouse.default" }, fallback_sources=[ { "name": "AWS S3 via shortcut", "type": "s3_shortcut", "path": "abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse/Files/external/aws" }, { "name": "On-premises SQL Server", "type": "jdbc", "connection": "jdbc:sqlserver://sql.internal.no:1433;database=AIDatalake" } ] ) df = data_access.read_data("training_features") ``` --- ## Data Residency and Sovereignty Compliance ### Norske og europeiske krav | Krav | Regulering | Implikasjon for krysssky | |------|-----------|------------------------| | **Data i Norge** | Sikkerhetsloven, NSM | Sensitiv data kan ikke lagres utenfor Norge | | **Data i EOS** | GDPR, Schrems II | Persondata i EOS/EU eller med tilstrekkelig beskyttelse | | **Overforingsmekanismer** | GDPR Art. 46 | SCC, Adequacy decisions for tredjeland | | **Suverenitet** | Nasjonal kontroll | Nokler og tilgang kontrollert av norsk personell | ### Dataklassifisering for krysssky ```python data_residency_rules = { "HEMMELIG": { "allowed_locations": ["Norway East"], "cross_cloud": False, "encryption": "Customer-managed keys (Norwegian HSM)" }, "FORTROLIG": { "allowed_locations": ["Norway East", "Norway West"], "cross_cloud": False, "encryption": "Customer-managed keys" }, "INTERN": { "allowed_locations": ["EU/EEA regions"], "cross_cloud": True, # Kun EU-regioner "encryption": "Platform-managed keys" }, "OFFENTLIG": { "allowed_locations": ["Alle"], "cross_cloud": True, "encryption": "Platform-managed keys" } } def validate_data_residency(data_classification: str, target_region: str) -> bool: """Valider at dataoverfoering overholder residency-krav.""" rules = data_residency_rules.get(data_classification) if not rules: return False if not rules["cross_cloud"]: return target_region in rules["allowed_locations"] return target_region in rules["allowed_locations"] or rules["allowed_locations"] == ["Alle"] ``` ### OneLake-regioner og dataplassering ```python # Sikre at Fabric workspace er i riktig region workspace_info = requests.get( f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}", headers=headers ).json() capacity_region = workspace_info.get("capacityRegion") print(f"Workspace region: {capacity_region}") # For norsk offentlig sektor: Krev Norway East assert capacity_region == "norwayeast", \ f"FEIL: Workspace er i {capacity_region}, krever norwayeast for sensitiv data" ``` --- ## Referanser - [OneLake shortcuts](https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts) -- Oversikt over shortcuts og stottede kilder - [Create an Amazon S3 shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-s3-shortcut) -- AWS S3-integrasjon - [Create an Amazon S3 compatible shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-s3-compatible-shortcut) -- S3-kompatible kilder - [Create shortcuts to on-premises data](https://learn.microsoft.com/en-us/fabric/onelake/create-on-premises-shortcut) -- On-premises via Data Gateway - [OneLake shortcut security](https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcut-security) -- Passthrough vs. delegated security - [OneLake, the OneDrive for data](https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview) -- OneLake-arkitektur og one copy of data - [Microsoft Fabric integration pathways for ISVs](https://learn.microsoft.com/en-us/fabric/cicd/partners/partner-integration) -- Multi-cloud connector-oversikt - [External data sharing overview](https://learn.microsoft.com/en-us/fabric/governance/external-data-sharing-overview) -- Cross-tenant datadeling --- ## For Cosmo - **Bruk denne referansen** naar kunder har data i flere skyplattformer og trenger a integrere dem for AI-formaal uten a kopiere alt til Azure. - **OneLake shortcuts er primaerstrategien** for krysssky-dataintegrasjon. De unngaar dataduplisering, reduserer egress-kostnader med caching, og er enklere a vedlikeholde enn ETL-pipelines. - **Caching er essensielt for kostnader**: Aktiver shortcut-caching med passende retensjon (7 dager er god standard) for a redusere egress-kostnader med 70-90%. - **Datasuverenitet forst**: For norsk offentlig sektor, klassifiser data for du planlegger krysssky-integrasjon. HEMMELIG og FORTROLIG data skal aldri forlate Norge-regioner. - **On-premises Data Gateway** for legacy-systemer: Bruker kun utgaende HTTPS, ingen endringer i brannmurregler noedvendig. Stotter S3-kompatibel lagring og andre kilder bak brannmur.