# Cross-Cloud Data Integration

**Last updated:** 2026-02
**Status:** GA
**Category:** Data Engineering for AI

---

## Introduksjon

Mange organisasjoner opererer i multi-cloud-miljoer der data er spredt mellom Azure, AWS, Google Cloud og on-premises systemer. For AI-losninger som krever data fra flere kilder er det kritisk a ha en effektiv strategi for krysssky-dataintegrasjon. Microsoft Fabric sin OneLake og shortcuts-arkitektur gjor det mulig a virtuelt samle data fra ulike skyplattformer uten fysisk kopiering, noe som reduserer bade egress-kostnader og kompleksitet.

OneLake fungerer som et enkelt virtuelt datalake for hele organisasjonen, der shortcuts oppretter referanser til data i Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2 og andre lagringskilder. Med intelligent caching kan Fabric redusere krysssky-datautgifter ved a lagre hyppig brukte filer lokalt i workspacet.

For norsk offentlig sektor, der datasuverenitet og datalagring i Norge/EOS er regulert, er krysssky-integrasjon spesielt sensitivt. Fabric sin fleksibilitet med shortcuts og caching gjor det mulig a integrere data fra ulike kilder uten a flytte sensitiv data ut av godkjente lagringsomrader.

---

## Multi-Cloud Connector Strategies

### OneLake Shortcuts som primaerstrategi

OneLake shortcuts er den foretrukne mekanismen for krysssky-dataintegrasjon i Fabric:

| Kilde | Shortcut-type | Autentisering | Caching |
|-------|-------------|---------------|---------|
| **Azure Data Lake Gen2** | ADLS shortcut | Service principal / Account key | Nei (samme sky) |
| **Amazon S3** | S3 shortcut | IAM Access Key / Secret | Ja (1-28 dager) |
| **Google Cloud Storage** | GCS shortcut | Service Account JSON | Ja (1-28 dager) |
| **S3-kompatibel** | S3-compatible | Access Key / Secret | Ja (1-28 dager) |
| **On-premises** | Via OPDG | On-premises Data Gateway | Ja (1-28 dager) |
| **Annen Fabric-tenant** | OneLake shortcut | Data Sharing invitation | Nei |

### Opprette shortcuts til ulike skyplattformer

```python
import requests

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

# --- AWS S3 Shortcut ---
s3_shortcut = {
    "name": "aws_training_data",
    "path": "Files/external/aws",
    "target": {
        "amazonS3": {
            "location": "https://my-bucket.s3.eu-north-1.amazonaws.com",
            "subpath": "/ai-data/training/",
            "connectionId": "s3-connection-id"
        }
    }
}

# --- Google Cloud Storage Shortcut ---
gcs_shortcut = {
    "name": "gcp_sensor_data",
    "path": "Files/external/gcp",
    "target": {
        "googleCloudStorage": {
            "location": "https://storage.googleapis.com/my-gcs-bucket",
            "subpath": "/sensor-readings/",
            "connectionId": "gcs-connection-id"
        }
    }
}

# --- On-premises via Data Gateway ---
onprem_shortcut = {
    "name": "onprem_legacy_data",
    "path": "Files/external/onprem",
    "target": {
        "amazonS3": {  # S3-kompatibel on-prem storage
            "location": "https://minio.internal.no:9000",
            "subpath": "/legacy-data/",
            "connectionId": "onprem-s3-connection-id"
        }
    }
}

# Opprett shortcuts
for shortcut in [s3_shortcut, gcs_shortcut, onprem_shortcut]:
    response = requests.post(
        f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts",
        headers=headers,
        json=shortcut
    )
    print(f"Opprettet shortcut '{shortcut['name']}': {response.status_code}")
```

### Data Factory Connectors for ETL

For scenarier der shortcuts ikke er tilstrekkelig (transformasjon, filtrering, format-konvertering):

```json
{
    "name": "CopyFromAWSToFabric",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "AmazonS3ParquetSource",
            "type": "DatasetReference",
            "parameters": {
                "bucket": "ai-training-data",
                "prefix": "features/2026/02/"
            }
        }
    ],
    "outputs": [
        {
            "referenceName": "FabricLakehouseSink",
            "type": "DatasetReference",
            "parameters": {
                "tableName": "external_features"
            }
        }
    ],
    "typeProperties": {
        "source": {
            "type": "ParquetSource"
        },
        "sink": {
            "type": "LakehouseTableSink",
            "tableActionOption": "Append"
        }
    }
}
```

### Connector-oversikt for multi-cloud

| Kilde/Mal | Fabric Pipeline | Dataflow Gen2 | Shortcut | Direktelesing (Spark) |
|-----------|----------------|---------------|----------|----------------------|
| AWS S3 | Ja | Ja | Ja | Via shortcut |
| AWS Redshift | Ja | Ja | Nei | Via JDBC |
| Google BigQuery | Ja | Ja | Nei | Via JDBC |
| Google Cloud Storage | Ja | Ja | Ja | Via shortcut |
| Snowflake | Ja | Ja | Nei | Via JDBC/connector |
| Oracle | Ja (via OPDG) | Ja | Nei | Via JDBC |
| SAP HANA | Ja | Ja | Nei | Via JDBC |
| MongoDB Atlas | Ja | Ja | Nei | Via connector |

---

## Data Egress Cost Optimization

### Forstaa egress-kostnader

| Skyplattform | Intern egress | Kryssregion egress | Internet egress |
|-------------|--------------|-------------------|----------------|
| **Azure** | Gratis (samme region) | ~$0.02/GB | ~$0.087/GB |
| **AWS** | Gratis (samme AZ) | ~$0.01-0.02/GB | ~$0.09/GB |
| **GCP** | Gratis (samme region) | ~$0.01/GB | ~$0.08-0.12/GB |

### Kostnadsoptimaliseringsstrategier

```
Strategi 1: SHORTCUT CACHING (anbefalt)
+------------------------------------------+
| OneLake cacher filer fra S3/GCS lokalt  |
| - Forste lesing: Full egress-kostnad    |
| - Paafolgende: Ingen egress (cache hit) |
| - Retensjon: 1-28 dager konfigurerbar  |
| - Maks filstorrelse for cache: 1 GB    |
+------------------------------------------+

Strategi 2: PERIODISK KOPIERING
+------------------------------------------+
| Kopier data pa faste intervaller        |
| - Daglig/ukentlig batch-kopi            |
| - Komprimert overfoering (Parquet)      |
| - Kun inkrementelle endringer           |
+------------------------------------------+

Strategi 3: FEDERATED QUERY
+------------------------------------------+
| Spark foresporsel mot ekstern kilde     |
| - Pushdown-predikater reduserer volum   |
| - Partisjonspruning minimerer egress    |
| - Bruk for ad-hoc, ikke produksjon      |
+------------------------------------------+
```

### Konfigurere shortcut-caching

```python
# Aktiver caching for workspace via REST API
cache_config = {
    "settings": {
        "oneLake": {
            "shortcutCaching": {
                "enabled": True,
                "retentionPeriodInDays": 7  # 1-28 dager
            }
        }
    }
}

response = requests.patch(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/settings",
    headers=headers,
    json=cache_config
)
```

### Beregn egress-kostnader

```python
def estimate_monthly_egress_cost(
    data_volume_gb: float,
    read_frequency_per_month: int,
    cache_hit_ratio: float,
    source_cloud: str,
    cost_per_gb: float = None
) -> dict:
    """
    Estimer maanedlig egress-kostnad for krysssky-data.
    """
    costs = {
        "aws_s3": 0.09,
        "gcp_gcs": 0.12,
        "azure_blob": 0.087
    }

    if cost_per_gb is None:
        cost_per_gb = costs.get(source_cloud, 0.10)

    # Uten caching
    total_reads_gb = data_volume_gb * read_frequency_per_month
    cost_without_cache = total_reads_gb * cost_per_gb

    # Med caching
    cache_misses = total_reads_gb * (1 - cache_hit_ratio)
    cost_with_cache = cache_misses * cost_per_gb

    savings = cost_without_cache - cost_with_cache

    return {
        "total_data_read_gb": total_reads_gb,
        "cost_without_cache_nok": round(cost_without_cache * 11, 2),  # ~11 NOK/USD
        "cost_with_cache_nok": round(cost_with_cache * 11, 2),
        "monthly_savings_nok": round(savings * 11, 2),
        "cache_hit_ratio": cache_hit_ratio,
        "recommendation": (
            "Aktiver caching" if savings > 100
            else "Caching gir liten gevinst"
        )
    }

# Eksempel: 500 GB data lest 30 ganger/maaned fra AWS
result = estimate_monthly_egress_cost(
    data_volume_gb=500,
    read_frequency_per_month=30,
    cache_hit_ratio=0.85,  # 85% cache hit med 7-dagers retensjon
    source_cloud="aws_s3"
)
# Besparelse: ~12,000 NOK/mnd med caching
```

---

## Consistency and Synchronization Patterns

### Eventual Consistency med Shortcuts

Shortcuts gir eventual consistency -- endringer i kildesystemet reflekteres ved neste lesing:

```
Tidslinje:
T0: AWS S3 oppdateres med nye filer
T1: Fabric leser via shortcut -> ser nye filer
T2: Cached versjon brukes (hvis caching er aktivert)
T3: Cache utloper -> ny lesing fra S3
```

### Change Data Capture (CDC) fra multi-cloud

```python
# CDC-moenster for synkronisering fra ekstern database
from pyspark.sql import functions as F

def incremental_sync_from_external(
    source_connection: str,
    source_table: str,
    target_table: str,
    watermark_column: str,
    watermark_table: str = "lakehouse.default.sync_watermarks"
):
    """
    Inkrementell synkronisering fra ekstern database til Fabric.
    """
    # 1. Hent siste watermark
    try:
        last_watermark = spark.sql(f"""
            SELECT MAX(watermark_value) as wm
            FROM {watermark_table}
            WHERE source_table = '{source_table}'
        """).collect()[0]["wm"]
    except Exception:
        last_watermark = "1970-01-01T00:00:00Z"

    # 2. Les inkrementelle endringer fra ekstern kilde
    new_data = spark.read \
        .format("jdbc") \
        .option("url", source_connection) \
        .option("dbtable", f"""
            (SELECT * FROM {source_table}
             WHERE {watermark_column} > '{last_watermark}')
        """) \
        .load()

    if new_data.count() == 0:
        print(f"Ingen nye endringer i {source_table}")
        return

    # 3. Skriv til Fabric Lakehouse
    new_data.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable(target_table)

    # 4. Oppdater watermark
    new_watermark = new_data.agg(F.max(watermark_column)).collect()[0][0]
    spark.sql(f"""
        MERGE INTO {watermark_table} AS t
        USING (SELECT '{source_table}' as source_table,
                      '{new_watermark}' as watermark_value) AS s
        ON t.source_table = s.source_table
        WHEN MATCHED THEN UPDATE SET watermark_value = s.watermark_value
        WHEN NOT MATCHED THEN INSERT (source_table, watermark_value)
             VALUES (s.source_table, s.watermark_value)
    """)

    print(f"Synkronisert {new_data.count()} rader fra {source_table}")

# Synkroniser fra AWS RDS PostgreSQL
incremental_sync_from_external(
    source_connection="jdbc:postgresql://rds-instance.amazonaws.com:5432/aidata",
    source_table="public.sensor_readings",
    target_table="lakehouse.default.external_sensors",
    watermark_column="updated_at"
)
```

### Konflikthondtering for bi-direksjonell synk

| Strategi | Beskrivelse | Bruksomrade |
|----------|-------------|-------------|
| **Last-write-wins** | Siste endring vinner | Enkel, akseptabel tap |
| **Source-of-truth** | En kilde har prioritet | Master i ett system |
| **Merge** | Kombiner endringer intelligent | Komplekst, men komplett |
| **Event sourcing** | Alle endringer er hendelser | Historikk bevart |

---

## Hybrid Cloud Fallback Mechanisms

### On-premises Data Gateway

For tilgang til data bak brannmur eller i private nettverk:

```
Internet                     On-premises nettverk
+--------+                   +-------------------+
| Fabric | <-- HTTPS --> | Data Gateway     |
| Service|   (utgoende)    | (Windows-agent)  |
+--------+                   |                   |
                             | --> S3-kompatibel |
                             | --> SQL Server    |
                             | --> Filsystem     |
                             +-------------------+
```

**Viktig**: Gateway-en initierer utgaende tilkoblinger -- ingen inngoende regler kreves.

### Fallback-arkitektur

```python
class MultiCloudDataAccess:
    """
    Robust datatilgang med automatisk fallback mellom kilder.
    """

    def __init__(self, primary_source: dict, fallback_sources: list):
        self.primary = primary_source
        self.fallbacks = fallback_sources

    def read_data(self, table_name: str) -> "DataFrame":
        """
        Forsok a lese fra primaerkilde, fall tilbake til alternativer ved feil.
        """
        sources = [self.primary] + self.fallbacks

        for i, source in enumerate(sources):
            try:
                df = self._read_from_source(source, table_name)
                if i > 0:
                    print(f"ADVARSEL: Brukte fallback-kilde #{i}: {source['name']}")
                return df
            except Exception as e:
                print(f"Feil med kilde '{source['name']}': {e}")
                if i == len(sources) - 1:
                    raise RuntimeError(f"Alle kilder feilet for {table_name}")

    def _read_from_source(self, source: dict, table_name: str) -> "DataFrame":
        if source["type"] == "lakehouse":
            return spark.table(f"{source['catalog']}.{table_name}")
        elif source["type"] == "s3_shortcut":
            return spark.read.parquet(f"{source['path']}/{table_name}")
        elif source["type"] == "jdbc":
            return spark.read.format("jdbc") \
                .option("url", source["connection"]) \
                .option("dbtable", table_name) \
                .load()

# Konfigurasjon
data_access = MultiCloudDataAccess(
    primary_source={
        "name": "Fabric Lakehouse",
        "type": "lakehouse",
        "catalog": "lakehouse.default"
    },
    fallback_sources=[
        {
            "name": "AWS S3 via shortcut",
            "type": "s3_shortcut",
            "path": "abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse/Files/external/aws"
        },
        {
            "name": "On-premises SQL Server",
            "type": "jdbc",
            "connection": "jdbc:sqlserver://sql.internal.no:1433;database=AIDatalake"
        }
    ]
)

df = data_access.read_data("training_features")
```

---

## Data Residency and Sovereignty Compliance

### Norske og europeiske krav

| Krav | Regulering | Implikasjon for krysssky |
|------|-----------|------------------------|
| **Data i Norge** | Sikkerhetsloven, NSM | Sensitiv data kan ikke lagres utenfor Norge |
| **Data i EOS** | GDPR, Schrems II | Persondata i EOS/EU eller med tilstrekkelig beskyttelse |
| **Overforingsmekanismer** | GDPR Art. 46 | SCC, Adequacy decisions for tredjeland |
| **Suverenitet** | Nasjonal kontroll | Nokler og tilgang kontrollert av norsk personell |

### Dataklassifisering for krysssky

```python
data_residency_rules = {
    "HEMMELIG": {
        "allowed_locations": ["Norway East"],
        "cross_cloud": False,
        "encryption": "Customer-managed keys (Norwegian HSM)"
    },
    "FORTROLIG": {
        "allowed_locations": ["Norway East", "Norway West"],
        "cross_cloud": False,
        "encryption": "Customer-managed keys"
    },
    "INTERN": {
        "allowed_locations": ["EU/EEA regions"],
        "cross_cloud": True,  # Kun EU-regioner
        "encryption": "Platform-managed keys"
    },
    "OFFENTLIG": {
        "allowed_locations": ["Alle"],
        "cross_cloud": True,
        "encryption": "Platform-managed keys"
    }
}

def validate_data_residency(data_classification: str, target_region: str) -> bool:
    """Valider at dataoverfoering overholder residency-krav."""
    rules = data_residency_rules.get(data_classification)
    if not rules:
        return False

    if not rules["cross_cloud"]:
        return target_region in rules["allowed_locations"]

    return target_region in rules["allowed_locations"] or rules["allowed_locations"] == ["Alle"]
```

### OneLake-regioner og dataplassering

```python
# Sikre at Fabric workspace er i riktig region
workspace_info = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}",
    headers=headers
).json()

capacity_region = workspace_info.get("capacityRegion")
print(f"Workspace region: {capacity_region}")

# For norsk offentlig sektor: Krev Norway East
assert capacity_region == "norwayeast", \
    f"FEIL: Workspace er i {capacity_region}, krever norwayeast for sensitiv data"
```

---

## Referanser

- [OneLake shortcuts](https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts) -- Oversikt over shortcuts og stottede kilder
- [Create an Amazon S3 shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-s3-shortcut) -- AWS S3-integrasjon
- [Create an Amazon S3 compatible shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-s3-compatible-shortcut) -- S3-kompatible kilder
- [Create shortcuts to on-premises data](https://learn.microsoft.com/en-us/fabric/onelake/create-on-premises-shortcut) -- On-premises via Data Gateway
- [OneLake shortcut security](https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcut-security) -- Passthrough vs. delegated security
- [OneLake, the OneDrive for data](https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview) -- OneLake-arkitektur og one copy of data
- [Microsoft Fabric integration pathways for ISVs](https://learn.microsoft.com/en-us/fabric/cicd/partners/partner-integration) -- Multi-cloud connector-oversikt
- [External data sharing overview](https://learn.microsoft.com/en-us/fabric/governance/external-data-sharing-overview) -- Cross-tenant datadeling

---

## For Cosmo

- **Bruk denne referansen** naar kunder har data i flere skyplattformer og trenger a integrere dem for AI-formaal uten a kopiere alt til Azure.
- **OneLake shortcuts er primaerstrategien** for krysssky-dataintegrasjon. De unngaar dataduplisering, reduserer egress-kostnader med caching, og er enklere a vedlikeholde enn ETL-pipelines.
- **Caching er essensielt for kostnader**: Aktiver shortcut-caching med passende retensjon (7 dager er god standard) for a redusere egress-kostnader med 70-90%.
- **Datasuverenitet forst**: For norsk offentlig sektor, klassifiser data for du planlegger krysssky-integrasjon. HEMMELIG og FORTROLIG data skal aldri forlate Norge-regioner.
- **On-premises Data Gateway** for legacy-systemer: Bruker kun utgaende HTTPS, ingen endringer i brannmurregler noedvendig. Stotter S3-kompatibel lagring og andre kilder bak brannmur.