feat(ms-ai-architect): sitemap-based KB change detection system

Adds a zero-dependency Node.js pipeline that polls Microsoft Learn sitemaps
weekly to detect when source documentation changes. Replaces the broken
mtime-based staleness check (all files had identical mtime after release).

Components:
- build-registry.mjs: extracts 1342 URLs from 387 reference files
- poll-sitemaps.mjs: streams ~18 child sitemaps, matches against registry
- report-changes.mjs: prioritized change report (critical/high/medium/low)
- discover-new-urls.mjs: finds relevant new MS Learn pages not yet covered
- run-weekly-update.mjs: orchestrator with --force/--discover/--dry-run

Integration:
- session-start hook reads change-report.json instead of broken mtime check
- hook triggers background poll if >7 days since last check
- generate-skills --update reads change report for targeted MCP updates

Current stats: 69% match rate (924/1342 URLs tracked via sitemaps).
~31% unmatched due to Microsoft URL restructuring (ai-foundry/openai paths).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kjell Tore Guttormsen, 2026-04-09 21:19:51 +02:00
commit f968f37be3
13 changed files with 976 additions and 59 deletions


@@ -20,3 +20,4 @@ node_modules/
# Runtime state
.work/
org/
scripts/kb-update/data/


@@ -121,13 +121,28 @@ See `references/architecture/recommended-mcp-servers.md` for details.
bash tests/validate-plugin.sh
```
#### KB freshness
#### KB freshness (sitemap-based)
```bash
# Check stale knowledge files
bash scripts/kb-staleness-check.sh
# Weekly update: poll sitemaps → change report
node scripts/kb-update/run-weekly-update.mjs --force
# Show only prioritized stale files
bash scripts/kb-staleness-check.sh --priority-only
# With discovery of new relevant pages
node scripts/kb-update/run-weekly-update.mjs --force --discover
# Change report only (after polling)
node scripts/kb-update/report-changes.mjs
# Build/update the URL registry from reference files
node scripts/kb-update/build-registry.mjs [--merge]
```
The system polls Microsoft Learn sitemaps weekly, compares `<lastmod>` against each file's `Last updated:` header, and generates a prioritized change report. The session-start hook triggers a background poll automatically if more than 7 days have passed since the last one.
**Match rate:** ~69% of 1342 URLs (924) match against the sitemaps. The remaining ~31% (mostly `azure/ai-foundry/openai/` paths) are missing from the sitemaps due to Microsoft's URL restructuring.
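The comparison at the heart of the report is plain date logic. A minimal sketch, with an illustrative function name rather than the actual module API:

```javascript
// Illustrative sketch: a file is flagged when a source URL's sitemap <lastmod>
// is newer than the file's "Last updated:" header. Not the actual module code.
function isStale(fileLastUpdated, sitemapLastmod) {
  if (!sitemapLastmod) return false; // URL absent from sitemaps: cannot judge
  if (!fileLastUpdated) return true; // no header found: treat as always stale
  return new Date(sitemapLastmod) > new Date(fileLastUpdated);
}

console.log(isStale('2025-11-01', '2026-01-15')); // true: source changed after last update
console.log(isStale('2026-02-01', '2026-01-15')); // false: file already newer
```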
Legacy (deprecated):
```bash
bash scripts/kb-staleness-check.sh # mtime-based, unreliable after git clone
```
#### E2E regression tests


@@ -487,29 +487,37 @@ bash tests/capture-fixture.sh <source-file> <section-header> <output-dir>
### Knowledge Base Maintenance
The plugin includes a systematic process for keeping reference documents current. See `docs/kb-update-policy.md` for the full policy (update frequencies per domain, procedures, quality gates).
The plugin includes a sitemap-based change detection system that tracks when Microsoft Learn source pages are updated. This replaces the previous mtime-based staleness check.
**Staleness checking:**
**Automated change detection (sitemap-based):**
```bash
# Human-readable report
bash scripts/kb-staleness-check.sh
# Weekly update: poll sitemaps → compare → generate change report
node scripts/kb-update/run-weekly-update.mjs --force
# Machine-readable JSON output
bash scripts/kb-staleness-check.sh --json
# Include discovery of new relevant pages
node scripts/kb-update/run-weekly-update.mjs --force --discover
# Write report to file
bash scripts/kb-staleness-check.sh --json --output report.json
# View change report only (after polling)
node scripts/kb-update/report-changes.mjs
```
**Knowledge base regeneration:**
The session-start hook automatically triggers a background poll if >7 days since the last check.
**How it works:**
1. `build-registry.mjs` extracts 1342 unique `learn.microsoft.com` URLs from reference files
2. `poll-sitemaps.mjs` fetches Microsoft Learn sitemaps and compares `<lastmod>` dates
3. `report-changes.mjs` generates a prioritized list of files needing update
4. `discover-new-urls.mjs` finds relevant new pages not yet covered
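The four steps above run in sequence from the orchestrator. A hedged sketch of how `run-weekly-update.mjs` could plan them; the actual flag handling in this commit may differ:

```javascript
// Hypothetical sketch of the orchestrator's step planning; the real
// run-weekly-update.mjs may structure this differently.
function planSteps(flags) {
  const steps = [
    'build-registry.mjs',   // 1. extract URLs from reference files
    'poll-sitemaps.mjs',    // 2. fetch sitemaps, compare <lastmod>
    'report-changes.mjs',   // 3. prioritized change report
  ];
  if (flags.includes('--discover')) {
    steps.push('discover-new-urls.mjs'); // 4. find relevant uncovered pages
  }
  return steps;
}

console.log(planSteps(['--force', '--discover']).length); // 4
```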
**Knowledge base update:**
```bash
# Incremental update based on change report (targets changed sources via MCP)
/architect:generate-skills --update
# Full regeneration via MCP research
/architect:generate-skills
# Incremental update (Edit existing files instead of rewriting)
/architect:generate-skills --update
```
Category-to-skill routing is defined in `scripts/skill-gen/category-skill-map.json` (20 categories mapped to 5 skills), used by the generate-skills workflow to place new reference documents in the correct skill directory.
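The routing lookup itself is a simple map from category to skill directory. A sketch under the assumption that the JSON is a flat category-to-skill object; the real `category-skill-map.json` format may differ:

```javascript
// Assumed shape of category-skill-map.json: a flat { category: skill } object.
// Three of the 20 categories shown; the directory layout below is also assumed.
const categorySkillMap = {
  'azure-ai-services': 'ms-ai-engineering',
  'responsible-ai': 'ms-ai-governance',
  'copilot-extensibility': 'ms-ai-advisor',
};

// Resolve the reference directory a new document should land in.
function targetDir(category) {
  const skill = categorySkillMap[category];
  return skill ? `skills/${skill}/references/${category}/` : null;
}

console.log(targetDir('responsible-ai')); // skills/ms-ai-governance/references/responsible-ai/
```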


@@ -234,7 +234,9 @@ When invoked with `--update`, the command updates existing stale files instead o
**Workflow:**
1. Run `bash scripts/kb-staleness-check.sh --json` to identify stale files
1. Read `scripts/kb-update/data/change-report.json` for source-aware change detection
- If not available, fall back to `bash scripts/kb-staleness-check.sh --json`
- The change report contains `changed_urls` per file — use these for targeted MCP fetches
2. Sort by priority (Critical > High > Medium > Low)
3. For each stale file, dispatch an update agent with this prompt:
@@ -247,10 +249,14 @@ Update the file: {FILE_PATH}
## Existing content (read first)
Read the file with the Read tool. Preserve the structure.
## Changed source URLs (fetch these first)
{changed_urls from change-report.json — if available}
## Step 1: Research
Use MCP tools to verify and update:
1. microsoft_docs_search — 2-3 searches for the latest updates
2. microsoft_docs_fetch — read updated documentation
1. microsoft_docs_fetch — fetch the changed source URLs directly (if available)
2. microsoft_docs_search — 2-3 searches for the latest updates
3. microsoft_docs_fetch — read further updated documentation as needed
## Step 2: Update with Edit
Use the Edit tool (NOT Write) to:
@@ -277,7 +283,9 @@ status: success|no_changes|failed
Before generating new knowledge base content, check for stale files:
1. Run `bash scripts/kb-staleness-check.sh` to identify stale files
1. Read `scripts/kb-update/data/change-report.json` for source-aware staleness data
- This is generated by `node scripts/kb-update/run-weekly-update.mjs` (polls Microsoft Learn sitemaps)
- Fallback: `bash scripts/kb-staleness-check.sh` (mtime-based, less accurate)
2. Prioritize regeneration of stale files by priority (Critical > High > Medium > Low)
3. When regenerating a file, update its `Sist oppdatert:` header to today's date
4. After regeneration, verify the file with the staleness checker
3. When regenerating a file, update its `Last updated:` header to today's date
4. After update, run `node scripts/kb-update/build-registry.mjs --merge` to refresh URL registry


@@ -3,8 +3,9 @@
// Shows active utredning sessions and KB staleness on session start.
// Output: plain text to stdout (advisory, never blocking).
import { readdirSync, statSync, existsSync } from 'node:fs';
import { readdirSync, readFileSync, existsSync } from 'node:fs';
import { join, relative } from 'node:path';
import { spawn } from 'node:child_process';
const pluginRoot = process.env.CLAUDE_PLUGIN_ROOT || process.cwd();
const cwd = process.cwd();
@@ -40,23 +41,36 @@ if (existsSync(docsDir)) {
}
}
// --- 2. Check KB staleness (stat mtime, no content reading) ---
const staleLevels = { critical: 0, high: 0, medium: 0 };
// --- 2. Check KB staleness (from sitemap-based change report) ---
const now = Date.now();
const DAY_MS = 24 * 60 * 60 * 1000;
const staleLevels = { critical: 0, high: 0, medium: 0 };
let lastPollDaysAgo = Infinity;
const skillsDir = join(pluginRoot, 'skills');
if (existsSync(skillsDir)) {
const changeReportPath = join(pluginRoot, 'scripts', 'kb-update', 'data', 'change-report.json');
if (existsSync(changeReportPath)) {
try {
const skillDirs = readdirSync(skillsDir, { withFileTypes: true });
for (const skill of skillDirs) {
if (!skill.isDirectory()) continue;
const refsDir = join(skillsDir, skill.name, 'references');
if (!existsSync(refsDir)) continue;
countStaleFiles(refsDir, staleLevels, now);
const report = JSON.parse(readFileSync(changeReportPath, 'utf8'));
staleLevels.critical = report.by_priority?.critical || 0;
staleLevels.high = report.by_priority?.high || 0;
staleLevels.medium = report.by_priority?.medium || 0;
if (report.last_poll) {
lastPollDaysAgo = (now - new Date(report.last_poll).getTime()) / DAY_MS;
}
} catch {
// Ignore
// Ignore — fall back to showing no data
}
}
// Trigger background poll if >7 days since last check
if (lastPollDaysAgo > 7) {
const updateScript = join(pluginRoot, 'scripts', 'kb-update', 'run-weekly-update.mjs');
if (existsSync(updateScript)) {
try {
spawn('node', [updateScript], { detached: true, stdio: 'ignore' }).unref();
} catch {
// Non-critical — silent fail
}
}
}
@@ -117,7 +131,10 @@ if (staleLevels.high > 0) staleEntries.push(`${staleLevels.high} high`);
if (staleLevels.medium > 0) staleEntries.push(`${staleLevels.medium} medium`);
if (staleEntries.length > 0) {
parts.push(`KB stale: ${staleEntries.join(', ')}`);
const pollAge = lastPollDaysAgo < Infinity ? ` (polled ${Math.floor(lastPollDaysAgo)}d ago)` : '';
parts.push(`KB: ${staleEntries.join(', ')} needs update${pollAge}`);
} else if (lastPollDaysAgo > 7) {
parts.push('KB: poll overdue');
}
if (nearestDeadline) {
@@ -154,26 +171,3 @@ function countFiles(dir, filename) {
return count;
}
function countStaleFiles(dir, levels, now) {
try {
const entries = readdirSync(dir, { withFileTypes: true });
for (const entry of entries) {
const fullPath = join(dir, entry.name);
if (entry.isDirectory()) {
countStaleFiles(fullPath, levels, now);
} else if (entry.name.endsWith('.md')) {
try {
const mtime = statSync(fullPath).mtimeMs;
const ageDays = (now - mtime) / DAY_MS;
if (ageDays > 180) levels.critical++;
else if (ageDays > 90) levels.high++;
else if (ageDays > 60) levels.medium++;
} catch {
// Skip unreadable files
}
}
}
} catch {
// Ignore
}
}


@@ -0,0 +1,87 @@
#!/usr/bin/env node
// build-registry.mjs — Build URL registry from existing reference files.
// Extracts all learn.microsoft.com URLs and maps them to their source reference files.
// Usage: node build-registry.mjs [--merge]
// --merge: preserve existing sitemap_lastmod data, only add new URLs
import { readdirSync, readFileSync, existsSync } from 'node:fs';
import { join, relative, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { normalizeUrl, extractUrls } from './lib/url-normalize.mjs';
import { loadRegistry, saveRegistry } from './lib/registry-io.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const PLUGIN_ROOT = join(__dirname, '..', '..');
const SKILLS_DIR = join(PLUGIN_ROOT, 'skills');
const merge = process.argv.includes('--merge');
// Walk directory recursively for .md files
function walkMd(dir) {
const results = [];
if (!existsSync(dir)) return results;
for (const entry of readdirSync(dir, { withFileTypes: true })) {
const full = join(dir, entry.name);
if (entry.isDirectory()) {
results.push(...walkMd(full));
} else if (entry.name.endsWith('.md') && entry.name !== 'SKILL.md') {
results.push(full);
}
}
return results;
}
// --- Main ---
const existing = merge ? loadRegistry() : null;
const urlToFiles = new Map(); // normalizedUrl → Set<relativePath>
let totalFiles = 0;
const skillDirs = readdirSync(SKILLS_DIR, { withFileTypes: true })
.filter(d => d.isDirectory())
.map(d => d.name);
for (const skill of skillDirs) {
const refsDir = join(SKILLS_DIR, skill, 'references');
const files = walkMd(refsDir);
for (const file of files) {
totalFiles++;
const content = readFileSync(file, 'utf8');
const urls = extractUrls(content);
const relPath = relative(PLUGIN_ROOT, file);
for (const url of urls) {
if (!urlToFiles.has(url)) urlToFiles.set(url, new Set());
urlToFiles.get(url).add(relPath);
}
}
}
// Build registry
const today = new Date().toISOString().split('T')[0];
const registry = {
version: 1,
created_at: today,
last_poll: merge ? existing?.last_poll || null : null,
sitemap_state: merge ? existing?.sitemap_state || {} : {},
urls: {},
};
for (const [url, files] of urlToFiles) {
const prev = merge ? existing?.urls?.[url] : null;
registry.urls[url] = {
sitemap_lastmod: prev?.sitemap_lastmod || null,
reference_files: [...files].sort(),
status: prev?.status || 'unpolled',
};
}
saveRegistry(registry);
// Stats
const multiRef = [...urlToFiles.values()].filter(s => s.size > 1).length;
console.log(`Registry built: ${urlToFiles.size} unique URLs from ${totalFiles} files`);
console.log(` URLs referenced by multiple files: ${multiRef}`);
if (merge && existing?.urls) {
const newUrls = [...urlToFiles.keys()].filter(u => !existing.urls[u]).length;
console.log(` New URLs added (merge): ${newUrls}`);
}


@@ -0,0 +1,168 @@
#!/usr/bin/env node
// discover-new-urls.mjs — Find relevant Microsoft Learn pages not yet in the registry.
// Scans sitemaps for URLs matching relevance patterns, suggests skill/category mapping.
// Usage: node discover-new-urls.mjs [--limit N]
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';
import { normalizeUrl } from './lib/url-normalize.mjs';
import { loadRegistry, saveReport } from './lib/registry-io.mjs';
import { streamSitemap, fetchSitemapIndex } from './lib/sitemap-stream.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const DATA_DIR = join(__dirname, 'data');
const limitArg = process.argv.indexOf('--limit');
const limit = limitArg !== -1 ? parseInt(process.argv[limitArg + 1], 10) : Infinity;
// Relevance patterns — URL paths that indicate content this plugin should cover
const INCLUDE = [
{ pattern: /\/azure\/ai-foundry\//, skill: 'ms-ai-engineering', category: 'azure-ai-services' },
{ pattern: /\/azure\/ai-services\//, skill: 'ms-ai-engineering', category: 'azure-ai-services' },
{ pattern: /\/azure\/machine-learning\//, skill: 'ms-ai-engineering', category: 'mlops-genaiops' },
{ pattern: /\/azure\/search\//, skill: 'ms-ai-engineering', category: 'rag-architecture' },
{ pattern: /\/azure\/api-management\//, skill: 'ms-ai-engineering', category: 'api-management' },
{ pattern: /\/azure\/azure-monitor\//, skill: 'ms-ai-governance', category: 'monitoring-observability' },
{ pattern: /\/azure\/well-architected\//, skill: 'ms-ai-advisor', category: 'architecture' },
{ pattern: /\/microsoft-copilot-studio\//, skill: 'ms-ai-advisor', category: 'copilot-extensibility' },
{ pattern: /\/copilot\/microsoft-365\//, skill: 'ms-ai-advisor', category: 'copilot-extensibility' },
{ pattern: /\/microsoft-365-copilot\//, skill: 'ms-ai-advisor', category: 'copilot-extensibility' },
{ pattern: /\/security\/(?:ai|benchmark|engineering)\//, skill: 'ms-ai-security', category: 'ai-security-engineering' },
{ pattern: /\/azure\/defender-for-cloud\//, skill: 'ms-ai-security', category: 'ai-security-engineering' },
{ pattern: /\/purview\//, skill: 'ms-ai-governance', category: 'responsible-ai' },
{ pattern: /\/semantic-kernel\//, skill: 'ms-ai-engineering', category: 'agent-orchestration' },
{ pattern: /\/agent-framework\//, skill: 'ms-ai-engineering', category: 'agent-orchestration' },
{ pattern: /\/fabric\/(?:data-engineering|data-science|real-time-intelligence)\//, skill: 'ms-ai-engineering', category: 'data-engineering' },
{ pattern: /\/azure\/cosmos-db\//, skill: 'ms-ai-engineering', category: 'data-engineering' },
{ pattern: /\/azure\/databricks\//, skill: 'ms-ai-engineering', category: 'data-engineering' },
{ pattern: /\/entra\/(?:identity|agent)\//, skill: 'ms-ai-security', category: 'ai-security-engineering' },
];
// Exclude patterns — skip even if they match INCLUDE
const EXCLUDE = [
/\/training\//,
/\/credentials\//,
/\/legal\//,
/\/previous-versions\//,
/\/archive\//,
/\/samples\//,
/\/release-notes?\//,
/\/whats-new/,
/\/changelog/,
/\/migrate\//,
];
function classifyUrl(url) {
if (EXCLUDE.some(p => p.test(url))) return null;
for (const rule of INCLUDE) {
if (rule.pattern.test(url)) {
return { skill: rule.skill, category: rule.category };
}
}
return null;
}
// Target sitemaps for discovery — a subset of the poller's list (dotnet and
// other low-yield prefixes omitted)
const TARGET_PREFIXES = [
'azure_en-us_',
'microsoft-copilot-studio_en-us_',
'security_en-us_',
'fabric_en-us_',
'power-platform_en-us_',
'ai_en-us_',
'copilot_en-us_',
'compliance_en-us_',
'agent-framework_en-us_',
'semantic-kernel_en-us_',
'entra_en-us_',
'purview_en-us_',
];
function extractChildName(loc) {
const match = loc.match(/\/_sitemaps\/([^/]+)\.xml$/);
return match ? match[1] : null;
}
// --- Main ---
async function main() {
const registry = loadRegistry(DATA_DIR);
const knownUrls = new Set(Object.keys(registry.urls));
console.log(`Registry: ${knownUrls.size} known URLs`);
console.log('Fetching sitemap index...');
const indexEntries = await fetchSitemapIndex();
const targetChildren = indexEntries
.filter(e => {
const name = extractChildName(e.loc);
return name && TARGET_PREFIXES.some(p => name.startsWith(p));
});
console.log(`Scanning ${targetChildren.length} sitemaps for new relevant URLs...`);
const candidates = [];
const bySkill = {};
for (const child of targetChildren) {
const childName = extractChildName(child.loc);
let foundInChild = 0;
try {
for await (const entry of streamSitemap(child.loc)) {
const normalized = normalizeUrl(entry.loc);
if (!normalized || knownUrls.has(normalized)) continue;
const classification = classifyUrl(normalized);
if (!classification) continue;
candidates.push({
url: normalized,
lastmod: entry.lastmod,
sitemap: childName,
suggested_skill: classification.skill,
suggested_category: classification.category,
});
knownUrls.add(normalized); // Prevent dupes across sitemaps
bySkill[classification.skill] = (bySkill[classification.skill] || 0) + 1;
foundInChild++;
if (candidates.length >= limit) break;
}
} catch (err) {
console.error(` ERROR scanning ${childName}: ${err.message}`);
}
if (foundInChild > 0) {
console.log(` ${childName}: ${foundInChild} new candidates`);
}
if (candidates.length >= limit) break;
}
// Sort by lastmod descending (newest first)
candidates.sort((a, b) => (b.lastmod || '').localeCompare(a.lastmod || ''));
const report = {
generated_at: new Date().toISOString().split('T')[0],
new_candidates: candidates.length,
by_suggested_skill: bySkill,
candidates,
};
saveReport('discovery-report.json', report, DATA_DIR);
console.log(`\n=== Discovery Report ===`);
console.log(`New relevant URLs found: ${candidates.length}`);
console.log('By skill:', JSON.stringify(bySkill, null, 2));
if (candidates.length > 0) {
console.log('\nNewest 10:');
for (const c of candidates.slice(0, 10)) {
console.log(` [${c.suggested_skill}/${c.suggested_category}] ${c.url}`);
console.log(` lastmod: ${c.lastmod}`);
}
}
}
main().catch(err => {
console.error('Fatal error:', err.message);
process.exit(1);
});


@@ -0,0 +1,75 @@
// registry-io.mjs — Atomic read/write for url-registry.json and report files.
// Zero dependencies. Uses rename() for atomic writes.
import { readFileSync, writeFileSync, renameSync, existsSync, mkdirSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
const __dirname = dirname(fileURLToPath(import.meta.url));
const DEFAULT_DATA_DIR = join(__dirname, '..', 'data');
/**
* Load the URL registry from disk.
* @param {string} [dataDir] defaults to ../data/ relative to lib/
* @returns {object} parsed registry or empty scaffold
*/
export function loadRegistry(dataDir = DEFAULT_DATA_DIR) {
const path = join(dataDir, 'url-registry.json');
if (!existsSync(path)) {
return {
version: 1,
created_at: null,
last_poll: null,
sitemap_state: {},
urls: {},
};
}
return JSON.parse(readFileSync(path, 'utf8'));
}
/**
* Save the URL registry atomically (write to .tmp, then rename).
* @param {object} registry
* @param {string} [dataDir]
*/
export function saveRegistry(registry, dataDir = DEFAULT_DATA_DIR) {
ensureDir(dataDir);
const path = join(dataDir, 'url-registry.json');
const tmp = path + '.tmp';
writeFileSync(tmp, JSON.stringify(registry, null, 2) + '\n', 'utf8');
renameSync(tmp, path);
}
/**
* Load a JSON report file (change-report.json or discovery-report.json).
* @param {string} name filename without path (e.g. 'change-report.json')
* @param {string} [dataDir]
* @returns {object|null} parsed JSON or null if not found
*/
export function loadReport(name, dataDir = DEFAULT_DATA_DIR) {
const path = join(dataDir, name);
if (!existsSync(path)) return null;
try {
return JSON.parse(readFileSync(path, 'utf8'));
} catch {
return null;
}
}
/**
* Save a JSON report file atomically.
* @param {string} name
* @param {object} data
* @param {string} [dataDir]
*/
export function saveReport(name, data, dataDir = DEFAULT_DATA_DIR) {
ensureDir(dataDir);
const path = join(dataDir, name);
const tmp = path + '.tmp';
writeFileSync(tmp, JSON.stringify(data, null, 2) + '\n', 'utf8');
renameSync(tmp, path);
}
function ensureDir(dir) {
if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
}


@@ -0,0 +1,124 @@
// sitemap-stream.mjs — XML parser for Microsoft Learn sitemaps.
// Zero dependencies. Buffers each sitemap (child sitemaps run 24-47MB) and
// yields { loc, lastmod } per <url> entry via an async generator.
import { get as httpsGet } from 'node:https';
import { createGunzip } from 'node:zlib';
const MAX_RETRIES = 3;
const RETRY_DELAY_MS = 2000;
/**
* Stream a sitemap XML file and yield { loc, lastmod } for each <url> entry.
* Works with both the sitemap index (<sitemapindex>) and child sitemaps (<urlset>).
* @param {string} url full HTTPS URL to sitemap XML
* @yields {{ loc: string, lastmod: string|null }}
*/
export async function* streamSitemap(url) {
const xml = await fetchWithRetry(url);
yield* parseSitemapEntries(xml);
}
/**
* Fetch a sitemap fully into a buffer (most child sitemaps are 24-47MB).
* For the index (612KB) this is trivial. For children, we buffer to allow
* the generator to yield entries without backpressure issues.
* @param {string} url
* @param {number} [attempt]
* @returns {Promise<string>}
*/
function fetchWithRetry(url, attempt = 1) {
return new Promise((resolve, reject) => {
httpsGet(url, (res) => {
if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
return fetchWithRetry(res.headers.location, attempt).then(resolve, reject);
}
if (res.statusCode !== 200) {
res.resume();
const err = new Error(`HTTP ${res.statusCode} for ${url}`);
if (attempt < MAX_RETRIES) {
return setTimeout(() => fetchWithRetry(url, attempt + 1).then(resolve, reject), RETRY_DELAY_MS * attempt);
}
return reject(err);
}
const stream = res.headers['content-encoding'] === 'gzip'
? res.pipe(createGunzip())
: res;
const parts = [];
stream.on('data', (chunk) => parts.push(chunk.toString('utf8')));
stream.on('end', () => resolve(parts.join('')));
stream.on('error', (err) => {
if (attempt < MAX_RETRIES) {
setTimeout(() => fetchWithRetry(url, attempt + 1).then(resolve, reject), RETRY_DELAY_MS * attempt);
} else {
reject(err);
}
});
}).on('error', (err) => {
if (attempt < MAX_RETRIES) {
setTimeout(() => fetchWithRetry(url, attempt + 1).then(resolve, reject), RETRY_DELAY_MS * attempt);
} else {
reject(err);
}
});
});
}
/**
* Parse sitemap XML text and yield entries.
* Handles both <sitemapindex> (yields loc from <sitemap> blocks)
* and <urlset> (yields loc+lastmod from <url> blocks).
 * Uses simple regex extraction, which is reliable for well-formed sitemap XML.
* @param {string} xml
* @yields {{ loc: string, lastmod: string|null }}
*/
function* parseSitemapEntries(xml) {
// Detect if this is a sitemap index or a urlset
const isSitemapIndex = xml.includes('<sitemapindex');
if (isSitemapIndex) {
// Parse <sitemap> blocks: <loc>...</loc> and <lastmod>...</lastmod>
const sitemapRegex = /<sitemap>\s*<loc>([^<]+)<\/loc>(?:\s*<lastmod>([^<]+)<\/lastmod>)?/g;
let match;
while ((match = sitemapRegex.exec(xml)) !== null) {
yield { loc: match[1].trim(), lastmod: match[2]?.trim() || null };
}
} else {
// Parse <url> blocks — extract <loc> and <lastmod>
// The XML structure per entry is:
// <url><loc>...</loc><lastmod>...</lastmod><xhtml:link .../>...</url>
// We use a two-pass approach: find each <url>...</url> block, then extract fields
const urlBlockRegex = /<url>([\s\S]*?)<\/url>/g;
const locRegex = /<loc>([^<]+)<\/loc>/;
const lastmodRegex = /<lastmod>([^<]+)<\/lastmod>/;
let match;
while ((match = urlBlockRegex.exec(xml)) !== null) {
const block = match[1];
const locMatch = locRegex.exec(block);
if (!locMatch) continue;
const lastmodMatch = lastmodRegex.exec(block);
yield {
loc: locMatch[1].trim(),
lastmod: lastmodMatch ? lastmodMatch[1].trim() : null,
};
}
}
}
/**
* Fetch the sitemap index and return parsed entries.
* Convenience wrapper for the common "fetch index, decide which children to poll" pattern.
* @param {string} [indexUrl]
* @returns {Promise<Array<{ loc: string, lastmod: string|null }>>}
*/
export async function fetchSitemapIndex(indexUrl = 'https://learn.microsoft.com/_sitemaps/sitemapindex.xml') {
const entries = [];
for await (const entry of streamSitemap(indexUrl)) {
entries.push(entry);
}
return entries;
}


@@ -0,0 +1,69 @@
// url-normalize.mjs — Consistent URL normalization for sitemap ↔ reference file matching.
// Zero dependencies. Idempotent: normalizeUrl(normalizeUrl(x)) === normalizeUrl(x).
/**
* Normalize a learn.microsoft.com URL to a canonical form.
 * Rules applied in order:
 * 1. Strip trailing punctuation leaked from markdown
 * 2. Strip fragment (#anchor)
 * 3. Strip query params (?view=, ?tabs=, etc.)
 * 4. Remove /en-us/ locale prefix (store locale-free)
 * 5. Strip trailing slash
 * 6. Lowercase
* @param {string} raw
* @returns {string|null} normalized URL, or null if not a learn.microsoft.com URL
*/
export function normalizeUrl(raw) {
if (!raw || typeof raw !== 'string') return null;
if (!raw.includes('learn.microsoft.com')) return null;
let url = raw;
// 1. Strip trailing punctuation that leaked from markdown context
url = url.replace(/[.,;:!?'")}\]]+$/, '');
// 2. Strip fragment
const hashIdx = url.indexOf('#');
if (hashIdx !== -1) url = url.slice(0, hashIdx);
// 3. Strip query parameters (?view=, ?tabs=, etc.)
const qIdx = url.indexOf('?');
if (qIdx !== -1) url = url.slice(0, qIdx);
// 4. Remove /en-us/ locale prefix — store locale-free for consistent matching
url = url.replace('://learn.microsoft.com/en-us/', '://learn.microsoft.com/');
// 5. Strip trailing slash for consistency
url = url.replace(/\/+$/, '');
// 6. Lowercase
url = url.toLowerCase();
return url;
}
/**
* Extract all learn.microsoft.com URLs from a text string.
* Handles all 5 citation formats found in reference files:
* - Markdown links: [text](https://learn.microsoft.com/...)
* - Bare URLs on their own line
* - URL: prefix format
* - Dash-bullet format
* - Table cell format
* @param {string} text
* @returns {string[]} array of normalized unique URLs
*/
export function extractUrls(text) {
if (!text) return [];
const regex = /https:\/\/learn\.microsoft\.com[^\s)"'<>\]|]+/g;
const seen = new Set();
const results = [];
let match;
while ((match = regex.exec(text)) !== null) {
const normalized = normalizeUrl(match[0]);
if (normalized && !seen.has(normalized)) {
seen.add(normalized);
results.push(normalized);
}
}
return results;
}


@@ -0,0 +1,154 @@
#!/usr/bin/env node
// poll-sitemaps.mjs — Poll Microsoft Learn sitemaps for lastmod changes.
// Updates url-registry.json with current sitemap_lastmod values.
// Usage: node poll-sitemaps.mjs [--force] [--verbose]
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';
import { normalizeUrl } from './lib/url-normalize.mjs';
import { loadRegistry, saveRegistry } from './lib/registry-io.mjs';
import { fetchSitemapIndex, streamSitemap } from './lib/sitemap-stream.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const DATA_DIR = join(__dirname, 'data');
const force = process.argv.includes('--force');
const verbose = process.argv.includes('--verbose');
// Target child sitemaps — covers all URL path prefixes in the registry.
// Derived from analyzing which sitemaps contain our 1342 tracked URLs.
const TARGET_PREFIXES = [
'azure_en-us_',
'microsoft-copilot-studio_en-us_',
'security_en-us_',
'fabric_en-us_',
'power-platform_en-us_',
'ai_en-us_',
'copilot_en-us_',
'compliance_en-us_',
'agent-framework_en-us_',
'semantic-kernel_en-us_',
'entra_en-us_',
'purview_en-us_',
'microsoftteams_en-us_',
'sharepoint_en-us_',
'microsoft-365_en-us_',
'training_en-us_',
'cloud-computing_en-us_',
'privacy_en-us_',
  // dotnet_en-us_ excluded: 75 sitemaps, only 12 matches. Not worth weekly polling.
  // Add the prefix back here if those URLs need tracking (--force does not re-enable it).
];
function extractChildName(loc) {
// https://learn.microsoft.com/_sitemaps/azure_en-us_7.xml → azure_en-us_7
const match = loc.match(/\/_sitemaps\/([^/]+)\.xml$/);
return match ? match[1] : null;
}
function isTargetChild(childName) {
return TARGET_PREFIXES.some(p => childName.startsWith(p));
}
// --- Main ---
async function main() {
const registry = loadRegistry(DATA_DIR);
const urlIndex = new Map();
for (const [url, entry] of Object.entries(registry.urls)) {
urlIndex.set(url, entry);
}
console.log(`Registry loaded: ${urlIndex.size} URLs`);
// Step 1: Fetch sitemap index
console.log('Fetching sitemap index...');
const indexEntries = await fetchSitemapIndex();
console.log(`Sitemap index: ${indexEntries.length} child sitemaps found`);
// Step 2: Filter to target children
const targetChildren = indexEntries
.map(e => ({ ...e, name: extractChildName(e.loc) }))
.filter(e => e.name && isTargetChild(e.name));
console.log(`Target children: ${targetChildren.length}`);
let totalMatched = 0;
let totalUpdated = 0;
let childrenPolled = 0;
let childrenSkipped = 0;
// Step 3: Poll each child
for (const child of targetChildren) {
const prevState = registry.sitemap_state[child.name];
// Skip if unchanged since last check (unless --force)
if (!force && prevState && prevState.lastmod === child.lastmod) {
if (verbose) console.log(` Skipping ${child.name} (unchanged since ${child.lastmod})`);
childrenSkipped++;
continue;
}
console.log(` Polling ${child.name} (lastmod: ${child.lastmod})...`);
childrenPolled++;
let matchedInChild = 0;
let updatedInChild = 0;
try {
for await (const entry of streamSitemap(child.loc)) {
const normalized = normalizeUrl(entry.loc);
if (!normalized) continue;
const registryEntry = urlIndex.get(normalized);
if (!registryEntry) continue;
matchedInChild++;
// Update lastmod if changed
if (entry.lastmod && registryEntry.sitemap_lastmod !== entry.lastmod) {
registryEntry.sitemap_lastmod = entry.lastmod;
updatedInChild++;
}
registryEntry.status = 'tracked';
}
} catch (err) {
console.error(` ERROR polling ${child.name}: ${err.message}`);
continue;
}
console.log(` Matched: ${matchedInChild}, Updated: ${updatedInChild}`);
totalMatched += matchedInChild;
totalUpdated += updatedInChild;
// Record child state
registry.sitemap_state[child.name] = {
lastmod: child.lastmod,
checked_at: new Date().toISOString(),
};
}
// Step 4: Mark remaining unpolled URLs
let notInSitemap = 0;
for (const entry of Object.values(registry.urls)) {
if (entry.status === 'unpolled') {
entry.status = 'not_in_sitemap';
notInSitemap++;
}
}
// Step 5: Save
registry.last_poll = new Date().toISOString();
saveRegistry(registry, DATA_DIR);
// Summary
const tracked = Object.values(registry.urls).filter(u => u.status === 'tracked').length;
console.log('\n=== Poll Summary ===');
console.log(`Children polled: ${childrenPolled}, skipped: ${childrenSkipped}`);
console.log(`URLs matched: ${totalMatched}, lastmod updated: ${totalUpdated}`);
console.log(`Registry: ${tracked} tracked, ${notInSitemap} not in sitemap`);
console.log(`Match rate: ${((tracked / urlIndex.size) * 100).toFixed(1)}%`);
}
main().catch(err => {
console.error('Fatal error:', err.message);
process.exit(1);
});


@@ -0,0 +1,144 @@
#!/usr/bin/env node
// report-changes.mjs — Compare sitemap lastmod to reference file "Last updated:" headers.
// Generates change-report.json and prints human-readable summary.
// Usage: node report-changes.mjs [--json]
import { readFileSync, existsSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { loadRegistry, saveReport } from './lib/registry-io.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const PLUGIN_ROOT = join(__dirname, '..', '..');
const DATA_DIR = join(__dirname, 'data');
const jsonOnly = process.argv.includes('--json');
// Priority classification by file path patterns
function getFilePriority(filePath) {
const lower = filePath.toLowerCase();
if (/cost|pricing|pris/.test(lower)) return 'critical';
if (/responsible-ai|governance|ai-security-(?:engineering|scoring)/.test(lower)) return 'high';
if (/platforms|copilot|azure-ai-services|agent-orchestration|rag|mlops|prompt-engineering|monitoring|performance/.test(lower)) return 'medium';
return 'low';
}
// Parse "Last updated:" header from a reference file
const LAST_UPDATED_PATTERNS = [
/\*\*Last updated:\*\*\s*([\d-]+)/i,
/\*\*Sist (?:oppdatert|verifisert):\*\*\s*([\d-]+)/i,
/\*\*Dato:\*\*\s*([\d-]+)/i,
];
function parseLastUpdated(filePath) {
const fullPath = join(PLUGIN_ROOT, filePath);
if (!existsSync(fullPath)) return null;
// Inspect only the first 500 characters; the date header sits at the top of the file
const content = readFileSync(fullPath, 'utf8').slice(0, 500);
for (const pattern of LAST_UPDATED_PATTERNS) {
const match = content.match(pattern);
if (match) {
const raw = match[1].trim();
// YYYY-MM → YYYY-MM-01, YYYY-MM-DD → as-is
return raw.length === 7 ? raw + '-01' : raw;
}
}
return null; // No date found — treat as always stale
}
// Priority sort order
const PRIORITY_ORDER = { critical: 0, high: 1, medium: 2, low: 3 };
// --- Main ---
const registry = loadRegistry(DATA_DIR);
if (!registry.last_poll) {
console.error('Registry has not been polled yet. Run poll-sitemaps.mjs first.');
process.exit(1);
}
// Group changed URLs by reference file
const fileChanges = new Map(); // filePath → { changedUrls, newestChange }
for (const [url, entry] of Object.entries(registry.urls)) {
if (!entry.sitemap_lastmod || entry.status !== 'tracked') continue;
for (const refFile of entry.reference_files) {
const fileDate = parseLastUpdated(refFile);
// If no date found, treat as always stale (date "0000-01-01")
const effectiveDate = fileDate || '0000-01-01';
if (entry.sitemap_lastmod > effectiveDate) {
if (!fileChanges.has(refFile)) {
fileChanges.set(refFile, { changedUrls: [], newestChange: entry.sitemap_lastmod, fileDate });
}
const fc = fileChanges.get(refFile);
fc.changedUrls.push({ url, sitemap_lastmod: entry.sitemap_lastmod });
if (entry.sitemap_lastmod > fc.newestChange) {
fc.newestChange = entry.sitemap_lastmod;
}
}
}
}
// Build report entries
const files = [];
for (const [path, changes] of fileChanges) {
const priority = getFilePriority(path);
const pathParts = path.split('/');
files.push({
path,
priority,
file_last_updated: changes.fileDate || 'unknown',
newest_source_change: changes.newestChange,
changed_url_count: changes.changedUrls.length,
changed_urls: changes.changedUrls.map(u => u.url),
// Path layout assumed: skills/<skill>/references/<category>/<file>
skill: pathParts[1] || 'unknown',
category: pathParts[3] || 'unknown',
});
}
// Sort: priority first, then newest source change descending
files.sort((a, b) => {
const pDiff = PRIORITY_ORDER[a.priority] - PRIORITY_ORDER[b.priority];
if (pDiff !== 0) return pDiff;
return b.newest_source_change.localeCompare(a.newest_source_change);
});
// Count by priority
const byPriority = { critical: 0, high: 0, medium: 0, low: 0 };
for (const f of files) byPriority[f.priority]++;
const report = {
generated_at: new Date().toISOString().split('T')[0],
last_poll: registry.last_poll,
total_tracked: Object.values(registry.urls).filter(u => u.status === 'tracked').length,
total_not_in_sitemap: Object.values(registry.urls).filter(u => u.status === 'not_in_sitemap').length,
total_files_needing_update: files.length,
by_priority: byPriority,
files,
};
saveReport('change-report.json', report, DATA_DIR);
if (jsonOnly) {
process.stdout.write(JSON.stringify(report, null, 2) + '\n');
} else {
console.log(`\n=== KB Change Report (${report.generated_at}) ===`);
console.log(`Sources last polled: ${registry.last_poll}`);
console.log(`URLs tracked: ${report.total_tracked}/${Object.keys(registry.urls).length} (${report.total_not_in_sitemap} not in sitemap)`);
console.log(`Files needing update: ${files.length} (Critical: ${byPriority.critical}, High: ${byPriority.high}, Medium: ${byPriority.medium}, Low: ${byPriority.low})`);
if (files.length > 0) {
console.log('\nTop 20 by priority:');
for (const f of files.slice(0, 20)) {
console.log(` [${f.priority.toUpperCase()}] ${f.path}`);
console.log(` ${f.changed_url_count} source(s) changed. Latest: ${f.newest_source_change}. File: ${f.file_last_updated}`);
}
if (files.length > 20) {
console.log(` ... and ${files.length - 20} more`);
}
}
console.log('\nRun: /architect:generate-skills --update to process updates');
}

@@ -0,0 +1,70 @@
#!/usr/bin/env node
// run-weekly-update.mjs — Orchestrator for weekly KB update pipeline.
// Runs: poll → report → (optional) discover, sequentially.
// Usage: node run-weekly-update.mjs [--force] [--discover] [--dry-run]
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';
import { execFileSync } from 'node:child_process';
import { loadRegistry } from './lib/registry-io.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const DATA_DIR = join(__dirname, 'data');
const force = process.argv.includes('--force');
const discover = process.argv.includes('--discover');
const dryRun = process.argv.includes('--dry-run');
const DAY_MS = 24 * 60 * 60 * 1000;
function run(script, args = []) {
const fullPath = join(__dirname, script);
console.log(`\n--- Running ${script} ${args.join(' ')} ---`);
try {
execFileSync('node', [fullPath, ...args], {
stdio: 'inherit',
timeout: 10 * 60 * 1000, // 10 min max per step
});
} catch (err) {
console.error(`${script} failed: ${err.message}`);
process.exit(1);
}
}
// --- Main ---
const registry = loadRegistry(DATA_DIR);
const lastPoll = registry.last_poll ? new Date(registry.last_poll) : null;
const daysSince = lastPoll ? (Date.now() - lastPoll.getTime()) / DAY_MS : Infinity;
if (!force && daysSince < 7) {
console.log(`Last poll: ${Math.floor(daysSince)} day(s) ago. Next in ${Math.ceil(7 - daysSince)} day(s).`);
console.log('Use --force to run anyway.');
process.exit(0);
}
if (dryRun) {
console.log('DRY RUN — would execute:');
console.log(' 1. poll-sitemaps.mjs' + (force ? ' --force' : ''));
console.log(' 2. report-changes.mjs');
if (discover) console.log(' 3. discover-new-urls.mjs');
process.exit(0);
}
// Ensure registry exists
if (Object.keys(registry.urls).length === 0) {
console.log('Registry empty — building from reference files first...');
run('build-registry.mjs');
}
// Step 1: Poll sitemaps
run('poll-sitemaps.mjs', force ? ['--force'] : []);
// Step 2: Generate change report
run('report-changes.mjs');
// Step 3: Optional discovery
if (discover) {
run('discover-new-urls.mjs', ['--limit', '500']);
}
console.log('\n=== Weekly update complete ===');