N8R Platform Architecture

From monolith to multi-service

Date: 2026-04-14
Spec: docs/specs/platform-architecture.md
Status: Design document — for team review

Why we're splitting

The current n8r-v2 is a single Cloudflare Worker with a D1 database.

This was right for proving the pipeline. It's wrong for production.

Problem | Impact
D1 single-region writes | 250-350ms passport lookups from distant PoPs
Workers can't hold connections | SSE/WebSocket impossible — Console is broken
Single deployment | Bug in Console can break the pipeline
Single repo | Contributors see everything — no access control
No observability | Events are in-memory only, no persistence

The target: three tiers

              TENANT TRAFFIC
                   │
             ┌─────▼──────┐
             │ Cloudflare │  ← ONLY public surface
             │    Edge    │    tenants know about
             └──┬───────┬─┘
                │       │
       ┌────────▼───┐ ┌─▼──────────┐
       │  Workers   │ │  Fly.io    │  ← no public DNS
       │ (inject-   │ │  (SaaS +   │    no public IP
       │  orators)  │ │  Platform) │
       └────────────┘ └─────┬──────┘
                            │
                      ┌─────▼──────┐
                      │  Postgres  │  ← Fly.io private
                      └────────────┘    network only

               N8R STAFF → WireGuard VPN → Fly.io Ops

Service inventory

Service | Tier | Runtime | Purpose
Gateway | Edge | CF Worker | Route traffic, validate API keys, rate limit
Injectionator | Edge | CF Worker | Pipeline execution (inspectors, destinations)
SaaS Console | Core | Fly.io + Hono | Web console, config management, dashboards
Platform API | Core | Fly.io + Hono | Tenant management, data APIs
Queue Consumer | Core | Fly.io | Process edge telemetry
Ops UI | Core | Fly.io + Hono | Staff-only management

SaaS, Platform API, Queue Consumer, and Ops UI are one Fly.io app with clear module boundaries. Extract later if needed.

Security: Cloudflare is the mask

Tenants never learn Fly.io addresses. Never resolve backend hostnames. Never see internal routing.

Cloudflare is the only public surface.

  • api.n8r.io → Cloudflare IP → Gateway Worker
  • Fly.io is an origin with no public DNS
  • Staff access via WireGuard VPN — completely separate path

Four trust boundaries

Boundary | Transport | Auth
Tenant → Edge | TLS | Per-tenant API key (rotatable, scoped)
Edge → Core | HTTPS origin fetch | Shared origin secret header
Edge internal | CF service bindings | Deploy-time trust (same account)
Staff → Core | WireGuard VPN | VPN cert + app-level auth

What never crosses a boundary

  • Postgres credentials → never reach edge or tenants
  • Origin secret → never reach tenants
  • Fly.io hostnames → never in public DNS
  • Tenant API keys (cleartext) → gateway strips before forwarding
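The stripping rule can be sketched as a header rewrite at the gateway. A minimal sketch, assuming hypothetical header names (x-n8r-origin-secret, x-n8r-tenant) — the actual names are not fixed by this document:

```typescript
// Edge → core header rewrite at the gateway.
// Header names here are illustrative, not from the spec.
function toOriginHeaders(
  incoming: Headers,
  tenantId: string,
  originSecret: string,
): Headers {
  const headers = new Headers(incoming);
  // Tenant API key (cleartext) never crosses the edge → core boundary.
  headers.delete("authorization");
  // Edge → core trust: shared origin secret header, checked by the Fly.io app.
  headers.set("x-n8r-origin-secret", originSecret);
  // Tenant identity travels as a verified claim, not as a key.
  headers.set("x-n8r-tenant", tenantId);
  return headers;
}
```

The core only needs to check one secret per deployment; tenant identity arrives as a claim the gateway has already validated.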

Data layer

One source of truth, four storage services

Store | Location | Role
Postgres | Fly.io | Source of truth — tenants, configs, passports, tickets, audit
KV | CF edge | Read cache — configs, passport snapshots (sub-ms)
Queues | Cloudflare | Write buffer — telemetry, ticket events, passport deltas
R2 | Cloudflare | Cold storage — archived telemetry, logs, config versions

No D1. Avoids split-brain. KV + Queues give edge performance without a second database.

Data flow

Read path (fast, every request)

Injectionator → KV (sub-ms, edge)
  → hit? return cached data
  → miss? fetch from SaaS API → populate KV with TTL
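The read path is a standard read-through cache. A minimal sketch; the KvLike interface and fetchFromCore callback are stand-ins (Workers KV exposes get/put with an expirationTtl option in a similar shape):

```typescript
// Read path: KV read-through with TTL, falling back to the SaaS API on a miss.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

async function getTenantConfig(
  kv: KvLike,
  tenantId: string,
  fetchFromCore: (tenantId: string) => Promise<string>,
  ttlSeconds = 60,
): Promise<string> {
  const key = `config:${tenantId}`;
  const cached = await kv.get(key);            // sub-ms at the edge
  if (cached !== null) return cached;          // hit: serve cached config
  const fresh = await fetchFromCore(tenantId); // miss: one trip to the SaaS API
  await kv.put(key, fresh, { expirationTtl: ttlSeconds }); // repopulate with TTL
  return fresh;
}
```

The TTL bounds how stale an edge config can get; shortening it trades more origin fetches for fresher data.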

Write path (async, non-blocking)

Pipeline completes → enqueue via waitUntil():
  • ticket summary
  • passport trust delta
  • inspector telemetry
→ Worker responds immediately
→ Queue consumer on Fly.io → Postgres
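A sketch of the enqueue step. The QueueLike interface and event payloads are illustrative; in a Worker, the call to this function would be wrapped in ctx.waitUntil(...) so the response returns before the sends complete:

```typescript
// Write path: buffer pipeline results for the Fly.io consumer → Postgres.
interface QueueLike<T> {
  send(message: T): Promise<void>;
}

interface PipelineResult {
  ticketId: string;
  tenantId: string;
  verdict: "allow" | "block";
}

async function flushPipelineEvents(
  queue: QueueLike<object>,
  result: PipelineResult,
): Promise<void> {
  // Three async writes; none block the tenant-facing response.
  await queue.send({ event: "ticket.completed", ticketId: result.ticketId, tenantId: result.tenantId });
  await queue.send({ event: "passport.trust.delta", tenantId: result.tenantId, verdict: result.verdict });
  await queue.send({ event: "inspector.telemetry", ticketId: result.ticketId });
}
```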

Resilience

If Fly.io is briefly unreachable, injectionators keep running on cached KV data. Writes buffer in Queues.

Observability: six telemetry sources

Source | Example events
Origin stations (CLI, WebSocket, SDK) | origin.connected, origin.latency
Integrated apps (HR-Chatbot, IT-Helpdesk) | app.conversation.started
Injectionators | ticket.completed, verdict.block
Gateway | gateway.request, gateway.blocked
SaaS | console.config.updated
Platform | platform.retention.rolled

All events follow one structured format:

{ "timestamp", "level", "service", "tenantId", "traceId", "event", "data" }

traceId propagates across all services for end-to-end correlation.
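As a type, the envelope above might look like this. The field names follow the format listed; the builder function itself is illustrative:

```typescript
// Shared event envelope: every service emits this shape.
interface PlatformEvent {
  timestamp: string;
  level: "debug" | "info" | "warn" | "error";
  service: string;
  tenantId: string;
  traceId: string; // propagated across services for end-to-end correlation
  event: string;
  data: Record<string, unknown>;
}

function makeEvent(
  service: string,
  tenantId: string,
  traceId: string,
  event: string,
  data: Record<string, unknown> = {},
  level: PlatformEvent["level"] = "info",
): PlatformEvent {
  return { timestamp: new Date().toISOString(), level, service, tenantId, traceId, event, data };
}
```

A single envelope type means the queue consumer, Postgres schema, and dashboards can share one parser.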

Data temperature model

Not all data deserves the same storage tier. Configurable per data type, per tenant.

Temp | Store | Default retention
Hot | Postgres (indexed) | 3-90 days
Warm | Postgres (partitioned) | 30 days - 1 year
Cold | R2 (compressed) | 1-2 years
Purge | Deleted | Per policy
  • Enterprise tenants negotiate custom retention (SOC2, healthcare)
  • Configs: always hot, old versions → cold (R2) as snapshots
  • Rolling is a platform ops background job, never tenant-facing
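A sketch of how the per-tenant knob might resolve. The tier names mirror the table; the default boundary values and the override shape are assumptions for illustration, not spec values:

```typescript
// Temperature tiers from the table above.
type Temperature = "hot" | "warm" | "cold" | "purge";

interface RetentionPolicy {
  hotDays: number;  // Postgres, indexed
  warmDays: number; // Postgres, partitioned
  coldDays: number; // R2, compressed
}

// Illustrative defaults within the ranges in the table.
const DEFAULTS: RetentionPolicy = { hotDays: 30, warmDays: 365, coldDays: 730 };

function tierForAge(ageDays: number, policy: RetentionPolicy = DEFAULTS): Temperature {
  if (ageDays <= policy.hotDays) return "hot";
  if (ageDays <= policy.warmDays) return "warm";
  if (ageDays <= policy.coldDays) return "cold";
  return "purge"; // past all tiers: deleted per policy
}

// Enterprise tenants can override the defaults (e.g. longer cold storage).
function resolvePolicy(override?: Partial<RetentionPolicy>): RetentionPolicy {
  return { ...DEFAULTS, ...override };
}
```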

Codebase: multi-repo

Contributors must not see the whole codebase. Each service deploys independently.

Repository | Purpose | Deploys to
n8r-contracts | Shared types, interfaces, schemas | npm (@n8r/contracts)
n8r-gateway | Edge gateway Worker | Cloudflare
n8r-injector | Pipeline runtime Worker | Cloudflare
n8r-saas | Console + Platform + Ops | Fly.io
n8r-origins | CLI, SDK, WebSocket clients | npm / standalone
n8r-apps | HR-Chatbot, IT-Helpdesk, etc. | Per-app

The rule

Services only import from @n8r/contracts. Never from each other.

@n8r/contracts — the shared interface

n8r-contracts/
├── modules/
│   ├── origin.ts          OriginModule interface
│   ├── inspector.ts       InspectorModule interface
│   ├── destination.ts     DestinationModule interface
│   └── observer.ts        ObserverModule interface
├── events/
│   ├── taxonomy.ts        All PipelineEvent types
│   └── schemas.ts         JSON Schema validators
├── api/
│   ├── gateway.ts         Gateway ↔ service contracts
│   ├── telemetry.ts       Telemetry event shapes
│   └── config.ts          Config document shapes
└── identity/
    ├── passport.ts        Passport, PassportView
    └── ticket.ts          Ticket, Clip

Types + schemas + validators. Never implementation.
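An illustrative slice of what such a contracts module could export: types plus a lightweight runtime validator, no implementation. The InspectorModule and verdict shapes here are guesses at the interfaces named in the tree, for illustration only:

```typescript
// Hypothetical slice of @n8r/contracts: types and validators, never implementation.
interface InspectorVerdict {
  decision: "allow" | "block";
  confidence: number; // 0..1
  reason?: string;
}

interface InspectorModule {
  readonly name: string;
  inspect(input: { text: string; tenantId: string }): Promise<InspectorVerdict>;
}

// Contracts can ship runtime guards alongside the compile-time types,
// so services validate each other's payloads without importing each other.
function isInspectorVerdict(v: unknown): v is InspectorVerdict {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return (o.decision === "allow" || o.decision === "block")
    && typeof o.confidence === "number"
    && o.confidence >= 0 && o.confidence <= 1;
}
```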

Current → target migration

Current (n8r-v2) | Target
src/types/ | n8r-contracts
src/core/ + src/modules/ | n8r-injector
src/api/ + src/pages/ | n8r-saas
Auth middleware | n8r-gateway
D1 database | Postgres (Fly.io)
Single wrangler.toml | Per-service configs
In-memory event bus | Cloudflare Queues → Fly.io consumer
No cold storage | R2
No private networking | Fly.io private network + WireGuard

12 platform decisions

# | Decision
PD-1 | Cloudflare is the only public surface
PD-2 | Postgres is the single source of truth (no D1)
PD-3 | KV is the edge read cache (TTL-based)
PD-4 | Queues are the async write buffer
PD-5 | Multi-repo with per-repo access control
PD-6 | Services only import from @n8r/contracts
PD-7 | Four trust boundaries with distinct auth models
PD-8 | Structured logging standard across all services
PD-9 | Temperature model — configurable per data type, per tenant
PD-10 | Edge Workers are stateless
PD-11 | Injectionators keep running if core is unreachable
PD-12 | API keys use two-key rotation window
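PD-12's rotation window can be sketched as follows. Key storage and hashing are omitted for brevity, and the field names are illustrative:

```typescript
// Two-key rotation: during a rotation window, both the current and the
// previous key validate, so tenants can switch keys without downtime.
interface TenantKeys {
  current: string;
  previous?: string;          // set only during a rotation window
  previousExpiresAt?: number; // epoch ms when the old key stops working
}

function isValidApiKey(presented: string, keys: TenantKeys, now: number): boolean {
  if (presented === keys.current) return true;
  // The old key is honored only inside the rotation window.
  return keys.previous !== undefined
    && presented === keys.previous
    && keys.previousExpiresAt !== undefined
    && now < keys.previousExpiresAt;
}
```

In production the comparison would be against stored hashes, not cleartext keys.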

Sub-specs: build order

# | Sub-spec | Why this order
1 | SaaS Console | Foundation — Fly.io + Postgres + Hono
2 | Edge Gateway | Public surface — routing + API keys
3 | Injectionator Runtime | Data plane — stateless Workers + KV + Queues
4 | Secure Comms | Hardens connections from 1-3
5 | Observability | Telemetry pipeline, builds on event flows
6 | Platform Ops | Staff tooling, needs everything running

Each sub-spec goes through its own brainstorm → design → plan → implement cycle.

Next steps

  1. Review this spec (docs/specs/platform-architecture.md)
  2. Pick up sub-spec #1 — SaaS Console & Platform (spec-saas-console.md)
  3. Brainstorm it into a full spec
  4. Plan and implement — Fly.io app, Postgres schema, Hono server
  5. Repeat for each sub-spec in order

The full spec, all 6 sub-spec skeletons, and this deck are committed on feat/internal-mvp.

Questions?

Full spec: docs/specs/platform-architecture.md
Evolution spec (application architecture): docs/specs/architecture-evolution-design.md

Both specs are peer documents:

  • Evolution spec = what the software does (pipeline, modules, passports)
  • Platform spec = where it runs and how it's operated (infra, data, security)