This page documents the on-demand course snapshot backfill flow: how missing courses are resolved at runtime, how snapshots are used immediately in the UI, and how backfill candidates are collected and processed by a delayed pipeline workflow.

User-triggered snapshots are never promoted to canonical data automatically.

Problem statement#

Some courses referenced by students are missing from Sisukas because they are:

  • no longer present in the active Sisu course catalog, and
  • not included in the prebuilt historical dataset used by Sisukas.

These are typically older or retired courses whose course codes still appear in:

  • degree requirements,
  • archived study guides,
  • student plans imported from transcripts.

Course unit IDs are not stable over time, while course codes are user-facing and long-lived: a code can outlive every course unit it ever pointed to. A missing course is therefore treated as a data gap, not an error.

Design goals#

The snapshot backfill mechanism is designed to:

  • allow users to recover missing courses immediately
  • avoid polluting canonical course datasets
  • keep ingestion deterministic and replayable
  • maintain a strict separation between signal and source of truth

High-level architecture#

  1. Runtime resolution
    Missing courses are resolved via sisu-wrapper and stored as snapshots.

  2. Candidate logging
    Each successful resolution emits a backfill candidate record.

  3. Delayed pipeline workflow
    Candidates are processed after a fixed time window and fed into historical data maintenance.

sisu-wrapper: course resolution endpoint#

Sisukas relies on a dedicated resolver endpoint to fetch archived or inactive course metadata.

Endpoint#

GET /v1/courses/resolve

Parameters#

  • course_code (required) User-facing course code (e.g. CS-A1140).

  • lang (optional) Preferred language for display fields. All available language variants are still returned.
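A representative request, using only the parameters above:

GET /v1/courses/resolve?course_code=CS-A1140&lang=en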

Resolution outcomes#

The resolver returns a status, not just raw data:

  • active – course exists in the active catalog
  • historical – course exists in the historical dataset
  • archived – course found only via broader archival resolution (outside both the active catalog and the historical dataset)
  • ambiguous – multiple plausible matches (e.g. reused codes)
  • not_found – no matching course could be resolved

Each response includes:

  • resolution status and confidence
  • one or more candidate matches
  • provenance (source endpoints, retrieval time)
  • raw payload for audit and debugging

The resolver does not decide whether a course is canonical. It only reports what can be resolved and with what confidence.
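A sketch of the response shape in TypeScript, inferred from the fields listed above; the exact field names and the confidence scale are assumptions, not the wire format:

type ResolutionStatus =
  | "active"      // course exists in the active catalog
  | "historical"  // course exists in the historical dataset
  | "archived"    // course found via archival resolution
  | "ambiguous"   // multiple plausible matches
  | "not_found";  // no matching course

interface CandidateMatch {
  courseUnitId: string;                  // e.g. "aalto-CUR-209198-3124630"
  courseCode: string;                    // e.g. "CS-A1140"
  name: Record<string, string>;          // all language variants are returned
  validityPeriod?: { start?: string; end?: string };
  credits?: { min: number; max?: number };
}

interface ResolveResponse {
  status: ResolutionStatus;              // resolution status
  confidence: "high" | "medium" | "low"; // assumed scale
  matches: CandidateMatch[];             // one or more candidate matches
  provenance: {
    sourceEndpoints: string[];           // which upstream endpoints were consulted
    retrievedAt: string;                 // ISO 8601 retrieval time
  };
  raw: unknown;                          // raw payload for audit and debugging
}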

Snapshot storage in Sisukas#

Resolved courses are stored as snapshots, not as canonical course units.

Snapshots are written to a dedicated store (for example course_snapshots) containing:

  • requested course_code
  • resolution status and confidence
  • full snapshot payload
  • fetch timestamp
  • expiration timestamp (TTL)
  • request counters

This ensures:

  • snapshots remain clearly distinguishable from curated data
  • outdated or incorrect data expires automatically
  • repeated requests for the same course are coalesced
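A sketch of a course_snapshots row mirroring the fields above (names are illustrative; ResolutionStatus as in the resolver sketch):

interface CourseSnapshot {
  courseCode: string;       // requested course_code
  status: ResolutionStatus; // resolution status
  confidence: string;       // resolver confidence
  payload: unknown;         // full snapshot payload
  fetchedAt: string;        // fetch timestamp, ISO 8601
  expiresAt: string;        // expiration timestamp (TTL)
  requestCount: number;     // coalesces repeated requests for the same course
}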

UI behavior#

Missing course state#

When a course code is not found in canonical datasets, the UI shows:

  • a clear “course not in catalog” message
  • a primary action to fetch an archived snapshot

Missing data is treated as a recoverable state, not a failure.

Snapshot-backed course pages#

If a snapshot exists:

  • the course page is rendered from snapshot data
  • a visible badge indicates the course is archived / snapshot-based
  • features requiring realisations or scheduling data may be disabled

Snapshots remain usable for:

  • favourites
  • plan inclusion
  • transcript-derived references

Ambiguous resolution#

If the resolver returns multiple matches:

  • the user is prompted to select the correct version

  • options are differentiated by:

    • validity period
    • name
    • credits (if available)

The selected match is persisted to keep subsequent visits stable.
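Taken together, these UI states suggest a small discriminated union driving the course page; a sketch with illustrative names (CandidateMatch and CourseSnapshot as in the sketches above):

type CoursePageState =
  | { kind: "canonical"; course: unknown }            // normal catalog-backed page
  | { kind: "missing"; courseCode: string }           // offers "fetch archived snapshot"
  | { kind: "ambiguous"; options: CandidateMatch[] }  // user selects by validity period, name, credits
  | { kind: "snapshot"; snapshot: CourseSnapshot };   // rendered from snapshot, with archived badge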

Expiration and refresh#

Snapshots are temporary by design.

  • archived or historical snapshots use a longer TTL
  • not_found results may use a short TTL to avoid repeated lookups
  • expired snapshots are re-fetched on demand

This prevents the snapshot store from becoming a second historical database.
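A sketch of per-status TTL selection and the expiry check; the concrete durations are illustrative, since the text only prescribes "longer" versus "short":

// Illustrative values only; real TTLs are configuration.
function snapshotTtlMs(status: ResolutionStatus): number {
  switch (status) {
    case "archived":
    case "historical":
      return 30 * 24 * 3600 * 1000; // longer TTL, e.g. 30 days
    case "not_found":
      return 24 * 3600 * 1000;      // short negative-cache TTL, e.g. 1 day
    default:
      return 7 * 24 * 3600 * 1000;  // fallback, e.g. 7 days
  }
}

// Expired snapshots are simply re-fetched on demand.
function isExpired(snapshot: CourseSnapshot, now: Date = new Date()): boolean {
  return now.getTime() >= Date.parse(snapshot.expiresAt);
}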

Backfill candidate logging#

When Sisukas successfully resolves a snapshot (any status other than not_found), it emits a backfill candidate record.

Candidate records are:

  • append-only
  • lightweight (metadata only)
  • time-partitioned

These records are signals, not authoritative data. Canonical datasets are updated only via a separate maintenance pipeline.

Backfill candidate storage layout#

Candidates are written to Google Cloud Storage as daily JSONL files:

gs://sisukas-backfill-candidates/YYYY/MM/DD.jsonl

Each line represents a single resolution event:

{
  "requested_at": "2026-01-29T00:41:12Z",
  "course_code": "CS-A1140",
  "resolver_status": "archived",
  "resolver_confidence": "high",
  "course_unit_ids": ["aalto-CUR-209198-3124630"],
  "snapshot_hash": "sha256:...",
  "sisukas_version": "0.9.3",
  "source": "user-triggered"
}

Files are append-only and never modified after the day closes.
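A sketch of how the daily object path and a candidate line are derived from a resolution event (helper names are hypothetical; the layout follows the bucket scheme above):

// Hypothetical helper: map an event time to its daily JSONL object.
function candidateObjectPath(requestedAt: Date): string {
  const y = requestedAt.getUTCFullYear();
  const m = String(requestedAt.getUTCMonth() + 1).padStart(2, "0");
  const d = String(requestedAt.getUTCDate()).padStart(2, "0");
  return `gs://sisukas-backfill-candidates/${y}/${m}/${d}.jsonl`;
}

// One resolution event becomes exactly one appended JSONL line.
function candidateLine(record: object): string {
  return JSON.stringify(record) + "\n";
}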

Time windowing and sliding#

The backfill pipeline operates on fixed time windows, not individual events.

Write window#

  • candidates are written during a fixed window (e.g. one calendar day)
  • files are open for appends only

Processing delay (X)#

  • files are not processed immediately

  • a fixed delay X (e.g. 24–48 hours) is applied to:

    • absorb duplicates
    • avoid race conditions
    • allow safe replays

Processing window#

At now ≥ file_date + X, files become eligible for processing.

This behaves like a sliding window:

  • recent files are writable
  • older files are immutable and queued
  • processed files roll out of the active window
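The eligibility rule reduces to a single comparison; a sketch with X expressed in hours:

// A file dated file_date becomes eligible once now ≥ file_date + X.
function isEligible(fileDate: Date, delayHours: number, now: Date = new Date()): boolean {
  return now.getTime() >= fileDate.getTime() + delayHours * 3600 * 1000;
}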

Pipeline workflow#

A scheduled pipeline (CI/CD or batch job) performs the following steps.

1. File discovery#

  • list candidate files in the bucket
  • select files older than X
  • skip files already processed
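A sketch of the discovery step using the Google Cloud Storage client; the processed-file manifest is an assumption about how "already processed" is tracked, and isEligible is the predicate sketched above:

import { Storage } from "@google-cloud/storage";

async function discoverEligibleFiles(
  delayHours: number,
  processed: Set<string>, // assumed manifest of already-handled object names
): Promise<string[]> {
  const storage = new Storage();
  const [files] = await storage.bucket("sisukas-backfill-candidates").getFiles();
  return files
    .map((f) => f.name)                     // e.g. "2026/01/29.jsonl"
    .filter((name) => !processed.has(name)) // skip files already processed
    .filter((name) => {
      const [y, m, d] = name.replace(".jsonl", "").split("/").map(Number);
      return isEligible(new Date(Date.UTC(y, m - 1, d)), delayHours); // older than X
    });
}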

2. Deduplication and aggregation#

Across the selected window:

  • deduplicate by (course_code, course_unit_id)
  • count occurrences to measure demand
  • prefer the most recent snapshot hash

This converts noisy user requests into stable candidates.
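A sketch of the aggregation step over one window, keyed by (course_code, course_unit_id); field names follow the JSONL example above:

interface CandidateEvent {
  requested_at: string;
  course_code: string;
  course_unit_ids: string[];
  snapshot_hash: string;
}

interface AggregatedCandidate {
  courseCode: string;
  courseUnitId: string;
  occurrences: number;        // demand signal
  latestRequestedAt: string;
  latestSnapshotHash: string; // most recent hash wins
}

function aggregate(events: CandidateEvent[]): AggregatedCandidate[] {
  const byKey = new Map<string, AggregatedCandidate>();
  for (const e of events) {
    for (const unitId of e.course_unit_ids) {
      const key = `${e.course_code}\u0000${unitId}`; // deduplicate by (course_code, course_unit_id)
      const prev = byKey.get(key);
      if (!prev) {
        byKey.set(key, {
          courseCode: e.course_code,
          courseUnitId: unitId,
          occurrences: 1,
          latestRequestedAt: e.requested_at,
          latestSnapshotHash: e.snapshot_hash,
        });
      } else {
        prev.occurrences += 1;                         // count occurrences to measure demand
        if (e.requested_at > prev.latestRequestedAt) { // ISO 8601 strings sort lexicographically
          prev.latestRequestedAt = e.requested_at;
          prev.latestSnapshotHash = e.snapshot_hash;   // prefer the most recent snapshot hash
        }
      }
    }
  }
  return [...byKey.values()];
}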

3. Classification#

Candidates are classified into two buckets:

Auto-acceptable

  • high resolver confidence
  • single unambiguous unit ID
  • consistent metadata

Review-required

  • ambiguous resolution
  • reused course codes
  • conflicting metadata
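A sketch of the classification rule over the aggregated candidates from step 2; how confidence and metadata consistency are computed per key is left abstract here:

type Bucket = "auto-acceptable" | "review-required";

function classify(
  resolverConfidence: string,  // from the candidate records
  distinctUnitIds: number,     // unit IDs seen for this course code in the window
  metadataConsistent: boolean, // e.g. identical snapshot hashes across events
): Bucket {
  const autoAcceptable =
    resolverConfidence === "high" && // high resolver confidence
    distinctUnitIds === 1 &&         // single unambiguous unit ID
    metadataConsistent;              // consistent metadata
  return autoAcceptable ? "auto-acceptable" : "review-required";
}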

4. Output artifacts#

The pipeline produces:

  • an accepted list for the next historical build
  • a review list for manual inspection
  • metrics (volume, popularity, failure rates)

Canonical datasets are updated only through this controlled process.

5. Post-processing#

After successful processing, files are:

  • moved to gs://sisukas-backfill-processed/, or
  • deleted after a retention period

The system remains replayable and auditable.

Guarantees and invariants#

  • Runtime requests never mutate canonical datasets
  • All pipeline inputs are append-only and time-partitioned
  • Historical builds are reproducible
  • User-triggered data is treated as signal, not truth

Why this design#

This design:

  • keeps the product responsive
  • keeps data ingestion boring and deterministic
  • avoids accidental semantic drift
  • scales naturally with usage

Most importantly, it ensures v1 stability even as historical gaps are discovered organically.