This page documents the on-demand course snapshot backfill flow: how missing courses are resolved at runtime, how snapshots are used immediately in the UI, and how backfill candidates are collected and processed by a delayed pipeline workflow.
User-triggered snapshots are not canonical by default; they can reach canonical datasets only through the delayed pipeline.
Problem statement#
Some courses referenced by students are missing from Sisukas because they are:
- no longer present in the active Sisu course catalog, and
- not included in the prebuilt historical dataset used by Sisukas.
These are typically older or retired courses whose course codes still appear in:
- degree requirements,
- archived study guides,
- student plans imported from transcripts.
Course unit IDs are not stable over time, while course codes are user-facing and long-lived. A missing course is therefore treated as a data gap, not an error.
Design goals#
The snapshot backfill mechanism is designed to:
- allow users to recover missing courses immediately
- avoid polluting canonical course datasets
- keep ingestion deterministic and replayable
- maintain a strict separation between signal and source of truth
High-level architecture#
Runtime resolution
Missing courses are resolved via sisu-wrapper and stored as snapshots.
Candidate logging
Each successful resolution emits a backfill candidate record.
Delayed pipeline workflow
Candidates are processed after a fixed time window and fed into historical data maintenance.
sisu-wrapper: course resolution endpoint#
Sisukas relies on a dedicated resolver endpoint to fetch archived or inactive course metadata.
Endpoint#
GET /v1/courses/resolve
Parameters#
- course_code (required): User-facing course code (e.g. CS-A1140).
- lang (optional): Preferred language for display fields. All available language variants are still returned.
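As a rough illustration, a request URL for this endpoint could be assembled as follows. Only the path and the two parameters come from the description above; the base URL and the helper name build_resolve_url are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL; the real sisu-wrapper host is not specified in this page.
BASE_URL = "https://sisu-wrapper.example"


def build_resolve_url(course_code: str, lang: Optional[str] = None) -> str:
    """Build a GET /v1/courses/resolve URL for a user-facing course code."""
    if not course_code:
        raise ValueError("course_code is required")
    params = {"course_code": course_code}
    # lang only affects display fields, so it is omitted unless requested.
    if lang is not None:
        params["lang"] = lang
    return f"{BASE_URL}/v1/courses/resolve?{urlencode(params)}"
```

Calling build_resolve_url("CS-A1140", "en") yields a URL whose query string is course_code=CS-A1140&lang=en.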
Resolution outcomes#
The resolver returns a status, not just raw data:
- active – course exists in the active catalog
- historical – course exists in the historical dataset
- archived – course found via broader or archival resolution
- ambiguous – multiple plausible matches (e.g. reused codes)
- not_found – no matching course could be resolved
Each response includes:
- resolution status and confidence
- one or more candidate matches
- provenance (source endpoints, retrieval time)
- raw payload for audit and debugging
The resolver does not decide whether a course is canonical. It only reports what can be resolved and with what confidence.
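A client might model the response roughly like this. The field names (status, confidence, candidates, provenance) are assumptions based on the list above, not a confirmed schema; the sketch only validates the status and keeps the raw payload for auditing.

```python
from dataclasses import dataclass, field
from typing import Any

# The five resolution outcomes described above.
STATUSES = {"active", "historical", "archived", "ambiguous", "not_found"}


@dataclass(frozen=True)
class Resolution:
    status: str
    confidence: str
    candidates: list[dict[str, Any]] = field(default_factory=list)
    provenance: dict[str, Any] = field(default_factory=dict)
    raw: dict[str, Any] = field(default_factory=dict)


def parse_resolution(payload: dict[str, Any]) -> Resolution:
    """Parse a resolver response; the key names are hypothetical."""
    status = payload.get("status")
    if status not in STATUSES:
        raise ValueError(f"unknown resolver status: {status!r}")
    return Resolution(
        status=status,
        confidence=payload.get("confidence", "unknown"),
        candidates=list(payload.get("candidates", [])),
        provenance=dict(payload.get("provenance", {})),
        raw=payload,  # kept verbatim for audit and debugging
    )
```

Note that nothing here marks a course as canonical: the parsed object only carries what the resolver reported.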
Snapshot storage in Sisukas#
Resolved courses are stored as snapshots, not as canonical course units.
Snapshots are written to a dedicated store (for example course_snapshots) containing:
- requested course_code
- resolution status and confidence
- full snapshot payload
- fetch timestamp
- expiration timestamp (TTL)
- request counters
This ensures:
- snapshots remain clearly distinguishable from curated data
- outdated or incorrect data expires automatically
- repeated requests for the same course are coalesced
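The storage behaviour above can be sketched with an in-memory stand-in for the snapshot store. The class and method names are hypothetical, and a real store would be a database table; the point is that a live snapshot is reused and its counter bumped, while an expired one triggers a new fetch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Callable


@dataclass
class Snapshot:
    course_code: str
    status: str
    confidence: str
    payload: dict[str, Any]
    fetched_at: datetime
    expires_at: datetime
    request_count: int = 1


class SnapshotStore:
    """In-memory stand-in for a course_snapshots table, keyed by course code."""

    def __init__(self) -> None:
        self._rows: dict[str, Snapshot] = {}

    def get_or_fetch(
        self,
        course_code: str,
        fetch: Callable[[str], tuple[str, str, dict[str, Any]]],
        ttl: timedelta,
        now: datetime,
    ) -> Snapshot:
        row = self._rows.get(course_code)
        if row is not None and row.expires_at > now:
            # Coalesce: reuse the live snapshot and only bump the counter.
            row.request_count += 1
            return row
        # Missing or expired: fetch again and carry the counter forward.
        status, confidence, payload = fetch(course_code)
        previous = row.request_count if row is not None else 0
        row = Snapshot(course_code, status, confidence, payload,
                       fetched_at=now, expires_at=now + ttl,
                       request_count=previous + 1)
        self._rows[course_code] = row
        return row
```

Carrying the counter across refreshes is a design choice made for this sketch; resetting it on each fetch would be equally defensible.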
UI behavior#
Missing course state#
When a course code is not found in canonical datasets, the UI shows:
- a clear “course not in catalog” message
- a primary action to fetch an archived snapshot
Missing data is treated as a recoverable state, not a failure.
Snapshot-backed course pages#
If a snapshot exists:
- the course page is rendered from snapshot data
- a visible badge indicates the course is archived / snapshot-based
- features requiring realisations or scheduling data may be disabled
Snapshots remain usable for:
- favourites
- plan inclusion
- transcript-derived references
Ambiguous resolution#
If the resolver returns multiple matches:
- the user is prompted to select the correct version
- options are differentiated by:
  - validity period
  - name
  - credits (if available)
The selected match is persisted to keep subsequent visits stable.
Expiration and refresh#
Snapshots are temporary by design.
- archived or historical snapshots use a longer TTL
- not_found results may use a short TTL to avoid repeated lookups
- expired snapshots are re-fetched on demand
This prevents the snapshot store from becoming a second historical database.
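One way to express the TTL policy is a lookup by resolver status. The concrete durations below are placeholders chosen for illustration; this page does not specify actual values.

```python
from datetime import datetime, timedelta

# Illustrative values only: longer for archived/historical, short for not_found.
TTL_BY_STATUS = {
    "archived": timedelta(days=30),
    "historical": timedelta(days=30),
    "not_found": timedelta(hours=1),
}
DEFAULT_TTL = timedelta(days=7)  # assumed fallback for other statuses


def expires_at(status: str, fetched_at: datetime) -> datetime:
    """Return the expiration timestamp for a snapshot with the given status."""
    return fetched_at + TTL_BY_STATUS.get(status, DEFAULT_TTL)


def is_expired(status: str, fetched_at: datetime, now: datetime) -> bool:
    """An expired snapshot should be re-fetched on the next request."""
    return now >= expires_at(status, fetched_at)
```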
Backfill candidate logging#
When Sisukas successfully resolves a snapshot (any status other than not_found), it emits a backfill candidate record.
Candidate records are:
- append-only
- lightweight (metadata only)
- time-partitioned
These records are signals, not authoritative data. Canonical datasets are updated only via a separate maintenance pipeline.
Backfill candidate storage layout#
Candidates are written to Google Cloud Storage as daily JSONL files:
gs://sisukas-backfill-candidates/YYYY/MM/DD.jsonl
Each line represents a single resolution event (pretty-printed here for readability; stored as one line in the file):
{
  "requested_at": "2026-01-29T00:41:12Z",
  "course_code": "CS-A1140",
  "resolver_status": "archived",
  "resolver_confidence": "high",
  "course_unit_ids": ["aalto-CUR-209198-3124630"],
  "snapshot_hash": "sha256:...",
  "sisukas_version": "0.9.3",
  "source": "user-triggered"
}
Files are append-only and never modified after the day closes.
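The layout above could be produced along these lines. The function names are hypothetical, partitioning by UTC day is an assumption, and the actual upload to Cloud Storage is left out; the sketch only derives the object path and serialises one record per line.

```python
import json
from datetime import datetime, timezone
from typing import Optional

BUCKET = "sisukas-backfill-candidates"  # bucket name from the layout above


def candidate_object_path(requested_at: datetime) -> str:
    """Daily partition path; using the UTC calendar day is an assumption."""
    day = requested_at.astimezone(timezone.utc)
    return f"gs://{BUCKET}/{day:%Y/%m/%d}.jsonl"


def candidate_line(requested_at: datetime, course_code: str, status: str,
                   confidence: str, course_unit_ids: list[str],
                   snapshot_hash: str, sisukas_version: str) -> Optional[str]:
    """Serialise one resolution event as a single JSONL line.

    Returns None for not_found, since only successful resolutions are logged.
    """
    if status == "not_found":
        return None
    record = {
        "requested_at": requested_at.astimezone(timezone.utc)
                                    .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "course_code": course_code,
        "resolver_status": status,
        "resolver_confidence": confidence,
        "course_unit_ids": course_unit_ids,
        "snapshot_hash": snapshot_hash,
        "sisukas_version": sisukas_version,
        "source": "user-triggered",
    }
    # json.dumps emits no newlines by default, so the record stays on one line.
    return json.dumps(record)
```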
Time windowing and sliding#
The backfill pipeline operates on fixed time windows, not individual events.
Write window#
- candidates are written during a fixed window (e.g. one calendar day)
- files are open for appends only
Processing delay (X)#
- files are not processed immediately
- a fixed delay X (e.g. 24–48 hours) is applied to:
  - absorb duplicates
  - avoid race conditions
  - allow safe replays
Processing window#
At now ≥ file_date + X, files become eligible for processing.
This behaves like a sliding window:
- recent files are writable
- older files are immutable and queued
- processed files roll out of the active window
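The window states can be made concrete with a small function. This sketch assumes the delay X is measured from the close of the file's UTC day, which the text does not state explicitly, and it splits closed files into "queued" (still waiting out the delay) and "eligible".

```python
from datetime import date, datetime, time, timedelta, timezone


def window_state(file_date: date, now: datetime, delay: timedelta,
                 processed: bool = False) -> str:
    """Return 'writable', 'queued', 'eligible', or 'processed' for a daily file."""
    if processed:
        return "processed"
    # Assumption: the write window closes at the next UTC midnight.
    day_close = datetime.combine(file_date + timedelta(days=1), time.min,
                                 tzinfo=timezone.utc)
    if now < day_close:
        return "writable"   # still open for appends
    if now < day_close + delay:
        return "queued"     # immutable, waiting out the delay X
    return "eligible"       # delay has elapsed; safe to process
```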
Pipeline workflow#
A scheduled pipeline (CI/CD or batch job) performs the following steps.
1. File discovery#
- list candidate files in the bucket
- select files older than X
- skip files already processed
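File discovery might look like the following, given a listing of object names relative to the bucket root. How the listing is obtained and how processed files are tracked are not described in this page, so both are passed in as plain values; the delay is expressed in whole days for simplicity.

```python
import re
from datetime import date, timedelta

# Matches the YYYY/MM/DD.jsonl layout; anything else in the bucket is ignored.
_PATH_RE = re.compile(r"^(\d{4})/(\d{2})/(\d{2})\.jsonl$")


def select_files(object_names: list[str], today: date, delay_days: int,
                 processed: set[str]) -> list[str]:
    """Pick candidate files whose day closed at least delay_days ago."""
    cutoff = today - timedelta(days=delay_days)
    selected = []
    for name in sorted(object_names):
        match = _PATH_RE.match(name)
        if match is None or name in processed:
            continue
        year, month, day = map(int, match.groups())
        # Strictly older than the cutoff: the file's own day must be closed
        # and the full delay must have passed since then.
        if date(year, month, day) < cutoff:
            selected.append(name)
    return selected
```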
2. Deduplication and aggregation#
Across the selected window:
- deduplicate by (course_code, course_unit_id)
- count occurrences to measure demand
- prefer the most recent snapshot hash
This converts noisy user requests into stable candidates.
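A minimal sketch of this step, operating on parsed candidate records with the fields shown in the storage layout. The output shape and the function name are assumptions; the logic follows the three rules above.

```python
from typing import Any


def aggregate(records: list[dict[str, Any]]) -> dict[tuple[str, str], dict[str, Any]]:
    """Deduplicate by (course_code, course_unit_id), count requests,
    and keep the snapshot hash of the most recent event."""
    groups: dict[tuple[str, str], dict[str, Any]] = {}
    for rec in records:
        for unit_id in rec["course_unit_ids"]:
            key = (rec["course_code"], unit_id)
            agg = groups.get(key)
            if agg is None:
                groups[key] = {
                    "count": 1,
                    "latest_requested_at": rec["requested_at"],
                    "snapshot_hash": rec["snapshot_hash"],
                }
                continue
            agg["count"] += 1
            # Timestamps share one UTC ISO 8601 format, so string
            # comparison orders them chronologically.
            if rec["requested_at"] > agg["latest_requested_at"]:
                agg["latest_requested_at"] = rec["requested_at"]
                agg["snapshot_hash"] = rec["snapshot_hash"]
    return groups
```

The count per key is what later feeds the popularity metrics, and the retained hash identifies which snapshot a reviewer or build step should look at.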
3. Classification#
Candidates are classified into two buckets:
Auto-acceptable
- high resolver confidence
- single unambiguous unit ID
- consistent metadata
Review-required
- ambiguous resolution
- reused course codes
- conflicting metadata
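The two buckets can be approximated with a rule function applied to all events for one course code. Treating differing snapshot hashes as a sign of conflicting metadata is an assumption made for this sketch; a real pipeline may compare specific fields instead.

```python
from typing import Any


def classify(events: list[dict[str, Any]]) -> str:
    """Classify the events for a single course code.

    Returns 'auto_accept' or 'review'.
    """
    unit_ids = {u for e in events for u in e["course_unit_ids"]}
    statuses = {e["resolver_status"] for e in events}
    confidences = {e["resolver_confidence"] for e in events}
    hashes = {e["snapshot_hash"] for e in events}

    if "ambiguous" in statuses:
        return "review"   # ambiguous resolution
    if len(unit_ids) != 1:
        return "review"   # reused course code, or no unit ID at all
    if confidences != {"high"}:
        return "review"   # anything below high confidence
    if len(hashes) > 1:
        return "review"   # assumed proxy for conflicting metadata
    return "auto_accept"
```

The function defaults to review whenever any rule fails, which keeps the auto-accept path narrow.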
4. Output artifacts#
The pipeline produces:
- an accepted list for the next historical build
- a review list for manual inspection
- metrics (volume, popularity, failure rates)
Canonical datasets are updated only through this controlled process.
5. Post-processing#
After successful processing, files are:
- moved to gs://sisukas-backfill-processed/, or
- deleted after a retention period
The system remains replayable and auditable.
Guarantees and invariants#
- Runtime requests never mutate canonical datasets
- All pipeline inputs are append-only and time-partitioned
- Historical builds are reproducible
- User-triggered data is treated as signal, not truth
Why this design#
This design:
- keeps the product responsive
- keeps data ingestion boring and deterministic
- avoids accidental semantic drift
- scales naturally with usage
Most importantly, it ensures v1 stability even as historical gaps are discovered organically.