This describes how we build a historical course realisation dataset (course instances / implementations) from Sisu + Aalto Course API, and how we process it into a frontend-friendly JSON file.
The pipeline is designed to be:
- interruptible (Ctrl+C at any time)
- resumable (continue from last checkpoint)
- append-only (JSONL crawl output)
- dedupable (post-pass by CUR id)
- normalizable (reduce to the same core fields as active
courses.json)
Why we keep a historical course record#
Originally, Sisukas only needed live Sisu data, since the focus was on discovering and planning upcoming courses. With the introduction of the year timeline, this assumption no longer holds.
Users now want to see and reason about past academic years, including courses they have already taken. However, Sisu’s published APIs are optimized for current and upcoming teaching, and older course implementations gradually disappear from the available endpoints.
The historical course record exists to support this shift. By preserving past course implementations as a stable dataset, Sisukas can display coherent timelines across years, allow users to keep track of completed courses, and provide context about how courses are actually offered over time. Storing this data also keeps the frontend fast and predictable, without relying on fragile or incomplete historical queries against upstream APIs.
Data sources#
1) Sisu KORI API (public)#
Used to discover historical course unit realisation ids (CUR ids).
Base:
https://sisu.aalto.fi/kori/apiWe use:
GET /course-units/{courseUnitId}to get completion methods →assessmentItemIdsGET /course-unit-realisations?assessmentItemId=...to list realisations (unfiltered, includes older years)
2) Aalto Course API (requires key)#
Used to fetch the detailed historical payload for each CUR id.
Base:
https://course.api.aalto.fi/api/sisu/v1We use:
GET /courseunitrealisations/{CUR_ID}?USER_KEY=...
The API key must be provided via environment variable (see below).
Output artifacts#
We produce three files:
Crawl output (append-only)
historical_realisations.jsonlOne JSON object per line (raw historical record).Deduped crawl output
historical_realisations_deduped.jsonlSame format, but duplicates removed byid(CUR id).Final normalized dataset
historical_courses.jsonA flat JSON array of reduced course objects (frontend-friendly).
0. Prerequisites#
Set the Course API key:
export AALTO_COURSE_API_KEY="..."Make sure you run from a place where python -m sisu_wrapper resolves to the local package.
1. Crawl: fetch historical records (resumable)#
This step:
reads
courses.jsonextracts unique
courseUnitIdsfor each course unit:
- fetches course unit metadata (assessment items)
- fetches all course-unit realisations for those assessment items
- filters them by date (by start date)
- fetches full records from Course API by CUR id
appends each fetched record to JSONL output
writes a checkpoint after each course unit (or every N course units)
Run:
cd sisu-wrapper
python -m sisu_wrapper \
--courses-json ../courses.json \
--out ../historical_realisations.jsonl \
--state ../historical_state.jsonResume after interruption#
If you Ctrl+C during the crawl, a checkpoint is written and the program exits cleanly.
Resume:
python -m sisu_wrapper \
--courses-json ../courses.json \
--out ../historical_realisations.jsonl \
--state ../historical_state.json \
--resumeRelevant flags#
--outJSONL file to append records to.--stateCheckpoint state JSON, stores thenext_indexto resume from.--resumeUsesnext_indexfrom--stateif available.--start-index NOverride resume state and start from a specific 0-based courseUnitId index.--from-date YYYY-MM-DD/--to-date YYYY-MM-DDDate filter for realisations (based on the realisation start date). Default range is configured in__main__.py(currently ends at 2025-12-31).
Notes on 404s (expected)#
Some CUR ids returned from Sisu discovery may not exist in the Course API. You may see errors like:
404 Client Error: Not Found for url: ... /courseunitrealisations/{CUR}
These are expected and are skipped (we continue crawling).
2. Deduplicate JSONL output#
Because the crawl is resumable and the output is append-only, resuming may cause some duplicates. Deduping is done by the record id (CUR id).
Run:
python scripts/dedupe_jsonl.py \
../historical_realisations.jsonl \
../historical_realisations_deduped.jsonlThis keeps the first occurrence of each CUR id and discards the rest.
3. Normalize to frontend format (flat list)#
The deduped JSONL is still “raw-ish”. We normalize it to match the minimal schema we use for active course objects.
Recommended output: a flat JSON array (not keyed/mapped).
Example schema (subset):
id,code,startDate,endDate,typenamesummary.prerequisites,summary.levelorganizationName,credits,courseUnitIdlanguageOfInstructionCodes,teachersenrolmentStartDate,enrolmentEndDate
Run your normalizer (or the flat transformer) to produce:
python scripts/transform_historical_flat.py \
../historical_realisations_deduped.jsonl \
../historical_courses.json \
--from 2022-09-01 --to 2025-12-31(If your transformer expects JSON rather than JSONL, either adapt it to read JSONL or convert JSONL → list first.)
Why a flat list?#
- simpler frontend store shape (
HistoricalCoursesStore: Course[]) - easy to filter/sort/search
- cheap to stream or chunk later
Suggested operational workflow#
For a full rebuild:
# 1) Crawl (can be interrupted/resumed)
python -m sisu_wrapper --courses-json ../courses.json --out ../historical_realisations.jsonl --state ../historical_state.json
# 2) Deduplicate
python scripts/dedupe_jsonl.py ../historical_realisations.jsonl ../historical_realisations_deduped.jsonl
# 3) Normalize
python scripts/transform_historical_flat.py ../historical_realisations_deduped.jsonl ../historical_courses.json --from 2022-09-01 --to 2025-12-31For incremental runs (when tweaking logic), keep the JSONL + dedupe pattern—don’t try to rebuild one giant JSON in-memory during crawl.
Debugging tips#
Test on a small slice first:
- use
--limit-course-units 20(if supported) - or
--start-indexwith a short range
- use
If you’re seeing rate limits / slowdowns:
- increase timeouts
- consider adding a small sleep or backoff
- later improvement: parallelize Course API fetches with a small thread pool + max concurrency cap