Scheduler implementation
Implements a per-CPU round-robin scheduler for Strat9-OS with support for cooperative and preemptive multitasking.
Preemption design
The timer interrupt (100 Hz) calls maybe_preempt(), which picks the next
task and performs a context switch. Interrupts are disabled while the
scheduler lock is held to prevent deadlock on single-core systems (a timer
interrupt arriving while the lock is held would otherwise spin forever on a
lock that the interrupted code can never release):
- yield_task(): CLI → lock → pick next → TSS/CR3 → unlock → switch_context → restore IF
- Timer handler: CPU already cleared IF → lock → pick next → TSS/CR3 → unlock → switch_context
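The two sequences above differ only in who clears the interrupt flag. The sketch below is a single-threaded user-space model, not the kernel code. `Cpu::interrupts_enabled` stands in for RFLAGS.IF and a `std::sync::Mutex` stands in for the kernel spin lock. Both are assumptions made for illustration. The function names mirror this module's, but the bodies only record the order of steps and check the invariant "the scheduler lock is never taken with IF set".

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Simulated CPU state; `interrupts_enabled` stands in for RFLAGS.IF (model only).
struct Cpu {
    interrupts_enabled: bool,
    trace: Vec<&'static str>,
}

/// Round-robin ready queue behind a lock (a Mutex standing in for the kernel spin lock).
struct RunQueue {
    ready: Mutex<VecDeque<u32>>,
}

impl RunQueue {
    /// Requeue `current`, pick the next task, and release the lock before returning.
    fn pick_next(&self, cpu: &mut Cpu, current: u32) -> u32 {
        // Design invariant: the scheduler lock is never taken with IF set.
        assert!(!cpu.interrupts_enabled, "scheduler lock taken with IF set");
        let mut ready = self.ready.lock().unwrap();
        cpu.trace.push("lock");
        ready.push_back(current);
        let next = ready.pop_front().expect("queue holds at least `current`");
        cpu.trace.push("pick next");
        cpu.trace.push("TSS/CR3"); // stands in for the TSS and CR3 updates
        drop(ready);
        cpu.trace.push("unlock");
        next
    }
}

/// Cooperative path: clear IF ourselves, restore the saved value afterwards.
fn yield_task(cpu: &mut Cpu, rq: &RunQueue, current: u32) -> u32 {
    let saved_if = cpu.interrupts_enabled;
    cpu.interrupts_enabled = false;
    cpu.trace.push("CLI");
    let next = rq.pick_next(cpu, current);
    cpu.trace.push("switch_context");
    // In the kernel this runs only when the yielding task is scheduled again.
    cpu.interrupts_enabled = saved_if;
    cpu.trace.push("restore IF");
    next
}

/// Preemptive path: the CPU already cleared IF on interrupt entry.
fn timer_handler(cpu: &mut Cpu, rq: &RunQueue, current: u32) -> u32 {
    cpu.interrupts_enabled = false; // models hardware clearing IF through an interrupt gate
    let next = rq.pick_next(cpu, current);
    cpu.trace.push("switch_context");
    // IF comes back with iretq, which this model does not simulate.
    next
}
```

Unlocking before switch_context keeps the lock hold time independent of the switch itself. Keeping IF clear across the whole window is what makes that safe on one core.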
Each task has its own 16KB kernel stack. Callee-saved registers are
pushed/popped by switch_context(). CpuContext only stores saved_rsp.
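Callee-saved registers live on each task's own kernel stack, so the only per-task field switch_context() needs is the saved stack pointer. The model below illustrates this in safe Rust. Stacks are `Vec<u64>` arrays indexed by a simulated rsp. `switch_context` pushes the six System V AMD64 callee-saved registers (rbx, rbp, r12 to r15) onto the outgoing stack and pops them from the incoming one. `Cpu`, `Task` and `CalleeSaved` are hypothetical types, and the real routine is assembly.

```rust
/// A 16 KiB kernel stack measured in 8-byte slots.
const STACK_SLOTS: usize = 16 * 1024 / 8;

/// Values of rbx, rbp, r12, r13, r14, r15 (the System V AMD64 callee-saved set).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct CalleeSaved([u64; 6]);

/// As in the design note, the saved context is nothing but a stack pointer.
struct CpuContext {
    saved_rsp: usize, // slot index; the stack grows toward index 0
}

struct Task {
    ctx: CpuContext,
    stack: Vec<u64>,
}

/// The simulated CPU: live callee-saved registers and the current rsp.
struct Cpu {
    regs: CalleeSaved,
    rsp: usize,
}

impl Task {
    /// Build a task whose stack already holds an initial register frame.
    fn new(initial: CalleeSaved) -> Self {
        let mut stack = vec![0u64; STACK_SLOTS];
        let mut rsp = STACK_SLOTS;
        for &r in initial.0.iter() {
            rsp -= 1;
            stack[rsp] = r;
        }
        Task { ctx: CpuContext { saved_rsp: rsp }, stack }
    }
}

/// Push the live registers onto `old`'s stack, save rsp, then pop `new`'s frame.
fn switch_context(cpu: &mut Cpu, old: &mut Task, new: &Task) {
    let mut rsp = cpu.rsp;
    for &r in cpu.regs.0.iter() {
        rsp -= 1;
        old.stack[rsp] = r;
    }
    old.ctx.saved_rsp = rsp; // the only state CpuContext has to keep

    let mut rsp = new.ctx.saved_rsp;
    for i in (0..6).rev() {
        // Pop in reverse push order.
        cpu.regs.0[i] = new.stack[rsp];
        rsp += 1;
    }
    cpu.rsp = rsp;
}
```

A freshly created task gets a pre-built frame from `Task::new`, so switching to it for the first time follows the same pop path as resuming one.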
TODO(v3 scheduler):
- API stabilization before adding more features:
  - freeze scheduler command syntax.
  - add a small machine-friendly output format (key=value) for scripts/debug.
- observability v2:
  - per-class latency/wait histograms.
  - one structured dump format (instead of free-form text logs) for top/debug.
- targeted scheduler tests (high priority):
  - config validation/reject paths (class/policy map).
  - ready-task migration on class-table updates.
  - SMP steal/preempt non-regression.
- only then: CPU affinity (first truly useful advanced scheduler feature).
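The "machine-friendly output format (key=value)" item could take a shape like the following sketch. `CpuMetrics`, its fields, and the `version` key are hypothetical, because the real SchedulerMetricsSnapshot fields are not documented on this page. The format emits one pair per line with keys in a fixed order, so scripts can grep or diff runs. It uses `String` for brevity. A `no_std` kernel build would more likely write through `core::fmt::Write` into a fixed buffer.

```rust
use std::fmt::Write;

/// Hypothetical per-CPU counters; not the real SchedulerMetricsSnapshot layout.
struct CpuMetrics {
    cpu: usize,
    switches: u64,
    preemptions: u64,
    try_lock_fails: u64,
}

/// Render metrics as stable `key=value` lines: fixed key order, no free-form text.
fn format_kv(tick: u64, cpus: &[CpuMetrics]) -> String {
    let mut out = String::new();
    // A format version lets scripts detect syntax changes after the freeze.
    writeln!(out, "version=1").unwrap();
    writeln!(out, "tick={}", tick).unwrap();
    writeln!(out, "cpus={}", cpus.len()).unwrap();
    for m in cpus {
        // Prefix per-CPU keys with the CPU index so every key is unique.
        writeln!(out, "cpu{}.switches={}", m.cpu, m.switches).unwrap();
        writeln!(out, "cpu{}.preemptions={}", m.cpu, m.preemptions).unwrap();
        writeln!(out, "cpu{}.try_lock_fails={}", m.cpu, m.try_lock_fails).unwrap();
    }
    out
}

/// Parse the output back into (key, value) pairs, the way a script would.
fn parse_kv(text: &str) -> Vec<(&str, &str)> {
    text.lines().filter_map(|l| l.split_once('=')).collect()
}
```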
Legacy backlog:
- class registry v2:
  - dynamic add/remove/reorder with validation and safe reject path.
  - policy->class mapping as runtime registry (not only static enum mapping).
- atomic class-table migration:
  - RCU/STW swap + migration of queued tasks across classes.
  - preserve per-task accounting (vruntime, rt budget, wake deadlines).
- balancing v2:
  - dedicated balancer module, per-class steal policy, CPU affinity masks.
  - NUMA-aware placement (future) and stronger anti-thrashing controls.
- SMP hardening:
  - explicit lock hierarchy doc + assertions.
  - improved resched IPI batching/coalescing policy tuning.
- observability v2:
  - latency/wait-time histograms per class + structured trace dump.
  - shell/top integration over stable snapshot API.
- tests:
  - deterministic migration/policy-remap/SMP-steal suites.
  - fairness/starvation long-run regression in test ISO.
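For the "validation and safe reject path" items, one approach is to validate a proposed class order completely before swapping it in. A rejected table then leaves the live one untouched. The sketch below makes several assumptions: the `SchedClass` variants, the `ClassTableError` cases, `ClassRegistry`, and the rule that Idle must come last are all hypothetical. The real configure_class_table signature and rules are not shown on this page.

```rust
use std::collections::HashSet;

/// Hypothetical scheduler classes; the real set may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum SchedClass {
    RealTime,
    Fair,
    Idle,
}

#[derive(Debug, PartialEq)]
enum ClassTableError {
    Empty,
    Duplicate(SchedClass),
    IdleNotLast,
}

/// Holds the class pick order currently in use.
struct ClassRegistry {
    pick_order: Vec<SchedClass>,
}

impl ClassRegistry {
    /// Check a proposed order without touching the live table.
    fn validate(order: &[SchedClass]) -> Result<(), ClassTableError> {
        if order.is_empty() {
            return Err(ClassTableError::Empty);
        }
        let mut seen = HashSet::new();
        for &c in order {
            if !seen.insert(c) {
                return Err(ClassTableError::Duplicate(c));
            }
        }
        // Assumed rule: Idle is picked only when nothing else is runnable.
        if let Some(pos) = order.iter().position(|&c| c == SchedClass::Idle) {
            if pos != order.len() - 1 {
                return Err(ClassTableError::IdleNotLast);
            }
        }
        Ok(())
    }

    /// Swap in the new order only if it validates; on error the old order stays live.
    fn configure(&mut self, order: &[SchedClass]) -> Result<(), ClassTableError> {
        Self::validate(order)?;
        self.pick_order = order.to_vec();
        Ok(())
    }
}
```

Migrating queued tasks and preserving per-task accounting would follow a successful `configure`. This sketch covers only the reject path.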
Optimization roadmap (stability-first, incremental):
- Lock contention reduction (highest ROI, low risk)
  - keep scheduler critical sections minimal: compute decisions under lock, execute expensive side effects (IPI, signal delivery, cleanup) after unlock.
  - split hot paths into tiny helpers with explicit “lock held / lock free” contract.
  - add/track contention counters in every try_lock fallback path.
- Wakeup path scalability (only after strong guards)
  - re-introduce deadline index behind a runtime feature flag (default OFF).
  - enforce single writer API for wake deadlines (no direct field stores in syscalls).
  - add strict invariants:
    - if task has deadline != 0, index contains task exactly once.
    - on wake/kill/exit/resume, deadline is removed from index and field cleared.
  - keep safe fallback scan path available and switchable at runtime.
- Scheduler observability for regressions
  - keep stable key=value output for scripts (scheduler metrics kv, scheduler dump kv).
  - expose blocked-task ids and per-cpu preempt causes to diagnose stalls quickly.
  - include boot-phase and lock-miss counters in all dump modes.
- Balancing/pick optimizations
  - tune steal hysteresis/cooldown with metrics, avoid ping-pong migration.
  - avoid counting idle task as runnable load for CPU selection.
  - add bounded per-tick work budgets to prevent long interrupt latency tails.
- Safety rails before each optimization lands
  - ship each optimization in one isolated patchset with rollback switch.
  - validate with targeted scenarios:
    - boot + shell responsiveness,
    - timeout-heavy workload (poll/futex/nanosleep),
    - SMP preempt/steal stress.
  - if any regression appears, disable feature first, debug second.
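The wakeup-path invariants in this roadmap require two things. A task with deadline != 0 must appear in the index exactly once. On wake/kill/exit/resume, the index entry must be removed and the field cleared together. A single-writer API can keep both halves in sync. The sketch below assumes a `BTreeSet<(deadline, task_id)>` as the index and a `HashMap` as the per-task field. `DeadlineIndex` and its methods are hypothetical names, and this is one possible structure, not the removed implementation.

```rust
use std::collections::{BTreeSet, HashMap};

type TaskId = u64;

/// Wake deadlines kept in two views; all writes go through these methods.
#[derive(Default)]
struct DeadlineIndex {
    by_task: HashMap<TaskId, u64>,    // the per-task "field" (absent means 0)
    ordered: BTreeSet<(u64, TaskId)>, // the index, sorted by deadline
}

impl DeadlineIndex {
    /// Set or replace a deadline; 0 means "no deadline".
    fn set(&mut self, task: TaskId, deadline: u64) {
        self.clear(task); // never leave a stale entry behind
        if deadline != 0 {
            self.by_task.insert(task, deadline);
            self.ordered.insert((deadline, task));
        }
    }

    /// Used on wake/kill/exit/resume: drop the index entry and clear the field together.
    fn clear(&mut self, task: TaskId) {
        if let Some(d) = self.by_task.remove(&task) {
            self.ordered.remove(&(d, task));
        }
    }

    /// Remove and return every task whose deadline is <= `now`, earliest first.
    fn expire(&mut self, now: u64) -> Vec<TaskId> {
        let expired: Vec<TaskId> = self
            .ordered
            .iter()
            .take_while(|&&(d, _)| d <= now)
            .map(|&(_, t)| t)
            .collect();
        for &t in &expired {
            self.clear(t);
        }
        expired
    }

    /// Invariant check: each task with a non-zero deadline is indexed exactly once.
    fn check_invariants(&self) -> bool {
        self.by_task.len() == self.ordered.len()
            && self.ordered.iter().all(|&(d, t)| self.by_task.get(&t) == Some(&d))
    }
}
```

The fallback scan path would still iterate over every task. Running `check_invariants` under a debug flag is one way to compare the two paths before the index is enabled by default.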
Structs
- CpuUsageSnapshot
- Scheduler - The round-robin scheduler (per-CPU queues)
- SchedulerMetricsSnapshot
- SchedulerStateSnapshot
Enums
- WaitChildResult - Result of a non-blocking wait on child exit.
Functions
- add_task - Add a task to the scheduler
- add_task_with_parent - Add a task and register a parent/child relation.
- block_current_task - Block the current task and yield to the scheduler.
- class_table - Return the scheduler class-table currently in use.
- clear_task_wake_deadline - Clear a task’s wake deadline.
- configure_class_table - Configure scheduler class pick/steal order at runtime.
- cpu_usage_snapshot - Take a CPU usage snapshot.
- create_session - Create a new session for the calling task.
- current_pgid - Get the current process group id.
- current_pid - Get the current process ID (POSIX pid).
- current_sid - Get the current session id.
- current_task_clone - Get the current task (cloned Arc), if any.
- current_task_clone_spin_debug - Debug-only blocking variant used to diagnose early ring3 entry stalls.
- current_task_clone_try - Best-effort, non-blocking variant of current_task_clone.
- current_task_id - Get the current task’s ID (if any task is running).
- current_task_id_try - Get the current task’s ID without blocking (safe for exceptions).
- current_tid - Get the current thread ID (POSIX tid).
- debug_scheduler_lock_addr - Returns the scheduler lock address for deadlock tracing.
- exit_current_task - Mark the current task as Dead and yield to the scheduler.
- finish_switch - Called immediately after a context switch completes (in the new task’s context). This safely re-queues the previously running task now that its state is fully saved.
- flush_deferred_silo_cleanups
- get_all_tasks - Get a list of all tasks in the system (for timer checking). Returns None if the scheduler is not initialized or is currently locked.
- get_parent_id - Get parent task ID for a child task.
- get_parent_pid - Get parent process ID for a child task.
- get_pgid_by_pid - Resolve a PID to the current process group id.
- get_sid_by_pid - Resolve a PID to the current session id.
- get_task_by_id - Get a task by its TaskId (if still registered).
- get_task_by_pid - Resolve a POSIX pid to the corresponding task.
- get_task_id_by_pid - Resolve a POSIX pid to internal TaskId.
- get_task_id_by_tid - Resolve a POSIX tid to the corresponding internal task id.
- get_task_ids_in_pgid - Collect task IDs that currently belong to process group pgid.
- init_scheduler - Initialize the scheduler
- kill_task - Kill a task by ID (best-effort).
- log_state - Dump per-cpu scheduler queues for tracing/debug.
- maybe_preempt - Called from the timer interrupt handler (or a resched IPI) to potentially preempt the current task.
- note_try_lock_fail - Record a failed try_lock attempt.
- reset_scheduler_metrics - Reset the scheduler metrics.
- resume_task - Resume a previously suspended task by ID.
- schedule - Start the scheduler (called from kernel_main)
- schedule_on_cpu - Performs the schedule-on-CPU operation.
- scheduler_metrics_snapshot - Take a snapshot of the scheduler metrics.
- set_process_group - Set process group id for target_pid (or current if None).
- set_task_sched_policy - Update a task’s scheduling policy and requeue if needed.
- set_task_wake_deadline - Set a task’s wake deadline.
- set_verbose - Enable or disable verbose scheduler tracing.
- state_snapshot - Structured scheduler state snapshot for shell/top/debug tooling.
- suspend_task - Suspend a task by ID (best-effort).
- ticks - Get the current tick count
- timer_tick - Timer interrupt handler, called from interrupt context.
- try_wait_child - Try to reap a zombie child.
- verbose_enabled - Return current verbose tracing state.
- wake_task - Wake a blocked task by its ID.
- yield_task - Yield the current task to allow other tasks to run (cooperative).
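finish_switch is described as re-queueing the previously running task only once its state is fully saved. Queueing it any earlier could let another CPU steal the task and start running it on a kernel stack that switch_context is still using. The sketch below is a minimal single-CPU model of that hand-off. It uses a hypothetical per-CPU `prev_pending` slot, and `PerCpu`, `pick_and_switch` and `steal` are illustrative names, not this module's API.

```rust
use std::collections::VecDeque;

type TaskId = u32;

/// Hypothetical per-CPU state: run queue plus the task awaiting re-queue.
struct PerCpu {
    run_queue: VecDeque<TaskId>,
    current: TaskId,
    prev_pending: Option<TaskId>,
}

impl PerCpu {
    /// Scheduling decision: pick `next`, but only stash `current` instead of queueing it.
    fn pick_and_switch(&mut self) -> Option<TaskId> {
        let next = self.run_queue.pop_front()?;
        self.prev_pending = Some(self.current);
        self.current = next;
        // switch_context(prev, next) would run here; prev's stack is still in use.
        Some(next)
    }

    /// Runs on the new task's stack after the switch: prev is now safe to make runnable.
    fn finish_switch(&mut self) {
        if let Some(prev) = self.prev_pending.take() {
            self.run_queue.push_back(prev);
        }
    }

    /// Models another CPU stealing work: it can only see tasks already in the queue.
    fn steal(&mut self) -> Option<TaskId> {
        self.run_queue.pop_back()
    }
}
```

In this model a stealer that runs between `pick_and_switch` and `finish_switch` finds nothing, because the outgoing task is parked in `prev_pending` rather than in the run queue.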