Module scheduler


Scheduler implementation

Implements a per-CPU round-robin scheduler for Strat9-OS with support for cooperative and preemptive multitasking.

§Preemption design

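The timer interrupt (100 Hz) calls maybe_preempt(), which may pick the next task and perform a context switch. Interrupts are disabled while the scheduler lock is held to prevent deadlock on single-core systems: if a timer interrupt fired while the lock was held, the handler would try to take the same lock on the same CPU and never acquire it. The two entry paths are: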

  • yield_task(): CLI → lock → pick next → TSS/CR3 → unlock → switch_context → restore IF
  • Timer handler: CPU already cleared IF → lock → pick next → TSS/CR3 → unlock → switch_context
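The ordering of the yield_task() path can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: SimCpu, RunQueue and yield_task_sketch are hypothetical names, a std Mutex stands in for the scheduler lock, and a boolean stands in for the CPU interrupt flag. It does not reproduce the real implementation; it only records the order of the steps listed above.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Simulated per-CPU state (hypothetical). `interrupts_enabled` stands in
/// for the IF flag; `trace` records the order of the steps taken.
pub struct SimCpu {
    pub interrupts_enabled: bool,
    pub trace: Vec<&'static str>,
}

/// Hypothetical run queue protected by the scheduler lock.
pub struct RunQueue {
    pub ready: VecDeque<u64>,
}

/// Sketch of the cooperative path:
/// CLI -> lock -> pick next -> TSS/CR3 -> unlock -> switch_context -> restore IF.
pub fn yield_task_sketch(cpu: &mut SimCpu, sched: &Mutex<RunQueue>) -> Option<u64> {
    // Save IF and disable interrupts before touching the lock, so a timer
    // interrupt cannot re-enter the scheduler while the lock is held.
    let saved_if = cpu.interrupts_enabled;
    cpu.interrupts_enabled = false;
    cpu.trace.push("cli");

    let next = {
        let mut rq = sched.lock().unwrap();
        cpu.trace.push("lock");
        let next = rq.ready.pop_front();
        cpu.trace.push("pick_next");
        // A real kernel would update the TSS kernel stack and CR3 here.
        cpu.trace.push("tss_cr3");
        next
    }; // the guard is dropped here, which releases the lock
    cpu.trace.push("unlock");

    // The switch happens only after the lock has been released. Re-queueing
    // the previous task is left out; the docs assign that to finish_switch.
    if next.is_some() {
        cpu.trace.push("switch_context");
    }

    // Restore the interrupt flag that was saved on entry.
    cpu.interrupts_enabled = saved_if;
    cpu.trace.push("restore_if");
    next
}
```

The timer-handler path would differ only at the edges: the CPU has already cleared IF on interrupt entry, so there is no explicit CLI and no restore step in the handler itself.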

Each task has its own 16 KiB kernel stack. switch_context() pushes and pops the callee-saved registers on that stack, so CpuContext only needs to store saved_rsp.
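To make the "only saved_rsp" point concrete, here is a minimal sketch of how an initial stack frame for a new task might be prepared. The frame layout is an assumption, not taken from the Strat9-OS source: six zeroed slots for the x86_64 System V callee-saved registers (rbx, rbp, r12 to r15) followed by a return-address slot. initial_context and CALLEE_SAVED_REGS are hypothetical names, and a slice of u64 stands in for the kernel stack memory.

```rust
/// 16 KiB per-task kernel stack, as stated in the module docs.
pub const KERNEL_STACK_SIZE: usize = 16 * 1024;

/// Per the module docs, the context stores only the saved stack pointer.
pub struct CpuContext {
    pub saved_rsp: u64,
}

/// x86_64 System V callee-saved general-purpose registers:
/// rbx, rbp, r12, r13, r14, r15.
const CALLEE_SAVED_REGS: usize = 6;

/// Build an assumed initial frame at the top of `stack` and return a context
/// pointing at it. The first switch to this task would pop the six zeroed
/// slots into the callee-saved registers and then return to `entry`.
pub fn initial_context(stack: &mut [u64], entry: u64) -> CpuContext {
    assert_eq!(stack.len() * 8, KERNEL_STACK_SIZE);
    let top = stack.len();

    // Return-address slot, consumed by the final `ret` of the switch routine.
    stack[top - 1] = entry;

    // Zeroed slots for the callee-saved registers, just below it.
    let frame_start = top - 1 - CALLEE_SAVED_REGS;
    for slot in &mut stack[frame_start..top - 1] {
        *slot = 0;
    }

    // The saved stack pointer is the address of the lowest slot in the frame.
    let saved_rsp = stack.as_ptr() as u64 + (frame_start as u64) * 8;
    CpuContext { saved_rsp }
}
```

Because every register that must survive a switch lives on the task's own stack, the per-task context reduces to a single pointer in this sketch.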

TODO(v3 scheduler):

  • API stabilization before adding more features:
    • freeze scheduler command syntax.
    • add a small machine-friendly output format (key=value) for scripts/debug.
  • observability v2:
    • per-class latency/wait histograms.
    • one structured dump format (instead of free-form text logs) for top/debug.
  • targeted scheduler tests (high priority):
    • config validation/reject paths (class/policy map).
    • ready-task migration on class-table updates.
    • SMP steal/preempt non-regression.
  • only then: CPU affinity (first truly useful advanced scheduler feature).
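As an illustration of the machine-friendly key=value output proposed in the TODO above, the following sketch renders and parses one record. MetricsSketch and its field names are hypothetical and are not the fields of SchedulerMetricsSnapshot; the single-line, space-separated layout is likewise an assumption.

```rust
/// Hypothetical counters; not the fields of `SchedulerMetricsSnapshot`.
pub struct MetricsSketch {
    pub cpu: u32,
    pub switches: u64,
    pub preemptions: u64,
    pub lock_misses: u64,
}

/// Render one record as space-separated `key=value` pairs on a single line.
pub fn to_kv(m: &MetricsSketch) -> String {
    format!(
        "cpu={} switches={} preemptions={} lock_misses={}",
        m.cpu, m.switches, m.preemptions, m.lock_misses
    )
}

/// Split a line back into (key, value) pairs, skipping tokens without `=`.
pub fn parse_kv(line: &str) -> Vec<(&str, &str)> {
    line.split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .collect()
}
```

A fixed key order and one record per line keep the output easy to diff and to process with line-oriented tools, which is the main reason to prefer it over free-form logs for scripts.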

Legacy backlog:

  • class registry v2:
    • dynamic add/remove/reorder with validation and safe reject path.
    • policy->class mapping as runtime registry (not only static enum mapping).
  • atomic class-table migration:
    • RCU/STW swap + migration of queued tasks across classes.
    • preserve per-task accounting (vruntime, rt budget, wake deadlines).
  • balancing v2:
    • dedicated balancer module, per-class steal policy, CPU affinity masks.
    • NUMA-aware placement (future) and stronger anti-thrashing controls.
  • SMP hardening:
    • explicit lock hierarchy doc + assertions.
    • improved resched IPI batching/coalescing policy tuning.
  • observability v2:
    • latency/wait-time histograms per class + structured trace dump.
    • shell/top integration over stable snapshot API.
  • tests:
    • deterministic migration/policy-remap/SMP-steal suites.
    • fairness/starvation long-run regression in test ISO.

Optimization roadmap (stability-first, incremental):

  1. Lock contention reduction (highest ROI, low risk)
    • keep scheduler critical sections minimal: compute decisions under lock, execute expensive side effects (IPI, signal delivery, cleanup) after unlock.
    • split hot paths into tiny helpers with explicit “lock held / lock free” contract.
    • add/track contention counters in every try_lock fallback path.
  2. Wakeup path scalability (only after strong guards)
    • re-introduce deadline index behind a runtime feature flag (default OFF).
    • enforce single writer API for wake deadlines (no direct field stores in syscalls).
    • add strict invariants:
      • if task has deadline != 0, index contains task exactly once.
      • on wake/kill/exit/resume, deadline is removed from index and field cleared.
    • keep safe fallback scan path available and switchable at runtime.
  3. Scheduler observability for regressions
    • keep stable key=value output for scripts (scheduler metrics kv, scheduler dump kv).
    • expose blocked-task ids and per-cpu preempt causes to diagnose stalls quickly.
    • include boot-phase and lock-miss counters in all dump modes.
  4. Balancing/pick optimizations
    • tune steal hysteresis/cooldown with metrics, avoid ping-pong migration.
    • avoid counting idle task as runnable load for CPU selection.
    • add bounded per-tick work budgets to prevent long interrupt latency tails.
  5. Safety rails before each optimization lands
    • ship each optimization in one isolated patchset with rollback switch.
    • validate with targeted scenarios:
      • boot + shell responsiveness,
      • timeout-heavy workload (poll/futex/nanosleep),
      • SMP preempt/steal stress.
    • if any regression appears, disable feature first, debug second.
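The deadline-index invariants from item 2 of the roadmap can be sketched as a small data structure with a single-writer API. Everything here is an assumption for illustration: DeadlineIndex, its methods, and the use of a BTreeSet plus a HashMap are hypothetical, and the HashMap stands in for the per-task deadline field. The sketch only shows how "exactly once in the index" and "removed and cleared together" can be enforced by routing every update through set and clear.

```rust
use std::collections::{BTreeSet, HashMap};

pub type TaskId = u64;

/// Hypothetical deadline index. `deadline_of` stands in for the per-task
/// deadline field; a missing entry means "no deadline" (deadline == 0).
#[derive(Default)]
pub struct DeadlineIndex {
    by_deadline: BTreeSet<(u64, TaskId)>,
    deadline_of: HashMap<TaskId, u64>,
}

impl DeadlineIndex {
    /// Single writer for deadlines: replaces any previous entry, so a task
    /// with a non-zero deadline appears in the index exactly once.
    pub fn set(&mut self, task: TaskId, deadline: u64) {
        self.clear(task);
        if deadline != 0 {
            self.by_deadline.insert((deadline, task));
            self.deadline_of.insert(task, deadline);
        }
    }

    /// Called on wake/kill/exit/resume: removes the index entry and clears
    /// the field in one step.
    pub fn clear(&mut self, task: TaskId) {
        if let Some(old) = self.deadline_of.remove(&task) {
            self.by_deadline.remove(&(old, task));
        }
    }

    /// Remove and return all tasks whose deadline is at or before `now`,
    /// in deadline order.
    pub fn pop_expired(&mut self, now: u64) -> Vec<TaskId> {
        let mut woken = Vec::new();
        loop {
            let head = self.by_deadline.iter().next().copied();
            match head {
                Some((deadline, task)) if deadline <= now => {
                    self.clear(task);
                    woken.push(task);
                }
                _ => break,
            }
        }
        woken
    }

    /// Check that the field view and the index view agree.
    pub fn check_invariants(&self) -> bool {
        self.by_deadline.len() == self.deadline_of.len()
            && self
                .deadline_of
                .iter()
                .all(|(&t, &d)| d != 0 && self.by_deadline.contains(&(d, t)))
    }
}
```

A fallback scan path, as the roadmap requires, would iterate over all tasks and compare their deadline fields directly; keeping check_invariants cheap makes it practical to assert in debug builds while the index is behind its feature flag.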

Structs§

CpuUsageSnapshot
Scheduler
The round-robin scheduler (per-CPU queues)
SchedulerMetricsSnapshot
SchedulerStateSnapshot

Enums§

WaitChildResult
Result of a non-blocking wait on child exit.

Functions§

add_task
Add a task to the scheduler
add_task_with_parent
Add a task and register a parent/child relation.
block_current_task
Block the current task and yield to the scheduler.
class_table
Return the scheduler class-table currently in use.
clear_task_wake_deadline
Clear a task’s wake deadline.
configure_class_table
Configure scheduler class pick/steal order at runtime.
cpu_usage_snapshot
Take a snapshot of CPU usage.
create_session
Create a new session for the calling task.
current_pgid
Get the current process group id.
current_pid
Get the current process ID (POSIX pid).
current_sid
Get the current session id.
current_task_clone
Get the current task (cloned Arc), if any.
current_task_clone_spin_debug
Debug-only blocking variant used to diagnose early ring3 entry stalls.
current_task_clone_try
Best-effort, non-blocking variant of current_task_clone.
current_task_id
Get the current task’s ID (if any task is running).
current_task_id_try
Get the current task’s ID without blocking (safe for exceptions).
current_tid
Get the current thread ID (POSIX tid).
debug_scheduler_lock_addr
Returns the scheduler lock address for deadlock tracing.
exit_current_task
Mark the current task as Dead and yield to the scheduler.
finish_switch
Called immediately after a context switch completes (in the new task’s context). This safely re-queues the previously running task now that its state is fully saved.
flush_deferred_silo_cleanups
get_all_tasks
Get a list of all tasks in the system (for timer checking). Returns None if the scheduler is not initialized or is currently locked.
get_parent_id
Get parent task ID for a child task.
get_parent_pid
Get parent process ID for a child task.
get_pgid_by_pid
Resolve a PID to the current process group id.
get_sid_by_pid
Resolve a PID to the current session id.
get_task_by_id
Get a task by its TaskId (if still registered).
get_task_by_pid
Resolve a POSIX pid to the corresponding task.
get_task_id_by_pid
Resolve a POSIX pid to internal TaskId.
get_task_id_by_tid
Resolve a POSIX tid to the corresponding internal task id.
get_task_ids_in_pgid
Collect task IDs that currently belong to process group pgid.
init_scheduler
Initialize the scheduler
kill_task
Kill a task by ID (best-effort).
log_state
Dump per-cpu scheduler queues for tracing/debug.
maybe_preempt
Called from the timer interrupt handler (or a resched IPI) to potentially preempt the current task.
note_try_lock_fail
Record a failed try_lock attempt.
reset_scheduler_metrics
Reset the scheduler metrics.
resume_task
Resume a previously suspended task by ID.
schedule
Start the scheduler (called from kernel_main)
schedule_on_cpu
Run the scheduler on a specific CPU.
scheduler_metrics_snapshot
Take a snapshot of the scheduler metrics.
set_process_group
Set process group id for target_pid (or current if None).
set_task_sched_policy
Update a task’s scheduling policy and requeue it if needed.
set_task_wake_deadline
Set a task’s wake deadline.
set_verbose
Enable or disable verbose scheduler tracing.
state_snapshot
Structured scheduler state snapshot for shell/top/debug tooling.
suspend_task
Suspend a task by ID (best-effort).
ticks
Get the current tick count
timer_tick
Timer interrupt handler - called from interrupt context.
try_wait_child
Try to reap a zombie child.
verbose_enabled
Return current verbose tracing state.
wake_task
Wake a blocked task by its ID.
yield_task
Yield the current task to allow other tasks to run (cooperative).
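Several functions above come in a non-blocking _try flavor, and the roadmap asks for contention counters in every try_lock fallback path. The pattern can be sketched in user-space Rust as follows. This is not the crate's code: SchedState, TRY_LOCK_FAILS and current_task_id_try_sketch are hypothetical names, and std::sync::Mutex stands in for the kernel's scheduler lock.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, TryLockError};

/// Hypothetical global counter of failed try_lock attempts.
pub static TRY_LOCK_FAILS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical scheduler state guarded by the scheduler lock.
pub struct SchedState {
    pub current: Option<u64>,
}

/// Best-effort lookup: never blocks. On contention it bumps the counter and
/// returns None, so callers in exception context can carry on safely.
pub fn current_task_id_try_sketch(sched: &Mutex<SchedState>) -> Option<u64> {
    match sched.try_lock() {
        Ok(guard) => guard.current,
        Err(TryLockError::WouldBlock) => {
            // Fallback path: record the miss instead of spinning.
            TRY_LOCK_FAILS.fetch_add(1, Ordering::Relaxed);
            None
        }
        // std mutexes can be poisoned by a panicking holder; a kernel
        // spinlock typically has no such state. Read through the poison.
        Err(TryLockError::Poisoned(poisoned)) => poisoned.into_inner().current,
    }
}
```

Returning None on contention rather than waiting matches the documented behavior of get_all_tasks, which also reports None when the scheduler is currently locked.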