
Reinforcement Learning Methods

JobShopLib provides Gymnasium environments for job-shop scheduling, including graph-based single-instance and multi-instance environments. The upstream package exposes SingleJobShopGraphEnv, MultiJobShopGraphEnv, reward observers, observation helpers, and rollout utilities under job_shop_lib.reinforcement_learning.

In OptPilot, the trained policy belongs to the method. The job-shop evaluator does not load the policy and does not know that reinforcement learning was used.

The integration pattern is:

OptPilot method receives job-shop case references from methodContext
-> method creates a JobShopLib RL environment for each case
-> method loads or constructs a policy
-> policy rolls out actions until the schedule is complete
-> method emits schedule-solution parameters
-> OptPilot evaluator validates and scores the schedules
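
The flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the bundled method: the case format (each job is a list of `(machine, duration)` pairs), the function names, and the greedy earliest-start rule that stands in for a trained policy are all hypothetical. Only the shape of the emitted payload follows the schedule-solution fragment used on this page.

```python
from typing import Callable

# Hypothetical case format: a list of jobs, each a list of (machine, duration) pairs.
Case = list[list[tuple[int, int]]]
Operation = dict[str, int]


def rollout_greedy(case: Case) -> list[Operation]:
    """Stand-in for a policy rollout: always schedule the job whose next
    operation can start earliest (ties go to the lower job index)."""
    next_op = [0] * len(case)            # index of each job's next unscheduled operation
    job_ready = [0] * len(case)          # end time of each job's previous operation
    machine_ready: dict[int, int] = {}   # time at which each machine becomes free
    operations: list[Operation] = []
    remaining = sum(len(job) for job in case)
    while remaining:
        best = None
        for job, ops in enumerate(case):
            if next_op[job] == len(ops):
                continue  # this job is already complete
            machine, duration = ops[next_op[job]]
            start = max(job_ready[job], machine_ready.get(machine, 0))
            if best is None or start < best[0]:
                best = (start, job, machine, duration)
        start, job, machine, duration = best
        end = start + duration
        operations.append(
            {"job": job, "operation": next_op[job], "machine": machine,
             "start": start, "end": end}
        )
        next_op[job] += 1
        job_ready[job] = end
        machine_ready[machine] = end
        remaining -= 1
    return operations


def run_method(cases: dict[str, Case],
               rollout: Callable[[Case], list[Operation]]) -> dict:
    """Roll out once per case and collect schedule-solution parameters."""
    return {
        "solutions": {name: {"operations": rollout(case)}
                      for name, case in cases.items()}
    }


# Hypothetical two-job, two-machine case used only for illustration.
toy_cases = {"toy": [[(0, 3), (1, 2)], [(0, 2), (1, 4)]]}
payload = run_method(toy_cases, rollout_greedy)
```

A real method would replace `rollout_greedy` with a rollout of the trained policy in a JobShopLib environment; the evaluator only ever sees `payload`.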

Runnable Example

The repository includes a runnable RL method:

examples/methods/job_shop_rl_stable_baselines/

Run it with the shared job-shop validation cases:

uv sync --extra examples
uv run optpilot validate examples/studies/job_shop_rl_stable_baselines.yaml
uv run optpilot run examples/studies/job_shop_rl_stable_baselines.yaml

The study uses environment_schedule_solution.yaml, so it is evaluated on the same ft06_small and la01_tiny cases and the same normalized_makespan objective as the JobShopLib solver methods.

The method trains on separate method-owned training samples:

Method settings fragment:

settings:
  algorithm: PPO
  trainInstances:
    - ../../environments/job_shop_scheduling/cases/train_tiny_a.yaml
    - ../../environments/job_shop_scheduling/cases/train_tiny_b.yaml
  maxJobs: 6
  totalTimesteps: 128
  seed: 0

trainInstances, maxJobs, totalTimesteps, and algorithm are method settings. OptPilot validates that settings is an object and passes it to the method; the RL method decides how to interpret those fields.
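
Because OptPilot passes `settings` through without interpreting it, the method has to check the fields itself. A minimal sketch of that step follows; the `TrainSettings` class, the `parse_settings` function, and the specific checks are hypothetical and are not taken from the bundled example.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical typed view of the method settings shown above."""
    algorithm: str
    train_instances: tuple[str, ...]
    max_jobs: int
    total_timesteps: int
    seed: int


def parse_settings(settings: Mapping[str, Any]) -> TrainSettings:
    """Interpret the raw settings object on the method side."""
    required = ("algorithm", "trainInstances", "maxJobs", "totalTimesteps", "seed")
    missing = [key for key in required if key not in settings]
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")

    train_instances = tuple(settings["trainInstances"])
    if not train_instances:
        raise ValueError("trainInstances must list at least one case file")

    # Assumption: both counts must be positive integers.
    for key in ("maxJobs", "totalTimesteps"):
        value = settings[key]
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer")

    return TrainSettings(
        algorithm=str(settings["algorithm"]),
        train_instances=train_instances,
        max_jobs=settings["maxJobs"],
        total_timesteps=settings["totalTimesteps"],
        seed=int(settings["seed"]),
    )


parsed = parse_settings({
    "algorithm": "PPO",
    "trainInstances": [
        "../../environments/job_shop_scheduling/cases/train_tiny_a.yaml",
        "../../environments/job_shop_scheduling/cases/train_tiny_b.yaml",
    ],
    "maxJobs": 6,
    "totalTimesteps": 128,
    "seed": 0,
})
```

Failing early with a clear message keeps configuration mistakes inside the method instead of surfacing later as a confusing training error.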

The example uses Stable-Baselines3 PPO with a small Gymnasium adapter around JobShopLib's single-instance RL environment. The adapter keeps the code focused on the OptPilot boundary: train on method-owned samples, roll out on the validation cases exposed through methodContext.references, and emit schedule-solution parameters.

What The Method Needs

An RL rollout method needs:

  • the validation case references from methodContext.references
  • a policy implementation or policy checkpoint
  • rollout settings such as deterministic action selection, maximum steps, and render mode
  • JobShopLib and the policy framework dependencies in the method runtime

Checkpoint-rollout method settings:

Method settings fragment:

settings:
  policyPath: user_catalog/methods/job_shop_rl_policy/policy.zip
  deterministic: true
  maxSteps: 10000
  renderMode: null
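
A rollout loop that honors `deterministic` and `maxSteps` might look like the sketch below. The environment and policy are toy stand-ins written for this illustration: `CountdownEnv` only mirrors the Gymnasium `reset`/`step` return shapes, and `StubPolicy.predict` only mirrors the `(action, state)` return shape of a Stable-Baselines3 model. Neither is part of JobShopLib or OptPilot.

```python
import random


class CountdownEnv:
    """Toy environment that terminates after a fixed number of steps."""

    def __init__(self, n_steps: int) -> None:
        self.n_steps = n_steps
        self.remaining = n_steps

    def reset(self, seed=None):
        self.remaining = self.n_steps
        return self.remaining, {}          # (observation, info)

    def step(self, action):
        self.remaining -= 1
        terminated = self.remaining == 0
        # (observation, reward, terminated, truncated, info)
        return self.remaining, 0.0, terminated, False, {}


class StubPolicy:
    """Toy policy exposing predict(obs, deterministic=...) -> (action, state)."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def predict(self, obs, deterministic: bool = False):
        action = 0 if deterministic else self.rng.randrange(2)
        return action, None


def rollout(env, policy, deterministic: bool = True, max_steps: int = 10000):
    """Step until the episode ends; fail loudly if max_steps is exhausted."""
    obs, _info = env.reset()
    actions = []
    for _ in range(max_steps):
        action, _state = policy.predict(obs, deterministic=deterministic)
        obs, _reward, terminated, truncated, _info = env.step(action)
        actions.append(action)
        if terminated or truncated:
            return actions
    raise RuntimeError(f"rollout did not finish within {max_steps} steps")


actions = rollout(CountdownEnv(5), StubPolicy(), deterministic=True, max_steps=10000)
```

Raising when the step budget runs out, rather than returning a partial schedule, keeps an incomplete rollout from being emitted as a candidate.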

What The Method Produces

The method should produce the same schedule-solution candidate used by the other external solver methods:

Candidate spec payload fragment:

solutions:
  ft06_small:
    operations:
      - job: 0
        operation: 0
        machine: 0
        start: 0
        end: 3
  la01_tiny:
    operations:
      - job: 0
        operation: 0
        machine: 0
        start: 0
        end: 2

The environment validates the final schedule. It does not inspect the policy, reward function, neural network weights, or rollout trace unless the method explicitly saves those as evidence.
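
To make the boundary concrete, the sketch below shows the kind of feasibility checks a schedule validator could run using only the case definition and the emitted operations. It is not the OptPilot evaluator: the `validate_schedule` function, the case format (each job is a list of `(machine, duration)` pairs), and the choice of checks are assumptions made for illustration.

```python
def validate_schedule(case, operations):
    """Return a list of violation messages; an empty list means feasible."""
    errors = []
    expected = {(job, index) for job, ops in enumerate(case)
                for index in range(len(ops))}

    # Each expected operation must appear exactly once.
    by_key = {}
    for op in operations:
        key = (op["job"], op["operation"])
        if key not in expected:
            errors.append(f"unknown operation {key}")
        elif key in by_key:
            errors.append(f"duplicate operation {key}")
        else:
            by_key[key] = op

    # Machine assignment, duration, and job precedence.
    for job, ops in enumerate(case):
        previous_end = 0
        for index, (machine, duration) in enumerate(ops):
            op = by_key.get((job, index))
            if op is None:
                errors.append(f"missing operation {(job, index)}")
                continue
            if op["machine"] != machine:
                errors.append(f"operation {(job, index)} is on the wrong machine")
            if op["end"] - op["start"] != duration:
                errors.append(f"operation {(job, index)} has the wrong duration")
            if op["start"] < previous_end:
                errors.append(f"operation {(job, index)} starts before its predecessor ends")
            previous_end = op["end"]

    # No two operations may overlap on the same machine.
    by_machine = {}
    for op in by_key.values():
        by_machine.setdefault(op["machine"], []).append(op)
    for machine, ops in by_machine.items():
        ops.sort(key=lambda item: item["start"])
        for first, second in zip(ops, ops[1:]):
            if second["start"] < first["end"]:
                errors.append(f"overlap on machine {machine}")
    return errors


# Hypothetical two-job, two-machine case and a feasible schedule for it.
toy_case = [[(0, 3), (1, 2)], [(0, 2), (1, 4)]]
good = [
    {"job": 0, "operation": 0, "machine": 0, "start": 0, "end": 3},
    {"job": 0, "operation": 1, "machine": 1, "start": 3, "end": 5},
    {"job": 1, "operation": 0, "machine": 0, "start": 3, "end": 5},
    {"job": 1, "operation": 1, "machine": 1, "start": 5, "end": 9},
]
good_errors = validate_schedule(toy_case, good)
```

None of these checks needs the policy, the reward function, or the rollout trace, which is why those stay on the method side.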

Why This Uses Schedule Solutions

The trained policy is not the candidate being evaluated by the job-shop environment. The policy is part of the method that produces a schedule for each validation case. The schedule is the environment-facing candidate.

This keeps the boundary consistent with constraint programming and metaheuristics:

| Method family | Method-owned work | Environment-facing candidate |
| --- | --- | --- |
| Constraint programming | Build and solve a CP-SAT model. | Schedule solution. |
| Metaheuristics | Improve schedules through local search. | Schedule solution. |
| Reinforcement learning | Roll out a policy in a Gymnasium environment. | Schedule solution. |

Example Scope

The bundled RL study is runnable as a small Stable-Baselines3 training-and-rollout example. It is not meant to be a strong benchmark policy. For a serious experiment, use more training cases, train for many more steps, save the policy checkpoint as method evidence, and keep the validation cases fixed.