Julia Slippi AI

Investigation

  • Overview of slippi-ai

    • an ML system designed to train AI agents to play SSB Melee competitively
      • the system implements a two-stage training pipeline
        • that begins with imitation learning from human gameplay data

        • and progresses to RL through self-play

          for detailed information on subsystems there's System Architecture, Training Systems, Evaluation Systems, and Data Processing; we'll cover these later…

  • Project Purpose and Scope

    • its predecessor relied purely on deep reinforcement learning
      • this system benefits from behavioral cloning from Slippi replay files to create agents that exhibit more human-like gameplay patterns before refining their strategies through self-play

        the pipeline from raw replay data => AI agents includes data pre-processing, NN training, evaluation frameworks, and interactive applications such as netplay integration and Twitch bot functionality

  • Training Pipeline overview

    Stage 1: imitation learning

    • the first stage uses behavioral cloning to train agents on human gameplay data extracted from Slippi replay files
      • orchestrated in `scripts/train.py` and utilizes the `trainlib` module to implement supervised learning on state-action pairs derived from professional and high-level amateur gameplay

    Stage 2: reinforcement learning

    • the second stage takes the imitation-trained policy and refines it through self-play using proximal policy optimization (PPO)
      • handled by `slippi_ai/rl/run.py` for single-agent training and `slippi_ai/rl/train_two.py` for simultaneous two-agent training scenarios
  • Tech Stack and Dependencies

    • DL: TensorFlow Probability for NN training and inference
    • NN: DeepMind Sonnet for high-level network architecture
    • DATA: Pandas + PyArrow + Parquet for replay parsing and dataset manipulation
    • TELEMETRY: Wandb for training metrics and model versioning
    • DISTRIBUTED: Ray for scalable evaluation and training
    • EMULATOR: libmelee for Dolphin emulator communication
    • CONFIG: fancyflags for CLI-arg mgmt

Key Entry Points

Training

  • scripts/train.py: imitation learning from replay data
  • slippi_ai/rl/run.py: single-agent reinforcement learning
  • slippi_ai/rl/train_two.py: two-agent simultaneous training

Evaluation

  • scripts/eval_two.py: local-agent evaluation and human play
  • scripts/run_evaluator.py: batch evaluation with statistical analysis
  • scripts/netplay.py: online play

Data processing

  • slippi_db/parse_local.py

A Walk Through the Code

Stepping through the processes

parsing local slippi replays

  • Download the compressed ranked replays

    • 75-125 GB ≈ 120k-170k replays

    • I sampled 3,300 replays for an initial sanity test

  • Now we step through parselocal

    • expects to be supplied an organized "root" dir:
      • includes Root/
        • Raw/, raw.json, Parsed/, parsed.pkl, meta.json
      • Raw contains .zip/.7z archives of .slp files
      • raw.json file contains info about each raw archive
        • whether it's been processed; once processed, the archive is removed to save space
      • Parsed dir populated by this script w/ a Parquet file for each .slp file
        • these files are named by the MD5 hash of .slp file
          • and are used by imitation learning
      • parsed.pkl pickle file contains metadata about each processed .slp in Parsed
      • meta.json is created by scripts/make_local_dataset
        • and used by imitation learning to know which files to train on
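The expected root layout can be sketched as a small helper that builds an empty skeleton. This is purely illustrative scaffolding (the helper itself is hypothetical, not part of the repo); the names come from the notes above, and meta.json is deliberately omitted since it is produced later by the dataset-making script.

```python
import json
import pickle
from pathlib import Path

def make_empty_root(root: str) -> Path:
    """Create the directory skeleton that parse_local expects (hypothetical helper).

    Layout per the notes: Raw/ holds .zip/.7z archives of .slp files;
    raw.json tracks per-archive processing status; Parsed/ holds one
    Parquet file per replay (named by the replay's MD5); parsed.pkl
    holds per-replay metadata rows.
    """
    root = Path(root)
    (root / "Raw").mkdir(parents=True, exist_ok=True)
    (root / "Parsed").mkdir(exist_ok=True)
    (root / "raw.json").write_text(json.dumps({}))  # archive name -> status
    with open(root / "parsed.pkl", "wb") as f:
        pickle.dump([], f)  # empty list of per-replay metadata
    return root
```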

Dependencies

concurrent.futures, json, os, pickle
from absl import app, flags
tqdm, peppi_py
from slippi_db import parse_peppi, preprocessing, utils, parsing_utils

functions

  • parse_slp(file, output_dir, tmp_dir, compression, compression_level)
    • result = dict(name=file.name)
    • utils.md5
      • result.update(slp_md5=md5, slp_size=len(slp_bytes))
    • game = peppi_py.read_slippi
      • metadata = preprocessing.get_metadata(game)
      • is_training, reason = preprocessing.is_training_replay(metadata)
        • result.update(metadata)
        • result.update(valid=True, is_training=is_training, not_training_reason=reason)
    • if is_training
      • game = parse_peppi.from_peppi(game)
        • game_bytes = parsing_utils.convert_game(game, compression=compression, compression_level=compression_level)
        • result.update(pq_size=len(game_bytes), compression=compression.value)
        • with open(…, 'wb') as f
          • f.write(game_bytes)
    • return result
  • parse_files
  • parse_chunk
  • parse_7zs
  • run_parsing

Steps

  • standardized directory hierarchy under a root directory
  • multi-threaded in-memory extraction and parsing
    • derive qualities and filter candidate replays
      • exclude bad AI
      • damage threshold
      • winner detection
      • match deduplication
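The filtering criteria above can be sketched as a predicate plus a dedup pass. The field names and thresholds here are made up for illustration; the real predicates live in slippi_db.preprocessing.

```python
def is_training_replay(meta: dict, min_damage: float = 100.0):
    """Return (keep, reason) for a candidate replay.

    Hypothetical heuristics mirroring the notes: drop replays with
    CPU/bot players, require some minimum damage dealt, and require a
    detectable winner.
    """
    if meta.get("has_cpu_player"):
        return False, "cpu_player"
    if meta.get("total_damage", 0.0) < min_damage:
        return False, "low_damage"
    if meta.get("winner") is None:
        return False, "no_winner"
    return True, None

def dedupe(metas):
    """Match deduplication: keep the first replay seen per match id."""
    seen, kept = set(), []
    for m in metas:
        if m["match_id"] not in seen:
            seen.add(m["match_id"])
            kept.append(m)
    return kept
```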

Training(s)

imitation learning

  • create experiment dir
    • loads/restores checkpoints
  • build train/test data sources from replay files
  • create policy network and value function
  • alternates between training steps and eval
  • saves best models based on evaluation loss
  • walk through the code

    • trainlib.train requires the Config struct
    • Configuration

      struct Config
          runtime::RuntimeConfig
          dataset::DatasetConfig
          data::DataConfig
          observation::ObservationConfig
          learner::LearnerConfig
          network::NetworkConfig
          controller_head::ControllerHeadConfig
          embed::EmbedConfig
          policy::PolicyConfig
          value_function::ValueFunctionConfig
          max_names::Integer
          expt_root
          expt_dir
          tag
          restore_pickle
          tested
          version::Integer
      end
      

      RuntimeConfig

      • max runtime in seconds
      • interval in seconds between logging
      • interval in seconds between saving to disk
      • number of training steps between evaluations
      • number of batches per evaluation

      DatasetConfig

      • data directory for the parsed peppi DB
      • metadata path for chunked data
      • test ratio for splitting up training data
      • allowed smash characters
      • allowed smash opponents
      • allowed player names
      • banned player names
      • yield swapped versions of each replay
      • mirror left/right in each replay
      • seed

      DataConfig

      • training batch size
      • unroll length
      • damage ratio
      • compressed
      • number of workers
      • balance characters bool

      ObservationConfig

      • animation::AnimationConfig

        AnimationConfig
        • mask::Boolean

      LearnerConfig

      • learning rate::Float
      • compile::Boolean
      • jitcompile::Boolean
      • decay rate::Float
      • value cost::Float
      • reward halflife::Float
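The reward halflife field presumably controls the per-frame discount; here is a sketch of that conversion, assuming Melee's 60 fps. The function name and exact formula are assumptions, not the repo's code.

```python
def discount_from_halflife(halflife_seconds: float, fps: int = 60) -> float:
    """Per-frame discount gamma such that a reward loses half its
    weight after `halflife_seconds` of game time."""
    return 0.5 ** (1.0 / (halflife_seconds * fps))

# e.g. a 2-second halflife gives a gamma just below 1
gamma = discount_from_halflife(2.0)
```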

      NetworkConfig

      • name='mlp'
      • mlp=MLP.config
      • lstm=LSTM.config
      • gru=GRU.config
      • res_lstm=DeepResLSTM.config
      • tx_like=TransformerLike.config

      ControllerHeadConfig

      • independent=Independent
        • models each component of the controller independently
      • autoregressive=AutoRegressive
        • samples components sequentially conditioned on past samples
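The difference between the two heads can be sketched with plain functions standing in for the per-component distributions. All names here are illustrative, not the repo's API.

```python
# A Melee controller has several components (main stick, c-stick,
# buttons, shoulder). Each sampler maps (state, previous samples) -> value.

def independent_head(samplers, state):
    """Sample every component from its own distribution, ignoring the others."""
    return {name: s(state, {}) for name, s in samplers.items()}

def autoregressive_head(samplers, state):
    """Sample components in a fixed order, conditioning each on the
    components already sampled (e.g. button choice can depend on the stick)."""
    sampled = {}
    for name, s in samplers.items():  # dict insertion order = sampling order
        sampled[name] = s(state, dict(sampled))
    return sampled
```

With deterministic toy samplers, the autoregressive head sees earlier samples while the independent head does not, which is the whole distinction.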

      EmbedConfig

      • player::PlayerConfig
      • controller::ControllerConfig
      • randall::Bool
      • fountain_of_dreams::Bool
      • items::ItemsConfig

        PlayerConfig
        • xy scale
        • shield scale
        • speed scale
        • with speeds::Bool
        • with controller::Bool
        • with nana::Bool
        • legacy jumps left::Bool

        ControllerConfig
        • axis spacing
        • shoulder spacing

        ItemsConfig
        • type::ItemsType (SKIP, FLAT, or MLP)
        • mlp sizes::Tuple{Int}

      PolicyConfig

      • train value head::Bool
      • delay::Integer

      ValueFunctionConfig

      • train separate network::Bool
      • separate network config::Bool
      • network::NetworkConfig
    • Train

      • setup Wandb for logging

      • attempt to restore parameters using our pickle file

      • lots of config validation checks

      • create data sources for training and testing

        • setup TrainManager and TestManager

        TrainManager

        • Learner

        • DataSource

        • step kwargs

        • prefetch = 16
        • data_profiler, step_profiler
        • frames_queue = queue.Queue(maxsize=prefetch)
        • stop_requested = threading.Event()
        • data_thread = threading.Thread(target=self.produce_frames)

          produce_frames(self): "used to produce tensors from frames"; loops while stop is not requested:

          • batch, epoch = next(self.data_source)
            • frames = batch.frames
          • frames = frames.replace(state_action=self.learner.policy.embed_state_action.from_state(frames.state_action))
            • frames = utils.map_nt(tf.convert_to_tensor, frames)
          • data = (batch, epoch, frames)
            • self.frames_queue.put(data)

          stop(self): self.stop_requested.set(); self.data_thread.join()

        step(self, compiled): "get the next frames in the queue as input for batch training"

        • batch, epoch, frames = self.frames_queue.get()
        • stats, self.hidden_state = self.learner.step(frames, self.hidden_state, compiled, **kwargs)
        • stats.update(epoch)
        • return stats, batch
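The TrainManager's prefetch loop is a classic producer/consumer pattern; below is a toy, runnable sketch with the Learner and TF details elided. The class and its names are illustrative only, not the repo's API.

```python
import queue
import threading

class PrefetchManager:
    """Toy version of the frames-prefetch thread: a producer converts
    batches on a background thread while the trainer consumes them."""

    def __init__(self, data_source, prefetch: int = 16):
        self.data_source = data_source
        self.frames_queue = queue.Queue(maxsize=prefetch)
        self.stop_requested = threading.Event()
        self.data_thread = threading.Thread(target=self._produce, daemon=True)
        self.data_thread.start()

    def _produce(self):
        while not self.stop_requested.is_set():
            try:
                batch = next(self.data_source)
            except StopIteration:
                break
            self.frames_queue.put(batch)  # blocks when the queue is full

    def step(self):
        """Consume the next prefetched batch (blocks until one is ready)."""
        return self.frames_queue.get()

    def stop(self):
        self.stop_requested.set()
        # Drain so the producer is never stuck on a full queue.
        while not self.frames_queue.empty():
            self.frames_queue.get_nowait()
        self.data_thread.join(timeout=1.0)
```

The bounded queue is the key design choice: it lets data conversion overlap with training steps while capping memory at `prefetch` batches.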

        inline funcs

        • get_tf_state
        • set_tf_state
        • save
        • maybe_log (do a test step and log both train and test stats)
        • maybe_eval

reinforcement learning

  • load imitation-trained policy as teacher
  • environment setup for dolphin emulator instances
  • actor-learner separates rollout collection from learning
  • performs policy gradient updates
    • with KL constraints
  • self-play: update opponent with current policy
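The loss structure described above can be sketched on scalars, with no autodiff: a clipped policy-gradient term plus a KL penalty that keeps the agent near the imitation-trained teacher. Coefficients and names are hypothetical.

```python
import math

def ppo_kl_loss(logp_new, logp_old, advantage, kl_to_teacher,
                clip_eps=0.2, kl_coef=0.1):
    """Scalar PPO objective with a teacher-KL penalty (illustrative).

    ratio is the importance weight of the updated policy against the
    rollout policy; the clip keeps updates conservative, and the KL
    term keeps the agent close to the human-like stage-1 policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    pg = -min(ratio * advantage, clipped * advantage)
    return pg + kl_coef * kl_to_teacher
```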

Q-learning

  • create sample policy and Q-policy
  • initialize values and Q-value networks
  • joint training: alternate between policy imitation and Q-learning
  • action sampling: use sample policy to generate action candidates
  • Q-policy updates: trains policy to select actions maximizing Q-values
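The action-sampling step can be sketched as: draw candidates from the sample policy, score each with Q, keep the argmax. Both functions here are hypothetical stand-ins.

```python
def select_action(sample_policy, q_fn, state, num_candidates=8):
    """Sample candidate actions and return the one with the highest Q-value.

    sample_policy(state) -> action; q_fn(state, action) -> float.
    """
    candidates = [sample_policy(state) for _ in range(num_candidates)]
    return max(candidates, key=lambda a: q_fn(state, a))
```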

Steps

  • raw slippi replays (post meta-extraction)
  • data source data library/module
  • TrainManager for IL and LearnerManager for RL, and Learner for Q-Learning
  • the training system produces trained policy instances
    • it uses a hierarchical configuration approach with dataclasses that can be overridden via CLI flags
      • each training system has its own top-level config class that composes various specialized config components
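The hierarchical-config idea can be sketched with stdlib dataclasses; the real code composes such classes and exposes them through fancyflags, and the field names below are illustrative only.

```python
from dataclasses import dataclass, field, replace

@dataclass
class LearnerConfig:
    learning_rate: float = 1e-4
    value_cost: float = 0.5

@dataclass
class DataConfig:
    batch_size: int = 32
    unroll_length: int = 64

@dataclass
class Config:
    # Top-level config composes the specialized components.
    learner: LearnerConfig = field(default_factory=LearnerConfig)
    data: DataConfig = field(default_factory=DataConfig)
    tag: str = "default"

# A CLI flag like --config.learner.learning_rate=3e-4 amounts to a
# nested override of the composed defaults:
cfg = Config()
cfg = replace(cfg, learner=replace(cfg.learner, learning_rate=3e-4))
```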

Agent(s)

  • the agent system manages multiple types of agents with different execution models
    • handling state synchronization and delayed inference
  • provides infra for managing AI agents during evaluation, gameplay, and real-time interaction with the Dolphin emulator
    • handles agent instantiation, asynchronous inference, delay simulation, and controller output management
  • built around a hierarchy of agent classes that provide different levels of functionality and performance optimization

Basic Agent

provides the fundamental agent functionality by wrapping a `Policy` and tracking recurrent hidden state across timesteps

  • policy integration wraps policy for inference

  • state management tracks hidden_state and prev_controller

  • batching support handles batched inference

  • compilation: tf.function JIT compilation

    • game embedding && needs_reset -> BasicAgent.step()
      • embed state action && hidden_state -> policy.sample() -> updated hidden_state
        • SampleOutputs -> controller state & logits

Delayed Agent/Async Delayed Agent

implements delay simulation to model realistic timing constraints between input perception and controller output; DelayedAgent uses a PeekableQueue to buffer outputs and simulate processing delay

  • run synchronously
    • game state -> DelayedAgent.push() -> BasicAgent.step() -> output_queue.put() <- initial queue fill <- dummy_sample_outputs

    • DelayedAgent.pop() -> output_queue.get() -> SampleOutputs

    • batch_steps > 0 -> multistep batching -> input_queue -> BasicAgent.step()
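The delay mechanism can be sketched with a plain deque pre-filled with dummy outputs: pops succeed immediately, but each real output only surfaces `delay` frames after it was pushed. This is a simplified stand-in for the PeekableQueue, with hypothetical names.

```python
from collections import deque

class DelayQueue:
    """Buffer that returns each pushed output `delay` pops later,
    modeling the gap between perceiving a frame and acting on it."""

    def __init__(self, delay: int, dummy=None):
        # Initial fill of dummy outputs, so early frames still get actions.
        self.buffer = deque([dummy] * delay)

    def push(self, output):
        self.buffer.append(output)

    def pop(self):
        return self.buffer.popleft()
```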

AsyncDelayedAgent runs inference on a separate thread using threading pools and queues

  • runs asynchronously
    • worker thread for multi-threading
    • state queue for input queue for game states
    • output queue for buffered controller outputs
    • context manager for lifecycle: start, stop, run

Dolphin Integration Agent

  • the Agent class provides the highest level interface for interacting with dolphin emulator instances

    Agent-to-Dolphin integration flow: melee.GameState -> Agent.step() -> get_game() -> DelayedAgent.step() -> SampleOutputs -> embed_controller.decode() -> send_controller -> melee.Controller

    name_codes -> name management -> NameChangeMode -> FIXED/CYCLE/RANDOM

Agent Factory Functions

factory functions for creating appropriately configured agents

  • build_delayed_agent() creates delayed agents w/ automatic name resolution and config
  • build_agent() creates fully-configured agent instances for Dolphin interaction
    • opponent port, agent nametag, melee controller instance, saved agent state

Evaluation system

Evaluation system uses RolloutWorker and Evaluator classes to orchestrate agent execution across multiple envs

rollout() method coordinates between agents and environments to collect structured trajectory data

  • states: game states over time, shape [T+1, B]
  • actions: controller outputs, shape [T+1, B]
  • rewards: computed rewards, shape [T, B]
  • is_resetting: reset flags, shape [T+1, B]
  • initial_state: agent's initial hidden state, shape [B]
  • delayed_actions: buffered future actions, shape [D, B]

Distributed Evaluation

RayEvaluator extends evaluation capabilities across multiple workers

RayEvaluator -> RayRolloutWorker.remote() -> Worker 1, 2, … N
update_variables() -> parameter sync -> Worker 1, 2, … N
rollout() -> ray.get() -> merge results -> aggregated metrics
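The scatter-gather pattern above can be sketched with concurrent.futures standing in for Ray (`submit`/`result` playing the role of `RayRolloutWorker.remote()`/`ray.get()`); the worker body and the mean-merge are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def rollout_worker(worker_id, params):
    """Stand-in for a remote rollout worker: returns per-worker metrics."""
    return {"worker": worker_id, "reward": params["scale"] * worker_id}

def evaluate(num_workers, params):
    """Broadcast parameters, run rollouts in parallel, merge the results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Scatter: one rollout per worker (analogous to .remote() calls).
        futures = [pool.submit(rollout_worker, i, params)
                   for i in range(num_workers)]
        # Gather: block on all results (analogous to ray.get()).
        results = [f.result() for f in futures]
    return {"mean_reward": sum(r["reward"] for r in results) / len(results)}
```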