DSAMbayes Backtest Documentation

This directory contains lightweight documentation for the DSAMbayes-backtest library.

Current scope:

  • single-series time-series backtest workflows,
  • repo-external parity comparison of DSAMbayes and DSAMbayes-Charles-Dev,
  • common holdout scoring, parameter stability, and provisional recommendation stability.

Important limit:

  • the current recommendation layer is backtest-owned and provisional; it is suitable for engineering comparison, but not yet an owner-approved production allocator.

Chapter 1

Getting Started

Core orientation and the minimum steps to inspect the active backtest harness.

Overview

DSAMbayes-backtest is a dedicated repository for a reproducible walk-forward backtest harness for DSAMbayes marketing mix models.

The repo exists to keep the evaluation contract outside the source modelling repositories. It compares:

  • original DSAMbayes
  • DSAMbayes-Charles-Dev

on the same data, folds, and scoring rules.

What The Repo Owns

  • the manifest and replay contract
  • blocked walk-forward fold planning
  • common KPI-scale holdout scoring
  • adjacent-refit stability analytics
  • repo-owned result artifacts and summary tables
  • a short reporting surface for engineering and peer-review use
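
The blocked walk-forward fold planning listed above can be pictured with a minimal sketch. It assumes an expanding training window and fixed-length, non-overlapping holdout blocks. The function name, its parameters, and the expanding-window choice are illustrative assumptions, not the harness's actual planner.

```python
def plan_blocked_folds(n_obs, initial_train, horizon):
    """Plan expanding-window folds with contiguous holdout blocks.

    ASSUMPTION: illustrative only; the real planner may use a different
    window policy. Index ranges are half-open: [start, end).
    """
    if initial_train <= 0 or horizon <= 0:
        raise ValueError("initial_train and horizon must be positive")
    folds = []
    train_end = initial_train
    fold_id = 1
    # Advance the cutoff one holdout block at a time until data runs out.
    while train_end + horizon <= n_obs:
        folds.append({
            "fold_id": fold_id,
            "train": (0, train_end),
            "holdout": (train_end, train_end + horizon),
        })
        train_end += horizon
        fold_id += 1
    return folds


folds = plan_blocked_folds(n_obs=20, initial_train=12, horizon=4)
```

Each holdout block starts exactly where its training window ends, so no holdout observation is ever used to fit the model scored on it.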

Current Scope

The current milestone is intentionally narrow:

  • one single-series time-series pilot at a time
  • one shared comparison contract across both repos
  • one shared external scorer
  • explicit guardrails around lagged and rolling features

Hierarchical / panel backtesting is deferred.

Current Worked Example

The active engineering example is _st, run under an explicit engineering-only scale = FALSE policy. On that example, the repo currently supports:

  • forward holdout scoring
  • parameter stability
  • provisional fixed-budget recommendation stability

The engineering worked example is not the final stakeholder-facing verdict. The intended next substantive use is the UK dataset once it is available.

Quickstart

This is the minimum path to inspect the active backtest harness.

1. Validate The Active Manifest

Rscript scripts/dsambayes-backtest.R validate \
  --manifest .planning/research/pilot_manifest.yaml

This checks the active pilot manifest, dataset contract, repo paths, and fold schedule.

2. Inspect The Planned Run Matrix

Rscript scripts/dsambayes-backtest.R plan \
  --manifest .planning/research/pilot_manifest.yaml

This prints the planned repo-by-fold run surface.
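
As a rough mental model, the repo-by-fold run surface is a cross product of repo targets and fold ids, optionally narrowed the way `--repo-target` and `--fold-id` narrow a run. The sketch below is an assumption about that shape, not the CLI's implementation. The target name `charles_dev` appears in this documentation; `original` is a hypothetical placeholder.

```python
from itertools import product


def build_run_matrix(repo_targets, fold_ids, repo_target=None, fold_id=None):
    """Cross repo targets with fold ids, then apply optional filters.

    ASSUMPTION: hypothetical helper for illustration only.
    """
    matrix = [
        {"repo_target": repo, "fold_id": fold}
        for repo, fold in product(repo_targets, fold_ids)
    ]
    if repo_target is not None:
        matrix = [row for row in matrix if row["repo_target"] == repo_target]
    if fold_id is not None:
        matrix = [row for row in matrix if row["fold_id"] == fold_id]
    return matrix


# "original" is a placeholder name; "charles_dev" is taken from the CLI examples.
full_matrix = build_run_matrix(["original", "charles_dev"], [1, 2, 3])
one_repo = build_run_matrix(["original", "charles_dev"], [1, 2, 3],
                            repo_target="charles_dev")
```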

3. Do A Dry Run

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --dry-run

This writes run-scoped result directories and status artifacts without fitting the external repos.

4. Inspect The Current Worked Example

The completed _st engineering example is under:

results_engineering_m1_st_full/_st/engineering_m1_st_scale_false/
run_id=20260407T211743.943118Z__all-repos__all-folds__live/

Key summary files:

  • summary/holdout_summary.csv
  • summary/parameter_stability_summary.csv
  • summary/recommendation_stability_summary.csv

5. Read The Report

The report on the _st engineering example is the current compact worked example for the library; read it alongside the summary files listed in step 4.

Chapter 2

Reference

Reference material for the current data contract, CLI surface, metrics, and result artifacts.

Data Contract

The backtest harness needs enough information to replay the same model surface consistently across folds and across repos.

Minimum Required Inputs

For a runnable pilot manifest, the repo needs:

  • a weekly source table
  • a date column
  • a KPI / response column
  • the locked model formula
  • priors
  • boundaries
  • repo targets and repo paths
  • fit settings and seed policy

If recommendation stability is in scope, the repo also needs:

  • media spend history or equivalent channel-spend history
  • a declared recommendation contract
  • a channel map from spend inputs to model terms / allocation variables
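
The two input lists above amount to a completeness check on the manifest. A minimal sketch of such a check follows. Every key name in it is a hypothetical stand-in chosen to mirror the bullets; the real manifest schema is not documented here and may differ.

```python
# ASSUMPTION: all key names below are hypothetical, not the real manifest schema.
REQUIRED_FIELDS = (
    "source_table", "date_column", "kpi_column", "formula", "priors",
    "boundaries", "repo_targets", "repo_paths", "fit_settings", "seed_policy",
)
RECOMMENDATION_FIELDS = (
    "spend_history", "recommendation_contract", "channel_map",
)


def missing_manifest_fields(manifest, recommendation_in_scope=False):
    """Return the required keys that are absent or empty in `manifest`."""
    required = list(REQUIRED_FIELDS)
    if recommendation_in_scope:
        required.extend(RECOMMENDATION_FIELDS)
    return [key for key in required if manifest.get(key) in (None, "", [], {})]


# Example: a manifest that sets every base field except the priors.
example = {key: "set" for key in REQUIRED_FIELDS if key != "priors"}
missing_base = missing_manifest_fields(example)
missing_with_reco = missing_manifest_fields(example, recommendation_in_scope=True)
```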

Why The Source Table Matters

The formulas in scope include lagged and rolling terms. That means fold inputs must be rebuilt from source data at each cutoff rather than sliced from a full-sample engineered matrix.
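
A minimal sketch of that rule: truncate the source rows at the fold cutoff first, then engineer the lagged and rolling terms, so that no feature value can depend on post-cutoff data. The function, the column names, and the three-week window are illustrative assumptions, not the harness's actual feature builder.

```python
from statistics import mean


def build_fold_features(rows, cutoff, window=3):
    """Rebuild lag-1 and trailing rolling-mean features for one fold.

    `rows` is a date-sorted list of (iso_date, spend) tuples. ISO date
    strings compare correctly as plain strings.
    ASSUMPTION: names and window length are hypothetical.
    """
    # Truncate BEFORE engineering so nothing after the cutoff is visible.
    train = [(d, s) for d, s in rows if d <= cutoff]
    features = []
    for i, (date, spend) in enumerate(train):
        lag1 = train[i - 1][1] if i >= 1 else None
        start = max(0, i - window + 1)
        roll = mean(s for _, s in train[start:i + 1])
        features.append({
            "date": date,
            "spend": spend,
            "spend_lag1": lag1,
            "spend_roll_mean": roll,
        })
    return features


rows = [("2024-01-01", 10.0), ("2024-01-08", 20.0),
        ("2024-01-15", 30.0), ("2024-01-22", 40.0)]
fold = build_fold_features(rows, cutoff="2024-01-15")
```

A useful property to check on any real feature builder is the one this sketch satisfies: changing rows after the cutoff must leave the fold's features unchanged.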

Tracked Data Packages

The repo keeps smaller GitHub-friendly replication packages under ../data/:

  • _st active engineering pilot
  • _ov reserve candidate
  • _os retained stress fixture

Large reviewed bundles under data_review/ are kept local only.

Current Active Pilot

The active engineering pilot is _st.

The active manifest currently lives in the local planning layer at .planning/research/pilot_manifest.yaml.

Because .planning/ is local-only in the current repo setup, colleagues working from GitHub alone should rely on the tracked replication data, README, docs, and report rather than the local planning spine.

Results And Artifacts

Backtest outputs are written under run-scoped result trees so filtered reruns do not overwrite earlier summaries.

Result Tree Shape

Typical layout:

results.../
  <dataset_id>/
    <comparison_label>/
      run_id=.../
        experiment_manifest.yaml
        fold_manifest.csv
        summary/
          run_status.csv
          holdout_scores.csv
          holdout_summary.csv
          parameter_stability_summary.csv
          recommendation_stability_summary.csv
        repo_target=<repo>/
          fold_id=01/
            run_manifest.yaml
            run_status.json
            fit_payload.rds
            prediction_payload.rds
            holdout_scores.csv
            recommendations.csv
          stability/
            parameter_drift.csv
            parameter_drift_summary.csv
            recommendation_drift.csv
            recommendation_drift_summary.csv

Most Important Summary Files

  • summary/run_status.csv: fold-by-fold execution state.

  • summary/holdout_summary.csv: repo-level forward holdout comparison.

  • summary/parameter_stability_summary.csv: repo-level adjacent-refit parameter drift summary.

  • summary/recommendation_stability_summary.csv: repo-level recommendation stability summary on the current provisional shared recommendation surface.
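
For quick inspection, the four summary files can be loaded from a run directory with the standard library alone. The sketch below relies only on the file names and the summary/ layout shown above; it makes no assumption about column names, and the helper itself is hypothetical rather than part of the harness.

```python
import csv
from pathlib import Path

SUMMARY_FILES = (
    "run_status.csv",
    "holdout_summary.csv",
    "parameter_stability_summary.csv",
    "recommendation_stability_summary.csv",
)


def load_summaries(run_dir):
    """Read each summary CSV that exists under <run_dir>/summary/.

    Returns {file_name: list of row dicts}; missing files are skipped.
    ASSUMPTION: hypothetical helper, not a function the harness provides.
    """
    summary_dir = Path(run_dir) / "summary"
    tables = {}
    for name in SUMMARY_FILES:
        path = summary_dir / name
        if path.is_file():
            with path.open(newline="") as handle:
                tables[name] = list(csv.DictReader(handle))
    return tables
```

Pass the full run-scoped directory (the one whose name starts with run_id=) as `run_dir`. A dry run may not produce every file, which is why missing files are skipped rather than treated as errors.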

Current Worked Example

The active _st engineering batch is:

results_engineering_m1_st_full/_st/engineering_m1_st_scale_false/
run_id=20260407T211743.943118Z__all-repos__all-folds__live/

The holdout and parameter-stability summaries are identical across repos on that example. Recommendation stability is also present, but it should still be treated as provisional.

CLI Reference

The main entrypoint is:

Rscript scripts/dsambayes-backtest.R <command> [options]

validate

Validate a pilot manifest and print the planned run scope.

Rscript scripts/dsambayes-backtest.R validate \
  --manifest .planning/research/pilot_manifest.yaml

plan

Build the run matrix for the active manifest.

Rscript scripts/dsambayes-backtest.R plan \
  --manifest .planning/research/pilot_manifest.yaml

run

Execute a batch or write a dry-run result tree.

Dry run:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --dry-run

Target one repo:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --repo-target charles_dev

Target one fold:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --fold-id 1

Common Options

  • --manifest <path>
  • --repo-target <name>
  • --fold-id <n>
  • --dry-run
  • --results-root <dir>

Current Limitation

The CLI is designed around the current M1 single-series parity surface. It is not yet a general hierarchical / panel backtest runner.

Metrics Reference

Forward Holdout Metrics

  • RMSE: root mean squared error on the observed KPI scale.

  • WMAPE: weighted mean absolute percentage error on the observed KPI scale.

  • Mean error: signed bias on the observed KPI scale.

  • SMAPE: symmetric mean absolute percentage error; a secondary holdout metric on the observed KPI scale.

  • Holdout ELPD / log score: secondary probabilistic diagnostics, reported when compatible posterior outputs are available.
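
The point-forecast metrics above can be written out as a minimal sketch using common textbook definitions. Three choices here are assumptions, since the harness's exact conventions are not documented in this section: WMAPE is taken as total absolute error over total absolute actuals, mean error is signed as predicted minus observed, and SMAPE uses the 2|e| / (|y| + |ŷ|) form.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def wmape(actual, predicted):
    """ASSUMPTION: sum of absolute errors divided by sum of absolute actuals."""
    denom = sum(abs(a) for a in actual)
    if denom == 0:
        raise ValueError("WMAPE is undefined when all actuals are zero")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / denom


def mean_error(actual, predicted):
    """ASSUMPTION: predicted minus observed, so positive means over-forecast."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """ASSUMPTION: mean of 2|e| / (|y| + |y_hat|); 0/0 terms count as 0."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        terms.append(0.0 if denom == 0 else 2.0 * abs(p - a) / denom)
    return sum(terms) / len(terms)


# Toy holdout block on the KPI scale (made-up numbers for illustration).
observed = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 330.0]
```

All four functions take observed values first and predictions second, on the observed KPI scale, which matches the "common KPI-scale holdout scoring" described earlier.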

Stability Metrics

  • standardized_posterior_shift: adjacent-refit coefficient shift scaled by posterior uncertainty.

  • allocation_turnover: 0.5 * sum(abs(w_t - w_{t-1})) across matched channels, where w_t is the recommended allocation at refit t and w_{t-1} is the allocation at the previous refit.

  • marginal_response_rank_corr: Spearman correlation of channel marginal-response ranks across adjacent refits on the shared recommendation surface.

Important note:

  • marginal_response_rank_corr is not ROI. It is a rank comparison on the repo-owned recommendation surface. The metric was deliberately renamed from a previous ROI-style label because the current allocator does not compute true ROI.

Interpretation Guidance

  • Holdout metrics address predictive performance.
  • Parameter stability addresses how much posterior media effects move between adjacent refits.
  • Recommendation stability addresses how much the recommended allocation surface moves between adjacent refits under one controlled comparison scenario.

Recommendation stability in the current repo should still be treated as provisional, because the allocator surface is backtest-owned and not yet an owner-approved production policy.