arXiv preprint arXiv:2605.19976, 2026

RECIPE: Procedural Planning via Grounding in Instructional Video

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

Abstract

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and on every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot.
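As a rough illustration of the verification idea, the sketch below scores a generated step sequence against precomputed transcript-segment embeddings and converts a group of such scores into group-relative advantages, the normalization GRPO uses. It is not the paper's implementation: the function names, the cosine-similarity scoring, and the order-preserving alignment via dynamic programming are all assumptions made for this example, and the embeddings are toy vectors rather than outputs of a real text encoder.

```python
import numpy as np


def _unit(x):
    """Row-normalize embeddings so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def grounding_reward(step_emb, seg_emb):
    """Hypothetical grounding score (an assumption, not the paper's reward).

    Aligns each generated step to a distinct transcript segment so that
    step order matches segment order, maximizing total cosine similarity,
    and returns the mean similarity per step.
    """
    steps, segs = _unit(step_emb), _unit(seg_emb)
    n, m = len(steps), len(segs)
    if n == 0 or n > m:
        return 0.0  # no order-preserving one-to-one alignment exists
    sim = steps @ segs.T
    # dp[i, j]: best total similarity aligning the first i steps
    # to a strictly increasing subset of the first j segments.
    dp = np.full((n + 1, m + 1), -np.inf)
    dp[0, :] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip_segment = dp[i, j - 1]
            use_segment = dp[i - 1, j - 1] + sim[i - 1, j - 1]
            dp[i, j] = max(skip_segment, use_segment)
    return float(dp[n, m] / n)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled plans (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Toy example: three transcript segments with orthogonal embeddings.
segments = np.eye(3)
plan_in_order = segments[[0, 2]]   # steps appear in transcript order
plan_reversed = segments[[2, 0]]   # same steps, wrong temporal order

rewards = [
    grounding_reward(plan_in_order, segments),
    grounding_reward(plan_reversed, segments),
]
advantages = group_relative_advantages(rewards)
```

In this toy case the in-order plan receives a higher reward than the reversed plan, so its advantage is positive and the reversed plan's is negative. Because the segment embeddings can be computed once per video, each reward call reduces to a matrix product and a small dynamic program, which is what makes this kind of check cheap relative to labeling.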

Cite

@article{seminara2026recipe,
  title={RECIPE: Procedural Planning via Grounding in Instructional Video},
  author={Seminara, Luigi and Furnari, Antonino and Torresani, Lorenzo},
  journal={arXiv preprint arXiv:2605.19976},
  year={2026}
}