A Physical Reasoning Benchmark
for Robot Learning and Planning

Yixuan Huang*,1 Bowen Li*,2 Vaibhav Saxena*,3 Yichao Liang4 Utkarsh Mishra3 Liang Ji1 Lihan Zha1 Jimmy Wu5 Nishanth Kumar6 Sebastian Scherer2 Danfei Xu3,5 Tom Silver1

1Princeton 2CMU 3Georgia Tech 4Cambridge 5NVIDIA 6MIT

*Equal contribution.

Robotics: Science and Systems (RSS), 2026

tl;dr

KinDER is a stress test for Kinematic and Dynamic Embodied Reasoning, covering spatial relations, nonprehensile multi-object manipulation, combinatorial geometric constraints, tool use, and dynamic constraints. We evaluate imitation learning (diffusion policies, VLAs), reinforcement learning (PPO, SAC, MBRL), foundation-model-based planning (LLM, VLM), diffusion-based planning (GSC), and task and motion planning (bilevel planning). No method passes. Physical reasoning remains a core open challenge for robotics.

KinDER Overview

About KinDER

A physical reasoning benchmark for robot learning and planning.

Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a Gymnasium-compatible Python library with parameterized skills and demonstrations, and a standardized evaluation suite with 13 implemented baselines spanning task and motion planning, imitation learning, reinforcement learning, and foundation-model-based approaches. The environments are designed to isolate five core physical reasoning challenges: basic spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints, disentangled from perception, language understanding, and application-specific complexity. Empirical evaluation shows that existing methods struggle to solve many of the environments, indicating substantial gaps in current approaches to physical reasoning. We additionally include real-to-sim-to-real experiments on a mobile manipulator to assess the correspondence between simulation and real-world physical interaction. KinDER is fully open-sourced and intended to enable systematic comparison across diverse paradigms for advancing physical reasoning in robotics.

For Reinforcement Learning

Environments have long horizons and sparse rewards. Users are welcome to engineer dense rewards, but doing so may be nontrivial. Environments also have very diverse task distributions, so learned policies must generalize.
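One common way to engineer a dense reward without changing the optimal policy is potential-based shaping. A minimal stdlib sketch, using a toy stand-in environment (the `StubGoalEnv` and the distance-based potential are illustrative assumptions, not part of KinDER):

```python
class StubGoalEnv:
    """Toy 1-D goal-reaching env with a sparse reward (illustrative stand-in)."""
    def __init__(self, goal=5.0):
        self.goal, self.pos = goal, 0.0
    def reset(self):
        self.pos = 0.0
        return self.pos
    def step(self, action):
        self.pos += action
        done = abs(self.pos - self.goal) < 0.5
        reward = 1.0 if done else 0.0  # sparse: reward only on success
        return self.pos, reward, done

class ShapedRewardWrapper:
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    def __init__(self, env, gamma=0.99):
        self.env, self.gamma = env, gamma
    def _phi(self, pos):
        return -abs(pos - self.env.goal)  # potential: negative distance to goal
    def reset(self):
        self._last = self.env.reset()
        return self._last
    def step(self, action):
        obs, reward, done = self.env.step(action)
        shaped = reward + self.gamma * self._phi(obs) - self._phi(self._last)
        self._last = obs
        return obs, shaped, done

env = ShapedRewardWrapper(StubGoalEnv())
obs = env.reset()
obs, r, done = env.step(1.0)  # moving toward the goal yields a positive shaped reward
```

The same wrapper pattern applies to a KinDER environment, but a useful potential function for long-horizon multi-object tasks is much harder to design, which is the point of the paragraph above.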

For Imitation Learning

Physical reasoning requires understanding physical constraints (e.g., spatial relationships, kinematics, dynamics). Imitating surface-level patterns in demonstrations is not enough to generalize to broad task distributions.

For Vision-Language Models

The physical reasoning required in KinDER is not easy to express in natural language. Spatial reasoning, a subset of physical reasoning, is a known challenge for vision-language models.

For Hierarchical Approaches

Approaches that first decide "what to do" and then "how to do it" -- whether in hierarchical RL, planning, or multi-level foundation models -- run into difficulties when the high-level and low-level decisions are coupled, as they are in many KinDER environments.

For Task and Motion Planning

Models for TAMP are optionally available in the companion kinder-baselines package. They are not guaranteed to solve every environment, especially in the dynamic settings, and planning remains slow when many objects are involved.

For Human Engineers

KinDER features diverse task distributions and long time horizons, making it challenging to hand-engineer solutions, even highly environment-specific ones. Any single task may be straightforward, but designing solutions that generalize across a task distribution is nontrivial.

Installation and Usage

Installation

KinDER supports Python >=3.10, <3.13. The base install includes all environments:

pip install kindergarden  # installs everything

If you only need a subset of environments, the following lightweight alternatives install just the dependencies for that category:

pip install kindergarden[kinematic2d]
pip install kindergarden[dynamic2d]
pip install kindergarden[kinematic3d]
pip install kindergarden[dynamic3d]

For development or installing from source, see the GitHub repository.

Basic Usage (Gym API)

import kinder
kinder.register_all_environments()
env = kinder.make("kinder/Obstruction2D-o3-v0")  # 3 obstructions
obs, info = env.reset()  # procedural generation
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)
img = env.render()

import imageio.v2 as iio
iio.imwrite("obs.png", img)  # open obs.png to see the scene!
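The single `step` above extends naturally to a full episode. Below is a minimal sketch of the standard Gymnasium-style rollout loop, shown with a stub environment so it runs standalone (in practice you would substitute an env from `kinder.make(...)` and sample actions from `env.action_space`):

```python
class StubEnv:
    """Tiny stand-in exposing the 5-tuple Gymnasium step API."""
    def __init__(self, horizon=10):
        self.horizon, self.t = horizon, 0
    def reset(self):
        self.t = 0
        return 0.0, {}  # obs, info
    def step(self, action):
        self.t += 1
        terminated = False                   # task success (never, in this stub)
        truncated = self.t >= self.horizon   # time limit reached
        return float(self.t), 0.0, terminated, truncated, {}

env = StubEnv(horizon=10)
obs, info = env.reset()
total_reward, steps = 0.0, 0
while True:
    action = 0.0  # replace with env.action_space.sample() or a policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
    if terminated or truncated:
        break
```

Note the distinction between `terminated` (the task ended, e.g., success) and `truncated` (a time limit was hit); RL algorithms should bootstrap differently in the two cases.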

All environments in KinDER use object-centric states:

from kinder.envs.kinematic2d.obstruction2d import ObjectCentricObstruction2DEnv
env = ObjectCentricObstruction2DEnv(num_obstructions=3)
obs, _ = env.reset(seed=123)
print(obs.pretty_str())

Here, obs is an ObjectCentricState, and the printout is:

############################################################### STATE ###############################################################
type: crv_robot           x         y    theta    base_radius    arm_joint    arm_length    vacuum    gripper_height    gripper_width
-----------------  --------  --------  -------  -------------  -----------  ------------  --------  ----------------  ---------------
robot              0.885039  0.803795  -1.5708            0.1          0.1           0.2         0              0.07             0.01

type: rectangle           x         y    theta    static    color_r    color_g    color_b    z_order      width     height
-----------------  --------  --------  -------  --------  ---------  ---------  ---------  ---------  ---------  ---------
obstruction0       0.422462  0.100001        0         0       0.75        0.1        0.1        100  0.132224   0.0766399
obstruction1       0.804663  0.100001        0         0       0.75        0.1        0.1        100  0.0805652  0.0955062
obstruction2       0.559246  0.100001        0         0       0.75        0.1        0.1        100  0.12608    0.180172

type: target_block          x         y    theta    static    color_r    color_g    color_b    z_order     width    height
--------------------  -------  --------  -------  --------  ---------  ---------  ---------  ---------  --------  --------
target_block          1.20082  0.100001        0         0   0.501961          0   0.501961        100  0.138302  0.155183

type: target_surface           x    y    theta    static    color_r    color_g    color_b    z_order     width    height
----------------------  --------  ---  -------  --------  ---------  ---------  ---------  ---------  --------  --------
target_surface          0.499675    0        0         1   0.501961          0   0.501961        101  0.180286       0.1
#####################################################################################################################################

For compatibility with baselines, observations are provided as vectors. Convert between vectors and object-centric states:

import kinder
kinder.register_all_environments()
env = kinder.make("kinder/Obstruction2D-o3-v0")
vec_obs, _ = env.reset(seed=123)
object_centric_obs = env.observation_space.devectorize(vec_obs)
recovered_vec_obs = env.observation_space.vectorize(object_centric_obs)

Baselines

We provide implementations of several baselines in the kinder-baselines repository. Installation and usage instructions are available there.

Bilevel Planning

TAMP-style bilevel planning

Domain-Specific Policies

Hand-engineered policies with domain-specific models

Diffusion Policies

Learning from demonstrations

Reinforcement Learning

RL with sparse and shaped rewards

VLA Policies

Finetuning pi-0.5 with demonstrations

LLM & VLM Planning

Large language and vision-language model based planning

KinDERGarden: Environments

KinDER environments are organized into four categories.

Kinematic 2D

Kinematic 3D

Dynamic 2D

Dynamic 3D

KinDERBench: Benchmark and Results

See the plot below for a summary of our main empirical results. See the paper for full details and for results under additional metrics.

Benchmark results

Success rates: mean ± std across 5 seeds and 50 episodes per seed.
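The reported numbers aggregate per-seed success rates. A minimal stdlib sketch of that aggregation, following the protocol above (the per-seed success counts here are made up for illustration):

```python
import statistics

# Hypothetical per-seed results: successes out of 50 episodes, for 5 seeds.
successes_per_seed = [31, 28, 35, 30, 26]
episodes_per_seed = 50

rates = [s / episodes_per_seed for s in successes_per_seed]
mean = statistics.mean(rates)
std = statistics.stdev(rates)  # sample std across the 5 seeds
print(f"{mean:.2f} ± {std:.2f}")  # → 0.60 ± 0.07
```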

Real Robot Validation

Real robot versions of KinDER tasks with real-to-sim-to-real planning.

Citation

If you use KinDER in your research, please cite our RSS 2026 paper:

@inproceedings{huang2026kinder,
  title     = {KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning},
  author    = {Huang, Yixuan and Li, Bowen and Saxena, Vaibhav and Liang, Yichao
               and Mishra, Utkarsh and Ji, Liang and Zha, Lihan and Wu, Jimmy
               and Kumar, Nishanth and Scherer, Sebastian and Xu, Danfei and Silver, Tom},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}

Contact

Questions, feedback, or bug reports? Please open an issue on GitHub: