Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

University of Pennsylvania

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to imitate these raw trajectories directly causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, the Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 13 robotic manipulation tasks from 4 benchmarks, FGO improves action smoothness and temporal consistency while preserving the fine-grained details necessary for successful task execution.
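The core idea, projecting intermediate samples onto progressively wider spectral bands during reverse diffusion, can be illustrated with a short sketch. The code below is a minimal, hypothetical rendering of this concept, not the published operator: it assumes a low-pass DFT projection along the time axis and a linear band-expansion schedule, and the `denoise_step` callable, the `cutoff_ratio` range, and its endpoints are illustrative placeholders.

```python
import numpy as np

def lowpass_project(actions, cutoff_ratio):
    """Project a (T, D) action trajectory onto a sub-frequency manifold by
    zeroing all DFT coefficients above a cutoff along the time axis."""
    T = actions.shape[0]
    coeffs = np.fft.rfft(actions, axis=0)      # (T//2 + 1, D) spectrum
    keep = max(1, int(np.ceil(cutoff_ratio * coeffs.shape[0])))
    coeffs[keep:] = 0.0                        # discard the high-frequency band
    return np.fft.irfft(coeffs, n=T, axis=0)

def frequency_guided_denoise(x_T, denoise_step, num_steps):
    """Generic reverse-diffusion loop that constrains each intermediate
    sample to an expanding spectral band (low frequencies first)."""
    x = x_T
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)                 # one ordinary denoising update
        progress = 1.0 - t / num_steps         # ~0 at the start, 1 at the end
        cutoff = 0.1 + 0.9 * progress          # assumed linear band expansion
        x = lowpass_project(x, cutoff)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_T = rng.standard_normal((16, 7))         # 16-step, 7-DoF action chunk
    identity_step = lambda x, t: x             # stand-in for a trained denoiser
    print(frequency_guided_denoise(x_T, identity_step, num_steps=10).shape)
```

Early steps are thus restricted to low-frequency (coarse, smooth) motion, and fine-grained high-frequency detail is only admitted as denoising progresses, which matches the abstract's description of traversing sub-frequency manifolds with expanding spectral bands.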


Simulated Environments & Tasks

Benchmark Results
| Method | Lift | Stack | Can | Square | Three Piece Assembly | Stack Three | Average |
|---|---|---|---|---|---|---|---|
| DP3 | 88.7±4.2 | 72.0±2.0 | 64.7±1.2 | 36.7±1.2 | 35.3±6.4 | 20.0±3.5 | 52.9 |
| DiT-Policy | 90.7±4.2 | 68.7±7.6 | 64.7±3.1 | 34.7±2.3 | 37.3±7.6 | 18.7±5.0 | 52.5 |
| FreqPolicy | 89.3±1.2 | 71.3±1.2 | 63.3±2.3 | 36.0±3.5 | 27.3±8.1 | 22.0±4.0 | 51.5 |
| FGO (Ours) | 92.7±3.1 | 79.3±3.1 | 66.0±0.0 | 36.7±3.1 | 39.3±7.0 | 25.3±3.1 | 56.6 |

Table 1: Comparison of success rates (%) on the Robosuite and MimicGen benchmarks (Lift, Stack, Can, and Square are Robosuite tasks; Three Piece Assembly and Stack Three are MimicGen tasks). Results are mean±std over 3 training seeds.

| Method | Hammer | Door | Pen | Laptop | Toilet | Faucet | Bucket | Average |
|---|---|---|---|---|---|---|---|---|
| DP3 | 100.0±0.0 | 61.3±7.6 | 46.0±5.3 | 77.3±5.0 | 60.7±4.2 | 21.3±4.2 | 24.7±2.3 | 55.9 |
| DiT-Policy | 100.0±0.0 | 63.3±7.6 | 52.0±2.0 | 75.3±3.1 | 63.3±5.0 | 20.7±3.1 | 19.3±3.1 | 56.3 |
| FreqPolicy | 98.7±1.2 | 68.0±3.5 | 52.0±3.5 | 78.0±8.0 | 58.7±4.6 | 20.7±5.0 | 18.7±3.1 | 56.4 |
| FGO (Ours) | 100.0±0.0 | 69.3±2.3 | 55.3±1.2 | 81.3±6.4 | 66.7±1.2 | 24.0±3.5 | 25.3±2.3 | 60.3 |

Table 2: Comparison of success rates (%) on the Adroit and DexArt benchmarks (Hammer, Door, and Pen are Adroit tasks; Laptop, Toilet, Faucet, and Bucket are DexArt tasks). Results are mean±std over 3 training seeds.

| Method | ATV ↓ (×10⁻³ rad/s) | JerkRMS ↓ (rad/s³) | Training Time ↓ (GPU h) | Inference Latency ↓ (ms) |
|---|---|---|---|---|
| DP3 | 14.83±0.17 | 50.87±1.27 | 0.47 | 39.49 |
| DiT-Policy | 14.84±0.22 | 51.01±1.16 | 0.42 | 17.20 |
| FreqPolicy | 15.25±0.39 | 46.91±1.58 | 0.35 | 33.49 |
| FGO (Ours) | 14.76±0.17 | 40.79±0.46 | 0.48 | 44.22 |

Table 3: Comparison of Action Total Variation (ATV), JerkRMS, training time, and inference latency.
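For reference, both smoothness metrics can be computed from a recorded action trajectory with finite differences. The sketch below assumes ATV is the mean absolute first difference of the joint-space actions scaled by the control period and that jerk is the third finite difference; the paper's exact definitions and normalization may differ.

```python
import numpy as np

def action_total_variation(actions, dt):
    """Mean absolute first difference of a (T, D) trajectory (rad/s)."""
    diffs = np.abs(np.diff(actions, axis=0)) / dt
    return diffs.mean()

def jerk_rms(actions, dt):
    """Root-mean-square third finite difference of the trajectory (rad/s^3)."""
    jerk = np.diff(actions, n=3, axis=0) / dt**3
    return np.sqrt(np.mean(jerk ** 2))

# A smooth trajectory scores lower on both metrics than a jittery one.
t = np.linspace(0, 2 * np.pi, 200)[:, None]
smooth = np.sin(t)
jittery = smooth + 0.01 * np.random.default_rng(0).standard_normal(smooth.shape)
print(action_total_variation(smooth, 0.01), action_total_variation(jittery, 0.01))
print(jerk_rms(smooth, 0.01), jerk_rms(jittery, 0.01))
```

Note that jerk amplifies high-frequency noise much more strongly than ATV does (it divides by dt³), which is consistent with Table 3, where FGO's largest margin over the baselines is on JerkRMS.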


BibTeX

@article{wang2026fgo,
  author    = {Wang, Junlin},
  title     = {Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal},
  journal   = {arXiv preprint},
  year      = {2026},
}