# SimpleTM
Official implementation of SimpleTM for multivariate time series forecasting.
## motivation
Most modern time series forecasting models rely on complex attention mechanisms or intricate architectural designs that require significant computational resources and careful hyperparameter tuning. SimpleTM takes a different approach: it demonstrates that a straightforward temporal mixing architecture can achieve competitive results while remaining interpretable and easy to train. The model focuses on learning temporal dependencies directly through simple feed-forward operations, making it accessible for both research and production environments.
## architecture
```mermaid
graph LR
A[Input Time Series] --> B[Normalization]
B --> C[Temporal Mixing Block 1]
C --> D[Temporal Mixing Block 2]
D --> E[Temporal Mixing Block N]
E --> F[Linear Projection]
F --> G[Denormalization]
G --> H[Forecast Output]
style C fill:#e1f5ff
style D fill:#e1f5ff
style E fill:#e1f5ff
```

## installation

```bash
pip install simpletm
```

## usage

```python
import torch
from simpletm import SimpleTM

# Initialize the model
model = SimpleTM(
    input_len=96,
    output_len=24,
    n_features=7,
    d_model=512,
    n_blocks=4,
)

# Prepare data: (batch_size, input_len, n_features)
x = torch.randn(32, 96, 7)

# Generate forecast: (batch_size, output_len, n_features)
forecast = model(x)

# Training (random placeholder tensors; replace with your own data)
x_train = torch.randn(32, 96, 7)  # (batch_size, input_len, n_features)
y_train = torch.randn(32, 24, 7)  # (batch_size, output_len, n_features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    output = model(x_train)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()
```

## how it works

SimpleTM processes multivariate time series through a stack of temporal mixing blocks. Each block applies learned transformations along the time dimension while preserving the feature structure. The model first normalizes input sequences using reversible instance normalization to handle distribution shifts. The temporal mixing blocks then extract patterns through linear projections and nonlinear activations applied across time steps. Finally, a projection head maps the processed representation to the target forecast horizon. This design avoids the quadratic complexity of attention while still capturing long-range dependencies through stacked operations.
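The pipeline described above (normalize, mix along time, project to the horizon, denormalize) can be sketched in plain Python. This is an illustrative toy, not the code inside the `simpletm` package: the function names, the residual connection, the ReLU activation, the random weights, and the choice to share weights across features are all assumptions, and the `d_model` hidden dimension is omitted for brevity.

```python
import random
from statistics import fmean, pstdev


def instance_normalize(series, eps=1e-5):
    """Standardize one series with its own mean and standard deviation."""
    mean = fmean(series)
    std = pstdev(series) + eps  # eps guards against constant series
    return [(v - mean) / std for v in series], mean, std


def denormalize(series, mean, std):
    """Invert instance_normalize using the stored statistics."""
    return [v * std + mean for v in series]


def linear_over_time(series, weights):
    """Apply a (len_out x len_in) weight matrix along the time axis."""
    return [sum(w * v for w, v in zip(row, series)) for row in weights]


def mixing_block(series, weights):
    """Linear map over time, ReLU, then a residual connection (assumed)."""
    mixed = linear_over_time(series, weights)
    return [x + max(0.0, m) for x, m in zip(series, mixed)]


def forecast_feature(series, block_weights, head_weights):
    """Run one feature through normalize -> blocks -> head -> denormalize."""
    normed, mean, std = instance_normalize(series)
    hidden = normed
    for weights in block_weights:
        hidden = mixing_block(hidden, weights)
    projected = linear_over_time(hidden, head_weights)
    return denormalize(projected, mean, std)


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_len, output_len, n_blocks = 8, 3, 2
block_weights = [random_matrix(input_len, input_len, rng) for _ in range(n_blocks)]
head_weights = random_matrix(output_len, input_len, rng)

# One series per feature; each feature is processed independently,
# so the feature structure is preserved.
inputs = [
    [float(t) for t in range(input_len)],
    [10.0 + 0.5 * t for t in range(input_len)],
]
forecast = [forecast_feature(s, block_weights, head_weights) for s in inputs]
```

Each mixing step here costs a fixed `input_len x input_len` matrix product per feature, with no pairwise attention scores computed from the data, and the stored mean and standard deviation make the normalization reversible at the output.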
## configuration

Key parameters for model initialization:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_len` | int | required | Length of the input sequence |
| `output_len` | int | required | Length of the forecast horizon |
| `n_features` | int | required | Number of input features |
| `d_model` | int | 512 | Hidden dimension size |
| `n_blocks` | int | 4 | Number of temporal mixing blocks |
| `dropout` | float | 0.1 | Dropout rate |
| `norm_type` | str | `'instance'` | Normalization method (`'instance'`, `'batch'`, or `'none'`) |
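If you want to check settings before building a model, one option is a small dataclass that mirrors the table above. `SimpleTMConfig` is a hypothetical helper written for this example and is not part of the `simpletm` package; the range checks (positive integers, dropout in `[0, 1)`) are assumptions rather than documented constraints.

```python
from dataclasses import asdict, dataclass

VALID_NORM_TYPES = ("instance", "batch", "none")


@dataclass(frozen=True)
class SimpleTMConfig:
    """Hypothetical container mirroring the parameter table."""

    input_len: int
    output_len: int
    n_features: int
    d_model: int = 512
    n_blocks: int = 4
    dropout: float = 0.1
    norm_type: str = "instance"

    def __post_init__(self):
        # Assumed constraint: all size-like parameters are positive integers.
        for name in ("input_len", "output_len", "n_features", "d_model", "n_blocks"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        # Assumed constraint: dropout is a probability below 1.
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if self.norm_type not in VALID_NORM_TYPES:
            raise ValueError(f"norm_type must be one of {VALID_NORM_TYPES}")


config = SimpleTMConfig(input_len=96, output_len=24, n_features=7)
kwargs = asdict(config)
# model = SimpleTM(**kwargs)  # assumes the constructor accepts these keyword names
```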
Example configuration file (`config.yaml`):

```yaml
model:
  d_model: 512
  n_blocks: 4
  dropout: 0.1
  norm_type: instance
training:
  batch_size: 32
  learning_rate: 0.001
  epochs: 100
  weight_decay: 1e-5
```

## faq

Q: How does SimpleTM compare to Transformer-based models?

A: SimpleTM trades the flexibility of attention for simplicity and efficiency. It runs faster and uses less memory while achieving comparable accuracy on many benchmarks. Choose Transformers if you need explicit attention weights or have very long sequences.

Q: Can I use this for univariate forecasting?

A: Yes, set `n_features=1`. The model works with single-variable series.

Q: What context length should I use?

A: Start with 96 or 192 steps. Longer contexts do not always help; the best length depends on your data's temporal patterns, so experiment on your own dataset.
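To compare context lengths, you need (context, target) pairs cut from your series at each candidate length. The helper below is a minimal sketch; `make_windows` and its `stride` argument are hypothetical names for this example, not part of the `simpletm` package.

```python
def make_windows(series, input_len, output_len, stride=1):
    """Split one series into (context, target) pairs of fixed lengths."""
    windows = []
    last_start = len(series) - input_len - output_len
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        context = series[start:split]
        target = series[split:split + output_len]
        windows.append((context, target))
    return windows


series = list(range(10))
pairs = make_windows(series, input_len=4, output_len=2)
# pairs[0] is ([0, 1, 2, 3], [4, 5]); a longer input_len yields fewer pairs.
```

A series shorter than `input_len + output_len` produces no pairs, which is worth keeping in mind when trying longer contexts on small datasets.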

Q: Does this work for irregularly sampled data?

A: No, SimpleTM expects regularly spaced observations. Resample your data to a fixed interval first, or consider interpolation methods.
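One simple way to resample is to average observations into fixed-width time buckets. The sketch below assumes numeric timestamps and a hypothetical helper name, `resample_mean`; buckets with no observations are left as `None` so they can be imputed afterwards.

```python
from collections import defaultdict


def resample_mean(timestamps, values, interval):
    """Average observations into fixed-width buckets; empty buckets are None."""
    if not timestamps:
        return []
    start = min(timestamps)
    buckets = defaultdict(list)
    for t, v in zip(timestamps, values):
        buckets[int((t - start) // interval)].append(v)
    n_buckets = max(buckets) + 1
    return [
        sum(buckets[i]) / len(buckets[i]) if i in buckets else None
        for i in range(n_buckets)
    ]


regular = resample_mean(
    timestamps=[0, 4, 11, 12, 31],
    values=[1.0, 3.0, 5.0, 7.0, 9.0],
    interval=10,
)
```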

Q: How do I handle missing values?

A: Impute before feeding data to the model. Forward fill, linear interpolation, or learned imputation can all work, depending on your use case.
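As a rough illustration of the first two options, here is a plain-Python sketch for a single series in which missing values are marked as `None`. The function names are hypothetical; in practice a dataframe library's built-in fill methods would usually be the more convenient choice.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # leading gaps stay None
    return filled


def linear_interpolate(values):
    """Fill gaps between observed points along a straight line."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # gaps before the first or after the last point stay None


raw = [1.0, None, None, 4.0, None]
ffilled = forward_fill(raw)
interpolated = linear_interpolate(raw)
```

Forward fill also covers trailing gaps, while linear interpolation only fills gaps that have an observed value on both sides.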

## license

MIT