Julian Bilcke committed
Commit · bd0b128
1 Parent(s): 52addc5

update research

how-frames-work.md CHANGED (+71 -30)
@@ -133,36 +133,77 @@ This compression strategy:
2. **Pools temporal information**: Averages remaining frames
3. **Maintains continuity**: Ensures smooth transitions

## Case Study: Using 17 Frames Instead of 33

While the model is trained on 33-frame chunks, it can in principle be adapted to use 17 frames, which spans half as many frame intervals (16 instead of 32) and maintains VAE compatibility:

### 1. Why 17 Frames Works with VAE

17 frames is actually compatible with both VAE architectures:

- **884 VAE** (4:1 temporal compression):
  - Formula: (17-1)/4 + 1 = 5 latent frames ✓
  - Clean division ensures proper encoding/decoding

- **888 VAE** (8:1 temporal compression):
  - Formula: (17-1)/8 + 1 = 3 latent frames ✓
  - Also divides cleanly for proper compression
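
The formula above, (frames - 1) / factor + 1, can be checked with a short sketch. The function name `latent_frames` is hypothetical and not taken from the repository; the temporal factors 4 and 8 come from the 884 and 888 VAE descriptions in the text.

```python
def latent_frames(num_frames: int, temporal_factor: int) -> int:
    """Return the latent frame count for a pixel-space frame count.

    Hypothetical helper: raises ValueError when (num_frames - 1) is not
    divisible by the temporal compression factor.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError(
            f"{num_frames} frames do not divide cleanly by {temporal_factor}"
        )
    return (num_frames - 1) // temporal_factor + 1


# Compare the 17-frame proposal with the 33-frame default for both VAEs.
for vae_name, factor in (("884", 4), ("888", 8)):
    print(vae_name, latent_frames(17, factor), latent_frames(33, factor))
```

Raising an error on a non-integer result, rather than rounding, keeps an incompatible frame count from silently producing a mismatched latent shape.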

### 2. Required Code Modifications

To implement 17-frame generation, you would need to update:

#### a. Core Frame Configuration
- **app.py**: Change `args.sample_n_frames = 17`
- **ActionToPoseFromID**: Update `duration=17` parameter
- **sample_inference.py**: Adjust `target_length` calculations:
```python
if is_image:
    target_length = 18  # 17 generated + 1 initial
else:
    target_length = 34  # 2 × 17 for video continuation
```

#### b. RoPE Embeddings
- For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
- For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)
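
The `target_length` and RoPE values above follow one pattern, sketched below as a consistency check. The function names and the constant `ROPE_ALIGNMENT_PAD` are hypothetical and do not appear in the repository; the "+ 1 initial frame", "2 × chunk", and "+ 3 for alignment" rules are taken from the text.

```python
ROPE_ALIGNMENT_PAD = 3  # assumption: the "+ 3 for alignment" term quoted in the text


def target_length(chunk_frames: int, is_image: bool) -> int:
    """Image-to-video: chunk + 1 initial frame. Video continuation: 2 x chunk."""
    return chunk_frames + 1 if is_image else 2 * chunk_frames


def rope_length(chunk_frames: int, is_image: bool) -> int:
    """RoPE size as target_length plus the alignment padding."""
    return target_length(chunk_frames, is_image) + ROPE_ALIGNMENT_PAD


# Print the values for the 33-frame default and the 17-frame proposal.
for chunk in (33, 17):
    for is_image in (True, False):
        mode = "image-to-video" if is_image else "video-to-video"
        print(chunk, mode, target_length(chunk, is_image), rope_length(chunk, is_image))
```

With a 33-frame chunk this reproduces the existing RoPE sizes 37 and 69, and with 17 frames it gives 21 and 37, matching the list above.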

#### c. CameraNet Compression
Update the frame count checks in `cameranet.py`:
```python
# Support both 33- and 17-frame modes (34 = 33 + 1, 18 = 17 + 1)
if x.shape[-1] in (34, 18):
    ...  # Adjust compression logic for shorter sequences
```

### 3. Trade-offs and Considerations

**Advantages of 17 frames:**
- **Reduced memory usage**: ~48% fewer frames per chunk (17 vs 33), which lowers frame-dependent VRAM use accordingly
- **Faster generation**: Shorter sequences process faster
- **More responsive**: Actions complete in 0.68 seconds vs 1.32 seconds (17 vs 33 frames, assuming 25 fps)

**Disadvantages:**
- **Quality degradation**: The model wasn't trained on 17-frame chunks
- **Choppy motion**: Less temporal information for smooth transitions
- **Action granularity**: Shorter actions may feel abrupt
- **Potential artifacts**: VAE and attention patterns are optimized for 33 frames
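
The duration and reduction figures in the advantages list can be reproduced with simple arithmetic. This is a minimal sketch: `chunk_stats` is a hypothetical helper, and the 25 fps default is an assumption inferred from the 0.68 s and 1.32 s figures, not a value confirmed by the repository.

```python
def chunk_stats(frames: int, baseline_frames: int = 33, fps: float = 25.0):
    """Return (duration in seconds, fractional frame reduction vs. baseline).

    fps=25.0 is an assumed playback rate, not read from the project config.
    """
    duration_s = frames / fps
    frame_reduction = 1.0 - frames / baseline_frames
    return duration_s, frame_reduction


for n in (17, 33):
    duration, reduction = chunk_stats(n)
    print(f"{n} frames: {duration:.2f} s per action, {reduction:.1%} fewer frames than 33")
```

The reduction is measured in frames per chunk. Actual VRAM savings depend on how much memory scales with sequence length, so they should be measured rather than read off this ratio.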

### 4. Why Other Frame Counts Are Problematic

Not all frame counts satisfy the VAE constraint that (frames - 1) be divisible by the temporal compression factor:
- **18 frames**: ❌ (18-1)/4 = 4.25 (not an integer for 884 VAE)
- **19 frames**: ❌ (19-1)/4 = 4.5 (not an integer)
- **20 frames**: ❌ (20-1)/4 = 4.75 (not an integer)
- **21 frames**: ✓ Works with 884 VAE only (6 latent frames); (21-1)/8 = 2.5 is not an integer for 888 VAE
- **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)
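
The divisibility rule behind this list can be used to enumerate every compatible frame count in a range. The sketch below uses a hypothetical helper, `compatible_frame_counts`, with the 4 and 8 temporal factors taken from the 884 and 888 VAE descriptions.

```python
def compatible_frame_counts(start: int, stop: int, factors=(4, 8)) -> list[int]:
    """List frame counts n in [start, stop] where (n - 1) divides by every factor."""
    return [
        n
        for n in range(start, stop + 1)
        if all((n - 1) % factor == 0 for factor in factors)
    ]


# Frame counts between 17 and 33 that work with both VAEs, then with 884 only.
print(compatible_frame_counts(17, 33))
print(compatible_frame_counts(17, 33, factors=(4,)))
```

Between 17 and 33, only 17, 25, and 33 satisfy both VAEs; relaxing to the 884 VAE alone also admits 21 and 29.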

### 5. Implementation Note

While technically possible, using 17 frames would require:
1. **Extensive testing**: Verify quality and temporal consistency
2. **Possible fine-tuning**: The model may need adaptation for optimal results
3. **Adjustment of action speeds**: Camera movements calibrated for 33 frames
4. **Modified training strategy**: If fine-tuning, adjust hybrid history ratios

## Recommendations for Frame Count Modification