Scaling Instructable Agents Across Many Simulated Worlds
Paper • arXiv:2404.10179
An encoder-decoder model that compresses videos into discrete embeddings (tokens), and a transformer model that translates text embeddings into video tokens.
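
As a rough illustration of the first component, the sketch below shows one common way to turn continuous embeddings into discrete tokens: nearest-neighbour lookup against a codebook (vector quantization). The text does not say how the tokens are actually produced, so this is an assumption. The function names, the toy codebook, and the patch embeddings are all hypothetical, and the transformer that maps text embeddings to video tokens is not modelled here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_to_tokens(embeddings: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous embedding to the index of its nearest codebook entry.

    Assumption: tokens are codebook indices chosen by nearest-neighbour search.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(e, codebook[k]))
        for e in embeddings
    ]


def decode_from_tokens(tokens: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Recover an approximate embedding for each token by codebook lookup."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook: 3 entries of dimension 2.
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical embeddings for four video patches (made-up values).
PATCH_EMBEDDINGS = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.7]]

tokens = encode_to_tokens(PATCH_EMBEDDINGS, CODEBOOK)
reconstruction = decode_from_tokens(tokens, CODEBOOK)

print(tokens)          # discrete token ids, one per patch
print(reconstruction)  # lossy reconstruction from the codebook
```

The reconstruction is lossy by design: each patch is replaced by its closest codebook entry, which is what lets a second model work over a finite vocabulary of video tokens.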