RT-1 and RT-2


RT-1 and RT-2 (short for Robotics Transformer 1 and 2, respectively) are methods for leveraging pre-trained models in general robotics, applied in the papers specifically to robotic arm manipulation. This sidesteps many of the problems with the availability and quality of data for each subtype of robot, and it lets progress from other areas of AI research directly benefit robotics as well. Although the papers are very specific about their model architectures, their real importance lies in the flexibility of the methods: they can be used to model practically any action-based system, even if both are computationally very expensive (so expensive that they need to be run on cloud systems for real-world, real-time inference). This post covers both papers along with the other background needed to understand their architectures (skipping over well-known models like CNNs, Transformers, and LLMs).

RT-1:

RT-1 presents a general recipe for utilizing pre-trained encoding models in a robotics system. The overall structure follows that of a generic Transformer very closely: it takes a history of images and a task description as input and outputs tokenized actions for the robot. The main contributions are in how the inputs and outputs are tokenized using pre-trained, established models.

Input Tokenization:

The inputs are split up into 6 past images and a textual instruction. Tokenizing each image is handled by a pre-trained CNN (with EfficientNet-B3 being chosen in the paper). This generates a $9\times 9\times 512$ spatial feature map for each image, which is then flattened (instead of being turned into patches) to generate 81 visual tokens. After the visual tokens are calculated, they are passed through TokenLearner (to be covered later) to further compress the representation of each image into a measly 8 tokens. The resulting 48 tokens are passed to the Transformer.
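To make the token bookkeeping concrete, here is a minimal shape walk-through in NumPy. The EfficientNet-B3 and TokenLearner stages are stand-ins (a random array and a plain slice), not the real modules; only the shapes match the description above.

```python
import numpy as np

HISTORY = 6           # number of past images fed to the model
H = W = 9             # spatial size of the EfficientNet-B3 feature map
C = 512               # channel depth of the feature map
TOKENS_PER_IMAGE = 8  # tokens kept by TokenLearner per image

# Stand-in for the FiLM-conditioned EfficientNet-B3 output:
# one 9x9x512 feature map per image in the history window.
feature_maps = np.random.randn(HISTORY, H, W, C)

# Flatten each spatial map into 81 visual tokens of dimension 512.
visual_tokens = feature_maps.reshape(HISTORY, H * W, C)   # (6, 81, 512)

# Stand-in for TokenLearner: compress 81 tokens down to 8 per image.
compressed = visual_tokens[:, :TOKENS_PER_IMAGE, :]       # (6, 8, 512)

# The Transformer sees the history as one flat sequence of 6 * 8 = 48 tokens.
transformer_input = compressed.reshape(HISTORY * TOKENS_PER_IMAGE, C)
print(transformer_input.shape)  # (48, 512)
```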

The instruction can be processed with any language model encoder (with the Universal Sentence Encoder being chosen in the paper) to generate an embedding of the information within. This embedding is fed into an identity-initialized set of FiLM layers (to be covered later) that alter the information within the visual tokens. The FiLM layers are initialized as identity functions so they do not disturb the pre-trained CNN, since they are the only weights in the entire pre-processing architecture that need to be learned.

FiLM:

FiLM (short for Feature-wise Linear Modulation) is a method of using a sub-network to alter the information of each feature map within a given neural network. This is done through a feature-wise linear transformation whose parameters are produced by two networks, which often share weights. The sub-network allows the information derived from the text embedding to add extra information to the visual tokens without altering the pre-trained parameters of the CNN. In RT-1 the layers start as identity functions so they do not harm the CNN's weights, and they are trained from there to inject information in a controlled way.

$$
\begin{gather*}
\gamma_{i,c}=f_c(x_i)\\
\beta_{i,c}=h_c(x_i)\\
\text{FiLM}(F_{i,c}\mid\gamma_{i,c},\beta_{i,c})=\gamma_{i,c}F_{i,c}+\beta_{i,c}
\end{gather*}
$$
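Below is a minimal PyTorch sketch of such a layer, written to match the identity-initialization idea described above: the scale is predicted as a residual on top of 1 with zero-initialized weights, so the layer is exactly the identity before fine-tuning starts. The layer sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Sketch of a FiLM layer: a text embedding produces per-channel
    scale (gamma) and shift (beta) that modulate a CNN feature map."""

    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(embed_dim, num_channels)
        self.to_beta = nn.Linear(embed_dim, num_channels)
        # Identity initialization: gamma starts at 1, beta at 0,
        # so FiLM(F) = F before any training happens.
        nn.init.zeros_(self.to_gamma.weight)
        nn.init.zeros_(self.to_gamma.bias)
        nn.init.zeros_(self.to_beta.weight)
        nn.init.zeros_(self.to_beta.bias)

    def forward(self, features: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W), text_emb: (B, embed_dim)
        gamma = 1.0 + self.to_gamma(text_emb)   # (B, C), equals 1 at init
        beta = self.to_beta(text_emb)           # (B, C), equals 0 at init
        return gamma[:, :, None, None] * features + beta[:, :, None, None]

# Usage: modulate a fake 9x9x512 feature map with a 512-d sentence embedding.
film = FiLM(embed_dim=512, num_channels=512)
out = film(torch.randn(1, 512, 9, 9), torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 512, 9, 9])
```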

TokenLearner:

TokenLearner presents a method of representation learning (turning a dataset into a set of feature maps) based on tokens. The original model takes a tensor $X\in\mathbb{R}^{T\times H\times W\times C}$ of video data and turns it into a set of $S$ tokens $Z_t=\{z_i\}^S_{i=1}$ for each frame $X_t$, giving $Z\in\mathbb{R}^{ST\times C}$ over the whole video. This is done through a general tokenizer function $A_i$, shown below for some input $X_t$, where $\odot$ denotes element-wise multiplication and $\rho(\cdot)$ denotes spatial global average pooling.

$$
z_i=A_i(X_t)=\rho(X_t\odot A_{iw})=\rho(X_t\odot \gamma(\alpha_i(X_t)))
$$

The function $\alpha_i(\cdot)$ and the broadcasting function $\gamma(\cdot)$ can be any well-defined functions, but are chosen here as a stack of convolutional layers and a sigmoid, respectively. These operations reduce the frame to a dimensionality of $\mathbb{R}^C$ for each token $z_i$.
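The following PyTorch sketch implements the equation above under those choices: a small convolutional stack produces $S$ spatial attention maps, each of which gates the feature map before global average pooling. The exact convolution sizes are my own placeholders, not the ones used in the original paper.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Sketch of TokenLearner: learn S spatial attention maps and use each one
    to pool the H*W grid down to a single token (S = 8 in RT-1)."""

    def __init__(self, in_channels: int, num_tokens: int = 8):
        super().__init__()
        # alpha(.): produce one attention logit map per output token.
        self.alpha = nn.Sequential(
            nn.Conv2d(in_channels, num_tokens, kernel_size=3, padding=1),
            nn.Conv2d(num_tokens, num_tokens, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map for one frame X_t
        attn = torch.sigmoid(self.alpha(x))        # gamma(alpha(X_t)): (B, S, H, W)
        attn = attn.unsqueeze(2)                   # (B, S, 1, H, W)
        feats = x.unsqueeze(1)                     # (B, 1, C, H, W)
        # Element-wise product, then spatial global average pooling rho(.)
        tokens = (attn * feats).mean(dim=(-2, -1)) # (B, S, C)
        return tokens

tl = TokenLearner(in_channels=512, num_tokens=8)
print(tl(torch.randn(1, 512, 9, 9)).shape)  # torch.Size([1, 8, 512])
```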

Action Tokenization:

The final output of the Transformer is an 11-dimensional vector in a discrete action space that describes the robot's action. Seven dimensions control arm movement (x, y, z, roll, pitch, yaw, opening of the gripper) and three control base movement (x, y, yaw); each dimension is discretized into 256 value bins, uniformly mapped within the bounds of that variable. An additional dimension is used to switch between modes (controlling the arm, controlling the base, or terminating the episode).
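A minimal sketch of this uniform binning for a single action dimension is shown below; the variable bounds are illustrative placeholders, not the real robot limits.

```python
import numpy as np

NUM_BINS = 256  # discrete bins per action dimension

def tokenize_dim(value: float, low: float, high: float) -> int:
    """Map a continuous value to one of 256 uniformly spaced bins."""
    value = np.clip(value, low, high)
    bin_width = (high - low) / NUM_BINS
    return min(int((value - low) / bin_width), NUM_BINS - 1)

def detokenize_dim(bin_idx: int, low: float, high: float) -> float:
    """Map a bin index back to the center of its interval."""
    bin_width = (high - low) / NUM_BINS
    return low + (bin_idx + 0.5) * bin_width

# Example: an x-displacement hypothetically bounded to [-0.1 m, 0.1 m].
token = tokenize_dim(0.037, low=-0.1, high=0.1)
print(token, detokenize_dim(token, low=-0.1, high=0.1))
```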

RT-2:

RT-2 expands on RT-1 by extending the use of pre-trained models to the main Transformer architecture as well. This removes the need for FiLM layers and separate pre-processing steps to generate embeddings, instead relying on the innate strengths of modern LLMs, specifically LVLMs (Large Vision-Language Models), which are straightforward multi-modal extensions of LLMs. The paper presents the methodology on two different models, PaLI-X and PaLM-E, although, as will be seen later, the difference between the two is so minor that I did not deem it necessary to cover both in depth.

Robot-Action Fine-Tuning:

The main contribution of the paper is in how the models can be fine-tuned (trained on a specific set of data and instructions after general training is complete) to output robot actions. Just like in the model above, the output is trained to be a set of tokens in a given action space. This space is defined with 6 positional and rotational displacements and a level of extension of the robotic arm's gripper, each again split into 256 bins. Since each dimension's value can be denoted with an integer from 1 to 256, the model needs a dedicated output token for every bin. This is done by replacing redundant language tokens with the action tokens within the model. PaLI-X has preset number tokens, so its tokens for the numbers 1 through 256 are replaced with the corresponding action bins. PaLM-E has its 256 least frequently used tokens replaced with the action bins.
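The token-overwriting trick can be sketched as a simple lookup from bin index to vocabulary id. All ids and helper names below are made-up placeholders, but the two branches mirror the description above: a PaLI-X-style model reuses its existing number tokens, while a PaLM-E-style model reuses its least frequent tokens.

```python
# Sketch of mapping the 256 action bins of one dimension onto existing
# vocabulary ids. All ids here are hypothetical.

def pali_x_action_id(bin_idx: int, number_token_ids: dict[int, int]) -> int:
    """PaLI-X style: bin k reuses the id of the existing number token 'k+1'."""
    return number_token_ids[bin_idx + 1]      # bins 0..255 -> numbers 1..256

def palm_e_action_id(bin_idx: int, rare_token_ids: list[int]) -> int:
    """PaLM-E style: bin k overwrites the k-th least frequently used token."""
    return rare_token_ids[bin_idx]

# Hypothetical usage with toy vocabularies.
number_token_ids = {n: 1000 + n for n in range(1, 257)}  # "1" -> 1001, etc.
rare_token_ids = list(range(31744, 32000))               # 256 rarely used ids
print(pali_x_action_id(137, number_token_ids), palm_e_action_id(137, rare_token_ids))
```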

In order to still ensure generalizability during fine-tuning, the model puts two measures in place. Co-fine-tuning ensures that each batch has an equal amount of robot-specific data mixed with the original web data the model was trained on. This way the model retains its language and vision capabilities while still learning how to use the new output tokens. Along the same line, when the model is queried for an action, the output is constrained to only the 256 bin-specific tokens, with every other token being ignored.
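The output constraint amounts to masking the vocabulary at decoding time so only the reserved action ids can be sampled. Here is a minimal sketch; the vocabulary size and reserved id range are illustrative placeholders.

```python
import torch

VOCAB_SIZE = 32000
ACTION_TOKEN_IDS = torch.arange(31744, 32000)  # the 256 overwritten ids (hypothetical)

def constrain_to_action_tokens(logits: torch.Tensor) -> torch.Tensor:
    """Set all non-action logits to -inf so decoding can only pick a valid bin."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., ACTION_TOKEN_IDS] = 0.0
    return logits + mask

logits = torch.randn(1, VOCAB_SIZE)
next_token = torch.argmax(constrain_to_action_tokens(logits), dim=-1)
print(next_token)  # always one of the 256 action-bin ids
```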