Building Up to ConvMixer
ConvMixer is a relatively modern computer-vision architecture built around patch-based input processing, and it brings together almost all of the modern techniques for stabilizing and optimizing Convolutional Neural Networks. This post builds up to ConvMixer from scratch (and by scratch I mean that you already know the basics of neural networks and CNNs). The paper itself is very brief in its architectural explanations, so I thought it appropriate to lay out all of the necessary context before giving the architectural overview.
Batch Normalization:
Batch Normalization is the modern standard for stabilizing the learning process of neural networks. It acts to reduce drift in the distribution of each neuron's activations while the network is still learning, which, without some method of regularization, often causes problems for training stability and performance as model complexity grows. It works within a setup where an entire batch of the dataset is processed in parallel: once a layer has its activations ready for the whole batch, the mean and variance of those activations are computed across the batch, with $x_i$ being the $i$-th activation in a batch of size $m$ (each neuron is normalized independently):

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2$$
These are computed because the main point of Batch Normalization is to set the mean and variance of the activations to 0 and 1 respectively, hence the following calculation to normalize each activation (with a small constant $\epsilon$ added for numerical stability):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
In order to limit any loss of representational power caused by this process, an extra linear transform is applied to re-scale and shift the output using two learnable parameters, $\gamma$ and $\beta$:

$$y_i = \gamma \hat{x}_i + \beta$$

These learnable parameters stay in the model and continue to transform the information passed to each layer even after training. Since the normalization step is still needed for these to behave the same way they did during training, a running average of the mean and variance is kept during training and stored as constant values in the model; at inference time these are used in place of the batch-based statistics above.
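To make the moving parts concrete, here is a minimal sketch of Batch Normalization in PyTorch, assuming a 2D input of shape (batch, features); in practice you would simply use `nn.BatchNorm1d` or `nn.BatchNorm2d`, and the class name and hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of BatchNorm over per-neuron activations for a (batch, features) input.
class SimpleBatchNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_features))   # learnable shift
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))
        self.eps, self.momentum = eps, momentum

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)                 # mean over the batch, per neuron
            var = x.var(dim=0, unbiased=False)   # variance over the batch, per neuron
            # update the running statistics used at inference time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)  # normalize to mean 0, variance 1
        return self.gamma * x_hat + self.beta            # learnable re-scale and shift
```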
The LayerNorm layer found in many Transformer networks, and in the MLP-Mixer mentioned later, is a simplified version of this: rather than taking the mean and variance across an entire batch, it takes the mean and variance of a single layer's activations in isolation.
GELU:
The GELU is the modern standard activation function, especially in computer vision. It is based on the CDF of the standard Gaussian distribution, $\Phi(x)$, and is defined here as $\text{GELU}(x) = x\,\Phi(x)$. This inherent relation with the distribution gives the model finer control around small inputs, and it also provides a smooth curve between negative and positive values.
It can also be approximated with the following, which provides a faster way of computing it:

$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\big(x + 0.044715\,x^3\big)\right]\right)$$
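As a quick sanity check, here is a small sketch comparing the exact GELU with the tanh approximation in PyTorch; the function names are just for illustration, and PyTorch's built-in `nn.GELU` covers both variants.

```python
import math
import torch
import torch.nn as nn

def gelu_exact(x):
    # Phi(x) written with the error function: 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # the faster tanh-based approximation
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-3, 3, 7)
print(gelu_exact(x))
print(gelu_tanh(x))   # very close to the exact values
print(nn.GELU()(x))   # built-in, matches gelu_exact
```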
Vision Transformer:
You don’t need to know anything about the Transformer architecture itself for this one. All that matters is how images are formatted for image-processing problems. To turn an image into something the natively sequence-based Transformer can process, the image has to be broken down in a predictable manner. The approach chosen was to split the image into a set of square patches and treat those, after adding positional embeddings, as the input sequence. This methodology provides the groundwork for the patch-based systems adopted by models like ConvMixer.
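Here is a short sketch of that patchification step in PyTorch; the patch size of 16 and embedding dimension of 768 are illustrative values rather than anything tied to a specific ViT variant.

```python
import torch
import torch.nn as nn

p, dim = 16, 768
img = torch.randn(1, 3, 224, 224)                     # (batch, channels, H, W)

# A strided convolution with kernel = stride = p is equivalent to cutting the
# image into non-overlapping p x p patches and linearly projecting each one.
to_patches = nn.Conv2d(3, dim, kernel_size=p, stride=p)
tokens = to_patches(img)                              # (1, dim, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)            # (1, 196, dim) sequence of patch tokens

pos_embedding = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
tokens = tokens + pos_embedding                       # add positional information
print(tokens.shape)                                   # torch.Size([1, 196, 768])
```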
MLP-Mixer:
MLP-Mixer is referenced in the paper as another model that inspired the patch-based style of input processing. It shares the same approach of building an architecture that processes patches instead of the entire image. It strays from the other methods covered here in how the patches are processed, since it relies solely on MultiLayer Perceptrons (another name for simple fully-connected networks), but it does share the last couple of layers in common. The outputs just before the classification layer are passed through Global Average Pooling and straight into a fully-connected layer to produce the prediction. Global Average Pooling simply takes the average over an entire spatial or token dimension rather than within a small kernel; this provides one last way of mixing information across patches and gives good results.
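As a sketch of that final stage, the following shows Global Average Pooling over the patch tokens followed by a fully-connected layer; the token count, channel width, and class count are illustrative.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 512)        # (batch, num_patches, channels)
pooled = tokens.mean(dim=1)              # average over every patch token -> (1, 512)
classifier = nn.Linear(512, 1000)        # final fully-connected layer
logits = classifier(pooled)              # (1, 1000)
print(logits.shape)
```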
Depthwise Separable Convolution:
ConvMixer extensively uses the Depthwise Separable Convolution, which was popularized by the Xception paper. It significantly speeds up the convolutional process; its only small weakness is with information that is simultaneously cross-spatial and cross-channel, which is rarely an issue in practice. It breaks the operation into two steps, a Depthwise Convolution (channel-specific spatial filtering) and a Pointwise Convolution (location-specific channel mixing), with the output of the first feeding directly into the second.
The Depthwise Convolution comes first in the network and performs a convolution on each channel separately. Each individual convolution is much cheaper, and even though one is needed per channel, the overall computation is still significantly faster. The general procedure is defined below for an input of $C$ channels, with $K_c$ being the kernel (weight matrix) for channel $c$ and $*$ denoting 2D convolution:

$$\hat{x}_c = K_c * x_c, \qquad c = 1, \dots, C$$
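In PyTorch, a depthwise convolution can be sketched by setting `groups` equal to the number of input channels, so each channel gets its own spatial kernel; the channel count and kernel size below are illustrative.

```python
import torch
import torch.nn as nn

channels = 64
# groups=channels means each channel is filtered independently by its own kernel
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

x = torch.randn(1, channels, 32, 32)
out = depthwise(x)          # each of the 64 channels filtered separately
print(out.shape)            # torch.Size([1, 64, 32, 32])
```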
The Pointwise Convolution comes right after and acts to mix the information between channels. This is done with a $1 \times 1$ kernel convolution at each spatial location, which mixes the channels and further processes the inter-channel relationships at each position. Again, it is defined below, with the key difference being the $1 \times 1$ kernel and stride of 1, which do not appear explicitly in the equation; $W_{j,c}$ is the weight connecting input channel $c$ to output channel $j$:

$$y_j = \sum_{c=1}^{C} W_{j,c}\,\hat{x}_c$$
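And here is a sketch of the pointwise convolution chained after the depthwise one to form the full Depthwise Separable Convolution; again, the channel counts are illustrative.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 128
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # 1x1 kernel, stride 1: channel mixing only

x = torch.randn(1, in_ch, 32, 32)
out = pointwise(depthwise(x))   # per-channel spatial filtering, then cross-channel mixing
print(out.shape)                # torch.Size([1, 128, 32, 32])
```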
ConvMixer:
This all comes together to explain the general architectural design for ConvMixer. The model architecture can be broken down into the following three parts.
- Patch Embedding
- Any number of ConvMixer Layers
- Final post-processing (the same as that shown in MLP-Mixer)
Patch embedding is relatively simple and sits somewhere between the ViT and MLP-Mixer approaches. It simply takes a convolution with kernel size $p$ and stride $p$ (the patch size) and applies a GELU and a Batch Normalization layer. This not only gives a much smaller-resolution representation of the image, but also increases the number of channels and feature dimensions from 3 to the embedding dimension $h$, which is a critical step for the final Global Average Pooling.
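A sketch of this stem in PyTorch might look like the following, with `dim` and `p` as illustrative hyperparameters.

```python
import torch
import torch.nn as nn

dim, p = 256, 7
patch_embed = nn.Sequential(
    nn.Conv2d(3, dim, kernel_size=p, stride=p),  # 3 channels -> dim channels, resolution divided by p
    nn.GELU(),
    nn.BatchNorm2d(dim),
)

x = torch.randn(1, 3, 224, 224)
print(patch_embed(x).shape)   # torch.Size([1, 256, 32, 32])
```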
This then moves right into the ConvMixer layer, which is a Depthwise Separable Convolution with GELU and BN layers in between its steps. The only thing not covered so far is the residual connection (otherwise called a skip connection), which I skipped because it plays such a small part in the model; the following equations should do the explanation justice. In short, it just adds the layer's unprocessed input back onto the processed output:

$$x' = \text{BN}\big(\text{GELU}(\text{DepthwiseConv}(x))\big) + x$$

$$y = \text{BN}\big(\text{GELU}(\text{PointwiseConv}(x'))\big)$$
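Putting those equations into a PyTorch module, a single ConvMixer layer can be sketched as follows; `dim` and `kernel_size` are illustrative hyperparameters.

```python
import torch
import torch.nn as nn

class ConvMixerLayer(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = self.depthwise(x) + x   # residual (skip) connection around the depthwise step
        return self.pointwise(x)

layer = ConvMixerLayer(256)
print(layer(torch.randn(1, 256, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```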
After the data has been run through however many of these layers the specific model configuration calls for, the final post-processing step is reached. It uses the same general process as MLP-Mixer, with the Global Average Pooling taken over the spatial dimensions of each channel. This is then fed to the final fully-connected layer, which gives the output of the model.
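Finally, here is a sketch of how the whole model could be assembled, reusing the `ConvMixerLayer` class from the previous sketch; `dim`, `depth`, `p`, `kernel_size`, and `n_classes` are all illustrative hyperparameters.

```python
import torch
import torch.nn as nn

# assumes ConvMixerLayer from the sketch above is in scope
def conv_mixer(dim=256, depth=8, p=7, kernel_size=9, n_classes=1000):
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=p, stride=p),   # patch embedding stem
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[ConvMixerLayer(dim, kernel_size) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),                      # global average pooling over each channel
        nn.Flatten(),
        nn.Linear(dim, n_classes),                    # final fully-connected layer
    )

model = conv_mixer()
print(model(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 1000])
```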