data2vec: A Milestone in Self-Supervised Learning

Machine learning models have relied heavily on labeled data for training, and traditionally, training on labeled data yields accurate results. However, the main downside of labeled data is the high annotation cost, which rises with the size of the training set. High annotation costs are a major hurdle for developers, especially on large projects with substantial amounts of training data.

To tackle the annotation issue, developers came up with the concept of self-supervised learning (SSL). Self-supervised learning is a machine learning process in which the model trains itself to learn one part of the input from another part of the input. A self-supervised model aims to exploit the relationships within the data instead of relying on the supervised signals of labeled data.

In addition to self-supervised learning, there are several other methods and models for training machine learning models without labeled data. However, most of these methods have two major issues:

  1. They are often specialized for a single modality, such as images or text.
  2. They require a large amount of computational power.

These limitations are a major reason why the average human mind learns from any single type of data far more effectively than an AI system, which relies on separate models and training data to handle images, text, and speech.

To tackle the single-modality issue, Meta AI released data2vec, a first-of-its-kind, high-performance self-supervised algorithm that learns patterns from three different modalities: images, text, and speech. With data2vec, text understanding could be applied to an image segmentation problem, or deployed in a speech recognition task.

In this article, we discuss the data2vec model in depth. We cover the method overview, related work, architecture, and results so that you have a clear understanding of the data2vec algorithm.

Data2vec Introduction: The Core Concept

Although the fundamental concept of self-supervised learning applies across modalities, the actual objectives and algorithms differ from one another because they were designed with a single modality in mind. This single-modality design is the reason the same self-supervised learning algorithm cannot work effectively across different kinds of training data.

To overcome the challenge posed by single-modality models and algorithms, Meta AI released data2vec, an algorithm that uses the same learning method for computer vision, NLP, or speech.

The core idea behind data2vec is to use a masked view of the input to predict latent representations of the full input data in a self-distillation setup, using a standard Transformer architecture. So, instead of predicting modality-specific targets such as visual tokens, words, or units of speech, which are local in nature, data2vec predicts latent representations that carry information from the entire input.

Why Does the AI Industry Need the Data2Vec Algorithm?

Self-supervised learning models build representations of the training data without human-annotated labels, and this is one of the major reasons behind the advancement of natural language processing (NLP) and computer vision technology. Self-supervised representations are also the reason why tasks like speech recognition and machine translation can now use unsupervised learning in their models.

Until now, self-supervised learning algorithms have focused on individual modalities, which results in specific designs and learning biases in the models. This focus on individual modalities creates challenges in different AI applications, including computer vision and NLP.

For example, speech processing has no ready-made vocabulary of speech units over which a self-supervised learning task can be defined, in the way that words define one in NLP. Similarly, in computer vision, developers can regress the input, learn discrete visual tokens, or learn representations that are invariant to data augmentation. Although these learning biases are useful, it is difficult to confirm whether they will generalize to other modalities.

The data2vec algorithm is a major milestone in self-supervised learning because it aims to improve multiple modalities rather than just one. Moreover, data2vec does not rely on reconstructing the input or on contrastive learning.

So the world needs data2vec because it has the potential to accelerate progress in AI and contributes to developing AI models that can learn about different aspects of their surroundings seamlessly. Scientists hope that data2vec will allow them to develop more adaptable AI and ML models that can perform highly advanced tasks beyond what today’s AI models can do.

What Is the Data2Vec Algorithm?

Data2vec is a unified framework that aims to implement self-supervised machine learning across different data modalities, including images, speech, and text.

The data2vec algorithm aims to create ML models that learn the general patterns in their environment much better by keeping the learning objective uniform across modalities. The data2vec model unifies the learning algorithm, but it still learns the representations for each modality separately.

With the introduction of data2vec, Meta AI hopes to make multimodal learning more effective and much simpler.

How Does the Data2Vec Algorithm Work?

The data2vec algorithm combines the learning of latent target representations with masked prediction, and it uses multiple network layers as targets to generalize the latent representations. The model trains an off-the-shelf Transformer network that is used in either teacher mode or student mode.

In teacher mode, the model first builds representations of the full input data, which serve as targets in the learning task. In student mode, the model encodes a masked version of the input data, which is then used to predict the representations of the full data.

The figure above shows how the data2vec model uses the same learning process for different modalities. In the first step, the model produces representations of the input data (teacher mode). The model then regresses these representations based on a masked version of the input.

Furthermore, since data2vec uses latent representations of the input data, it can be viewed as a simplified version of modality-specific designs such as creating suitable targets by normalizing the input or learning a fixed set of visual tokens. The key difference between data2vec and other algorithms is that data2vec uses self-attention to make its target representations contextualized and continuous. Other self-supervised learning models, by contrast, use a fixed set of targets based on local context.

Data2vec: Model Method

The data2vec model is trained by predicting the model’s representations of the full input data given a partial view of the input. As the given figure shows, the dog’s face is masked in the image, a particular section of the voice recording is masked, and the word “with” is masked in the text.

The model first encodes a masked version of the training sample (student mode). It then constructs training targets by encoding the unmasked version of the input with the same model, this time parameterized as an exponentially moving average of the model weights (teacher mode). The target representations encode the information present in the training sample, and in student mode the learning task is to predict these representations when given a partial view of the input.

Model Architecture

The data2vec model uses a standard Transformer architecture with modality-specific encoding of the input data. For computer vision, the model uses the ViT strategy to encode an image as a sequence of patches, where each patch spans 16×16 pixels and is fed to a linear transformation.

For speech recognition, the model encodes the data using a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations. For text, the model preprocesses the data to extract sub-word units and then embeds them in distributional space via learned embedding vectors.

Once the model has embedded the input data as a sequence of tokens, it masks part of these units by replacing them with a learned mask embedding token and then feeds the sequence to the Transformer network. For computer vision, the model uses a block-wise masking strategy. For speech, spans of latent speech representations are masked, and for language tasks, tokens are masked.

Training Targets

The data2vec model aims to predict the model’s representations of the unmasked training sample based on an encoding of the masked sample that was fed to the model. The model predicts representations only for masked time-steps.

The model predicts contextualized representations that encode not only the particular time-step but also other information from the sample, because the Transformer network uses self-attention. These contextualized targets are what distinguish data2vec from existing models such as BERT, wav2vec, BEiT, SimMIM, MAE, and MaskFeat, which predict targets that lack contextual information.

Here is how the data2vec model parameterizes the teacher mode to predict the network representations that then serve as targets.

Teacher Parameterization

The data2vec model parameterizes the encoding of the unmasked training sample with an exponentially moving average (EMA) of the model parameters (θ), where the weights of the model in target mode (∆) are updated as follows:

                                           ∆ ← τ∆ + (1 − τ) θ


Furthermore, the model uses a schedule for τ that linearly increases the parameter from τ0 to the target value τe over the first τn updates. After these updates, the value stays constant until training ends. This schedule updates the teacher more frequently at the beginning of training, when the model is random, and less frequently as training proceeds and good parameters have been learned.
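To make the update rule and the τ schedule concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that the weights are flat lists of floats; the function names `tau_schedule` and `ema_update` are hypothetical and do not come from the data2vec code base.

```python
def tau_schedule(step, tau_0, tau_e, tau_n):
    """Linearly increase tau from tau_0 to tau_e over the first tau_n updates, then hold."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n


def ema_update(teacher, student, tau):
    """Return new teacher weights: tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy example: two scalar "weights" per model (an assumption for illustration).
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for step in range(3):
    tau = tau_schedule(step, tau_0=0.999, tau_e=0.9999, tau_n=30000)
    teacher = ema_update(teacher, student, tau)
```

A smaller τ early in training lets the teacher follow the student quickly; as τ approaches τe, the teacher changes more slowly.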

The results show that the model is more efficient and more accurate when the parameters of the feature encoder and the positional encoder are shared between the student and teacher modes.


Training targets are constructed from the outputs of the top K blocks of the teacher network for the time-steps that are masked in student mode. The output of block l at time-step t is denoted a_t^l. The model applies a normalization to each block to obtain â_t^l and then averages the top K blocks

                                           y_t = (1/K) Σ_{l=L−K+1}^{L} â_t^l

to obtain the training target y_t for time-step t, for a network with L blocks in total.
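The target construction can be sketched in a few lines of Python: normalize the output of each of the top K blocks, then average them. This is an illustration under stated assumptions, not the reference implementation. Each block output is a plain list of floats for one time-step, and the normalization is a parameter-free zero-mean, unit-variance normalization over the feature dimension. The names `normalize` and `build_target` are hypothetical.

```python
import math


def normalize(vec, eps=1e-5):
    """Parameter-free normalization of one vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def build_target(block_outputs, k):
    """Average the normalized outputs of the top k blocks for one time-step.

    block_outputs: list of L vectors, ordered from the first block to the last.
    """
    top = [normalize(a) for a in block_outputs[-k:]]
    dim = len(top[0])
    return [sum(a[i] for a in top) / k for i in range(dim)]


# Toy teacher with L = 3 blocks and 3 features per time-step.
blocks = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 10.0, 20.0]]
y_t = build_target(blocks, k=2)
```

Because every block is normalized before averaging, the third block does not dominate the target even though its raw values are much larger.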

This creates the training targets that the model regresses in student mode. In preliminary experiments, averaging performed as well as predicting each block separately with a dedicated projection, while being much more efficient.

Furthermore, normalizing the targets prevents the data2vec model from collapsing into a constant representation for all time-steps, and it prevents layers with a high norm from dominating the target features. For speech recognition, the model uses instance normalization over the current input sample without any learned parameters. This is mainly because the stride over the input data is small, so neighboring representations are highly correlated.

The researchers also found that for computer vision and NLP, parameter-less normalization is sufficient. The problem can also be addressed with Variance-Invariance-Covariance regularization, but the strategy described above performs well and does not require any additional parameters.


For contextualized training targets y_t, the model uses a Smooth L1 loss to regress the targets, as shown below:

                                           L(y_t, f_t(x)) = ½ (y_t − f_t(x))² / β     if |y_t − f_t(x)| ≤ β
                                           L(y_t, f_t(x)) = |y_t − f_t(x)| − ½ β      otherwise

Here, β controls the transition from a squared loss to an L1 loss, which depends on the size of the gap between the target y_t and the model prediction f_t(x) at time-step t. The advantage of this loss is that it is less sensitive to outliers, although the setting of β needs to be tuned.
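As a quick illustration of how a Smooth L1 loss behaves, here is a scalar sketch in Python. It assumes the standard piecewise definition, which is quadratic when the gap is at most β and linear beyond that. The function name `smooth_l1` is chosen here for illustration.

```python
def smooth_l1(target, prediction, beta):
    """Scalar Smooth L1 loss: squared loss for small gaps, L1 loss for large gaps."""
    diff = abs(target - prediction)
    if diff <= beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


small_gap = smooth_l1(0.0, 1.0, beta=2.0)  # quadratic regime
large_gap = smooth_l1(0.0, 5.0, beta=2.0)  # linear regime
```

The two branches meet at a gap of exactly β, so the loss is continuous. A larger β widens the quadratic region, and a smaller β makes the loss behave more like plain L1.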

Experimental Setup

The data2vec model is evaluated in two sizes: data2vec Base and data2vec Large. For numerical stability, the EMA updates are done in fp32. The models contain L = 12 or L = 24 Transformer blocks with hidden dimension H = 768 or H = 1024, respectively. Let’s take a detailed look at the experimental setup for each modality.

Computer Vision

The data2vec model embeds images of 224×224 pixels as patches of 16×16 pixels. Each patch is linearly transformed, and the resulting sequence of 196 representations is fed to the standard Transformer.
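The sequence length of 196 follows directly from the image and patch sizes. The short sketch below works through the arithmetic. The helper name `patchify_shape` is hypothetical, and the three color channels are an assumption (RGB input) that the text does not state.

```python
def patchify_shape(image_size, patch_size, channels=3):
    """Return (number of patches, flattened size of one patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


# 224 / 16 = 14 patches per side, so 14 * 14 = 196 patches in total.
num_patches, patch_dim = patchify_shape(224, 16)
```

Each flattened patch is then mapped by the linear transformation to the hidden dimension of the Transformer.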

The model follows BEiT in masking blocks of adjacent patches, where each block contains at least 16 patches and has a random aspect ratio. However, instead of masking 40% of the patches as in the original BEiT model, data2vec masks 60% of the patches for better accuracy.

The model also applies randomly resized image crops, horizontal flipping, and color jittering. The same modified image is used in both teacher and student mode.

The ViT-B models are pre-trained for 800 epochs. The batch size is 8,192 for the ViT-L model and 2,048 for the ViT-B model. The data2vec model uses Adam and a cosine schedule with a single cycle, warming up the learning rate for 80 epochs to 0.001 for ViT-L and for 40 epochs to 0.001 for ViT-B.

For both ViT-B and ViT-L, the data2vec model uses β = 2, K = 6, and a constant τ = 0.9998 with no schedule. The model also uses a stochastic depth rate of 0.2.

For ViT-L, the model trains for 1,600 epochs in total: the first 800 epochs use τ = 0.9998, after which the learning rate schedule is reset and training continues for the final 800 epochs with τ = 0.9999.

For image classification, the model mean-pools the output of the last Transformer block and feeds it to a softmax-normalized classifier. ViT-L is then fine-tuned for 50 epochs and ViT-B for 100 epochs, using Adam and a cosine schedule with learning-rate warmup.

Speech Processing

For speech processing, data2vec is implemented in fairseq, a sequence-modeling toolkit used to train custom models for summarization, translation, and text generation. The model takes a 16 kHz waveform as input, which is processed by a feature encoder containing temporal convolutions with 512 channels, kernel widths (10,3,3,3,3,2,2), and strides (5,2,2,2,2,2,2).

As a result, the encoder outputs representations at 50 Hz, with a stride of about 20 ms between samples. The receptive field comprises 400 input samples, or 25 ms of audio. The raw waveform fed to the encoder is normalized to zero mean and unit variance.
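The 50 Hz output rate and the 400-sample receptive field can be checked from the kernel widths and strides quoted above. The sketch below applies the standard receptive-field recurrence for stacked 1-D convolutions without dilation. The function name is hypothetical.

```python
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE_HZ = 16000


def total_stride_and_receptive_field(kernels, strides):
    """Return (overall stride, receptive field), both in input samples."""
    jump, receptive_field = 1, 1
    for kernel, stride in zip(kernels, strides):
        # Each layer widens the receptive field by (kernel - 1) steps of the current jump.
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return jump, receptive_field


stride, receptive_field = total_stride_and_receptive_field(KERNELS, STRIDES)
output_rate_hz = SAMPLE_RATE_HZ / stride                      # frames per second
stride_ms = 1000.0 * stride / SAMPLE_RATE_HZ                  # time between frames
receptive_field_ms = 1000.0 * receptive_field / SAMPLE_RATE_HZ
```

The product of the strides is 320 samples, which at 16 kHz corresponds to one output frame every 20 ms.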

The masking strategy used by data2vec for the Base model follows the framework of Baevski et al. for self-supervised learning in speech recognition. The model samples all time-steps with probability p = 0.065 to be starting indices and masks the ten time-steps that follow each one. For a typical training sequence, this results in approximately 49% of all time-steps being masked.
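The figure of roughly 49% can be reproduced with a short calculation. The sketch below makes a simplifying assumption: every time-step independently becomes a span start with probability p, and spans may overlap. The function names are hypothetical, and real implementations may sample start indices differently.

```python
import random


def expected_masked_fraction(p, span):
    """A step stays unmasked only if none of the `span` possible starts covering it fired."""
    return 1.0 - (1.0 - p) ** span


def sample_span_mask(num_steps, p, span, rng):
    """Sample a boolean mask; each start index masks itself and the following steps."""
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < p:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask


fraction = expected_masked_fraction(p=0.065, span=10)
mask = sample_span_mask(num_steps=10000, p=0.065, span=10, rng=random.Random(0))
empirical_fraction = sum(mask) / len(mask)
```

Overlapping spans are the reason the masked fraction is about 49% rather than the 65% that 0.065 × 10 would suggest.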

During training, the data2vec model linearly anneals τ using τ0 = 0.999, τe = 0.9999, and τn = 30,000. The model uses the Adam optimizer with a peak learning rate of 5×10⁻⁴ for the Base model. The Base model also uses a tri-stage scheduler that warms up the learning rate linearly over the first 3% of updates, holds it for the next 90%, and then decays it linearly over the remaining 7%.
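A tri-stage schedule of this shape can be sketched as a plain function of the update step. This is an illustrative sketch, not the fairseq scheduler. It assumes that warmup starts from zero and that the decay is linear down to zero, as the text describes. The name `tri_stage_lr` and the total of 1,000 updates are made up for the example.

```python
def tri_stage_lr(step, total_steps, peak_lr, warmup_frac, hold_frac):
    """Linear warmup, constant hold, then linear decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    hold_steps = round(total_steps * hold_frac)
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Speech Base setup from the text: 3% warmup, 90% hold, 7% decay, peak 5e-4.
schedule = [tri_stage_lr(s, 1000, 5e-4, 0.03, 0.90) for s in range(1001)]
```

The same function covers other splits by changing `warmup_frac` and `hold_frac`. The decay fraction is whatever remains.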

Natural Language Processing

The data2vec model tokenizes the input with a byte-pair encoding of 50K types, and the model then learns an embedding for each type. After the data is encoded, the model applies the BERT masking strategy to 15% of uniformly selected tokens, of which 80% are replaced by a learned mask token, 10% are replaced by random vocabulary tokens, and the remaining 10% are left unchanged.
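The 15% / 80-10-10 rule can be sketched as follows. This is a simplified illustration that selects each token independently with probability 0.15 instead of drawing an exact 15% subset. The function name `bert_mask`, the toy vocabulary, and the `<mask>` string are assumptions made for the example.

```python
import random


def bert_mask(tokens, vocab, mask_token, rng, select_prob=0.15):
    """Return (corrupted tokens, indices selected as prediction targets)."""
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token is not selected and stays untouched
        selected.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token         # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: keep the original token unchanged
    return corrupted, selected


rng = random.Random(0)
vocab = ["the", "dog", "plays", "with", "a", "ball"]
sentence = ["the", "dog", "plays", "with", "a", "ball"]
corrupted, selected = bert_mask(sentence, vocab, "<mask>", rng)
```

Only the positions in `selected` contribute to the loss; the unchanged 10% still count as prediction targets.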

During pre-training, the model uses τ0 = 0.999, τe = 0.9999, τn = 100,000, K = 10, and β = 4. The model uses the Adam optimizer with a tri-stage learning rate schedule that warms up the learning rate linearly over the first 5% of updates, holds it for the next 80%, and then decays it linearly over the remaining 15%, with a peak learning rate of 2×10⁻⁴.

The model trains on 16 GPUs with a batch size of 256 sequences, each containing about 512 tokens. For downstream tasks, the pre-trained model is fine-tuned with four different learning rates (1×10⁻⁴, 2×10⁻⁴, 3×10⁻⁴, and 4×10⁻⁴), and the one that performs best is selected for the downstream NLP tasks.


Results

Let’s look at how the data2vec model performs when it applies the strategies discussed above to the different modalities.

Computer Vision

To evaluate the results for computer vision, the data2vec model is pre-trained on the images of the ImageNet-1K dataset. The resulting model is fine-tuned using the labeled data of the same benchmark. Following standard practice, the model is then evaluated in terms of top-1 accuracy on the validation data.

The results distinguish between methods that use a single self-supervised model and methods that train a separate visual tokenizer on additional data or rely on other self-supervised learning models.

The table below compares the performance of the data2vec model for computer vision with other existing models, for both ViT-B and ViT-L.

The results from the above table can be summarized as follows.

  • The data2vec model outperforms prior work with both the ViT-B and ViT-L models in the single-model setting.
  • The masked prediction setup used in data2vec to predict contextualized latent representations performs better than methods that predict local targets such as engineered image features, input pixels, or visual tokens.
  • The data2vec model also outperforms self-distillation methods that regress the final layer of the student network while taking two different augmented versions of an image as inputs.

Audio & Speech Processing

For speech and audio processing, the data2vec model is trained on about 960 hours of audio data from the Librispeech (LS-960) dataset. The dataset contains clean speech audio from English audiobooks, and it is treated as a standard benchmark in speech and audio processing.

To analyze the model’s performance in different resource settings, the researchers fine-tuned the data2vec model for automatic speech recognition using different amounts of labeled data (from 10 minutes up to 960 hours). data2vec is compared against HuBERT and wav2vec 2.0, two of the most popular algorithms for speech and audio representation learning, both of which rely on discrete speech units.

The above table compares the performance of data2vec, in terms of word error rate for speech recognition, with other existing models. LM denotes the language model used for decoding. The results can be summarized as follows.

  • The data2vec model shows improvements in most labeled-data setups, with the largest gain for Base models at 10 minutes of labeled data.
  • For Large models, data2vec performs significantly better on small labeled datasets, and performance is comparable on the resource-rich setups with 100 and 960 hours of labeled data. This is because performance generally saturates on resource-rich labeled datasets for most models.
  • The results suggest that when the model uses rich contextualized targets, it is not essential to learn discrete units.
  • Learning contextualized targets during training helps improve overall performance significantly.
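For readers unfamiliar with the metric used in this comparison, word error rate is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal, self-contained version. The function name and the example sentences are made up for illustration and are unrelated to the reported results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower is better, and a value above 1.0 is possible when the hypothesis contains many inserted words.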

To further validate data2vec’s approach for audio, the model is also trained on the AudioSet benchmark. Although the pre-training setup for AudioSet is similar to that for Librispeech, the model is trained with K = 12 for over 200K updates, where each batch contains 94.5 minutes of audio.

The model then applies the DeepNorm framework and layer normalization to the targets to help stabilize training. The model is fine-tuned on the balanced subset with a batch size of 21.3 minutes for 13K updates. Fine-tuning also uses linear softmax pooling and mixup with a probability of 0.7. The model adds a single linear projection into the 527 audio classes and sets the learning rate of the projection to 2e-4.

The pre-trained parameters use a learning rate of 3e-5, and the model applies masking during fine-tuning. The table below summarizes the results, which show that the data2vec model outperforms a comparable setup that uses the same fine-tuning and pre-training data.

Natural Language Processing

To analyze data2vec’s performance on text, the model follows the same training setup as BERT: it is pre-trained on the English Wikipedia dataset for over 1M updates with a batch size of 256 sequences. The model is evaluated on the General Language Understanding Evaluation (GLUE) benchmark, which includes natural language inference (MNLI, Multi-Genre Natural Language Inference), sentence similarity (QQP, Quora Question Pairs; MRPC, Microsoft Research Paraphrase Corpus; and STS-B, Semantic Textual Similarity Benchmark), sentiment analysis (SST-2, Stanford Sentiment Treebank), and grammaticality (CoLA).

To fine-tune the data2vec model, each task provides its own labeled data, and the average accuracy on the development sets is reported over five fine-tuning runs. The following table summarizes the performance of the data2vec model on natural language processing tasks and compares it with other models.

  • The results show that the data2vec model outperforms the RoBERTa baseline, as the strategy in data2vec does not use random targets.
  • The data2vec model is the first successful pre-trained NLP model that does not use discrete units such as characters, words, or sub-words as training targets. Instead, data2vec predicts a contextualized latent representation computed over the entire unmasked text sequence.
  • This creates a learning task in which the model must predict targets with properties specific to the current sequence, rather than representations that are generic to a particular discrete text unit.
  • Furthermore, the set of training targets is not fixed: the model is free to define new targets, which makes it suitable for open-vocabulary settings.

Data2Vec: Ablation Study

Ablation is a term for the removal of a component in AI and ML systems. An ablation study investigates the performance of an AI or ML model by removing certain key components from it, which allows researchers to understand the contribution of each component to the overall system.

Layer Averaged Targets

A major difference between data2vec and other self-supervised learning models is that data2vec uses targets based on averaging multiple layers from the teacher network. The idea comes from the observation that the top layers of the wav2vec 2.0 model do not perform as well on downstream tasks as the middle layers of the model.

In the following experiment, the performance on all three modalities is measured by averaging K = 1, 2, …, 12 layers, where K = 1 corresponds to predicting only the top layer. For faster turnaround, data2vec trains Base models with 12 layers in total. For speech recognition, the model is pre-trained for over 200 thousand updates on Librispeech and then fine-tuned on the 10-hour labeled split of Libri-light. For natural language processing, the average GLUE score on the validation set is reported. For computer vision, the model is pre-trained for 300 epochs, and the top-1 accuracy on ImageNet is reported.

The above figure shows that targets based on multiple layers generally improve over using only the top layer (K = 1) for all modalities. Using all available layers is good practice because neural networks build features over multiple layers, and different types of features are extracted at different layers.

Using features from multiple layers boosts accuracy and enriches the self-supervised learning task.

Target Feature Type

The Transformer blocks in the data2vec model contain several layers that can each serve as targets. To analyze how different layers affect performance, speech models are pre-trained on Librispeech using different layers as target features.

The figure below indicates that the output of the feed-forward network (FFN) works best, while the output of the self-attention blocks does not yield a usable model.

Target Contextualization

Teacher representations in the data2vec model use self-attention over the entire input to produce contextualized targets. This is what separates data2vec from other self-supervised learning models, which construct a learning task by reconstructing or predicting local parts of the input. It naturally raises the question: does the data2vec model require contextualized targets to work well?

To answer the question, the researchers construct target representations that have access to only a predetermined fraction of the input rather than the entire input sample. To do so, they restrict the self-attention mechanism of the teacher so that it can access only a portion of the input surrounding the current time-step. After the model has been trained, it is fine-tuned with access to the full context size.
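One way to picture the restricted teacher is as an attention mask that only allows each position to see a window of neighbors. The sketch below builds such a mask in plain Python. It is an assumption about how the restriction could be expressed, not the authors' implementation, and the name `local_attention_mask` and the symmetric window are illustrative choices.

```python
def local_attention_mask(num_steps, context):
    """mask[i][j] is True when position i may attend to position j.

    `context` is the window size: position i sees itself plus context // 2
    neighbors on each side (fewer at the sequence edges).
    """
    half = context // 2
    return [[abs(i - j) <= half for j in range(num_steps)]
            for i in range(num_steps)]


# With 5 time-steps and a window of 3, each step sees only its direct neighbors.
mask = local_attention_mask(num_steps=5, context=3)

# A window that covers the whole sequence recovers full self-attention.
full = local_attention_mask(num_steps=5, context=9)
```

Growing `context` toward the sequence length moves the teacher from local targets to fully contextualized ones, which is the axis this ablation varies.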

The figure below indicates that larger context sizes generally lead to better performance, and the best accuracy is obtained when the entire input sample is visible. This supports the view that richer target representations yield better performance.

Modality-Specific Feature Extractors and Masking

The primary goal of data2vec is to design a simple learning mechanism that works across different modalities. Although data2vec has a unified learning regime, it still uses modality-specific feature extractors and masking strategies.

It makes sense that frameworks mostly work with a single modality, given that the nature of the input data varies vastly from one modality to another. For example, speech recognition models take a high-resolution input (such as a 16 kHz waveform) that usually contains thousands of samples. The framework then processes the waveform with a multi-layer convolutional neural network to obtain feature sequences at 50 Hz.

Structured and Contextualized Targets

The main differentiating point between data2vec and other masked prediction models is that in data2vec, the features of the training targets are contextualized. These features are built using self-attention over the entire unmasked input in teacher mode.

Some other frameworks, such as BYOL (Bootstrap Your Own Latent) and DINO, also use latent representations as targets like data2vec, but their primary focus is to learn transformation-invariant representations.

Final Thoughts

Recent work in AI and ML has indicated that uniform model architectures can be an effective approach to tackling multiple modalities. The data2vec model uses a single self-supervised learning approach for three modalities: speech, images, and language.

The key concept behind data2vec is to use a partial view of the input to regress contextualized representations of the full input data. The approach is effective: the model performs better than prior self-supervised learning models on the ImageNet-1K dataset for both ViT-B and ViT-L single models.

Data2vec is truly a milestone in self-supervised learning because it demonstrates that a single learning method can make it easier for models to learn across multiple modalities.
