In real life, we rarely experience a mind-blowing moment that makes us rethink everything we know. New experiences are built on top of past ones, and gathering knowledge is a gradual process, not a sudden one. When I learn a new word, I try to understand it through the concepts I already know. If I know the word “dog”, I can easily understand the word “puppy”.
Somehow, we accept that representation models, a.k.a. embedding models, behave differently. They have a tokenizer with a fixed vocabulary, and if we want to teach them a new word, we have to extend the vocabulary and fine-tune the model with the new data. After the fine-tuning, all the computed embeddings of the documents are no longer valid, as the model parameters have changed. We must recompute them all and reindex the documents in the vector search engine. That’s acceptable for thousands of documents, but what if we have millions of them? That’s a costly operation, forcing us to plan the changes ahead. It’s as if the model had to rethink everything it knows each time it learns something new.
What if I told you there is a way to keep the majority of your embeddings intact and still iteratively learn new words? Let’s discuss the Word Injection technique.
Input token embeddings
The tokenizer gets the input string, divides it into a sequence of tokens, and then maps each token to a corresponding numerical ID. Notably, changing the order of the words in a sentence does not change the set of input IDs we get; only their order differs.
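To make this concrete, here is a minimal sketch using the Hugging Face tokenizer that ships with all-MiniLM-L6-v2, the model we will extend later in this article. The exact IDs depend on the tokenizer, but the order-invariance of the ID set holds either way:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Two sentences with the same words in a different order
ids_a = tokenizer("dogs chase cats", add_special_tokens=False)["input_ids"]
ids_b = tokenizer("cats chase dogs", add_special_tokens=False)["input_ids"]

print(ids_a)                           # a sequence of token IDs
print(sorted(ids_a) == sorted(ids_b))  # True - same set of IDs, different order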
The sequence of IDs is passed to the model, which, in turn, uses them in a lookup table to get a sequence of input token embeddings. These are context-free, meaning that a particular token will always get the same embedding, no matter the order of the tokens in a sequence.
Input token embeddings are trainable model parameters learned based on the training data. The cross-token relationships are found later on, using an attention mechanism, but at the input level, the model gets the same vector for a particular token, no matter the context.
Because the input token embeddings are context-independent, we can explore this semantic space easily: there is a finite number of tokens in the vocabulary, and their corresponding embeddings are fixed. Mapping this space to 2D with t-SNE is an interesting exercise.
It shows that semantically close tokens are also close to each other in that space, so the input token embeddings capture the initial meaning of each token. The screenshot above comes from the tokens visualization app I implemented to explore the input token embeddings of the Sentence Transformers models.
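If you would like to reproduce a similar map yourself, here is a minimal sketch of the idea, assuming the Hugging Face transformers and scikit-learn packages. Projecting the whole vocabulary takes a while, so you may want to subsample it:

from sklearn.manifold import TSNE
from transformers import AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)

# The context-free lookup table of input token embeddings: (vocab_size, hidden_dim)
embedding_matrix = model.get_input_embeddings().weight.detach().numpy()
print(embedding_matrix.shape)  # (30522, 384) for this model

# Map the whole vocabulary to 2D; semantically related tokens should land close together
coords_2d = TSNE(n_components=2, metric="cosine").fit_transform(embedding_matrix)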
Practical implications of the tokenizer
There are a few issues with the tokenizer that may impact the quality of the embeddings computed by the model.
Unknown tokens
Each pre-trained embedding model has a fixed vocabulary of a specific size, and a matrix of input token embeddings whose length equals the number of tokens in that vocabulary. However, in the wild, we may encounter situations where the tokenizer does not recognize some characters, and they are converted into unknown tokens, typically [UNK]. This token has a single input token embedding assigned, and it does not capture any specific meaning.
Emoji is a good example of a token that is often converted into [UNK]. It is a popular way of expressing emotions on social media, and its sentiment is lost when we use a model that does not recognize it. Even if you think you are safe, an unknown token may pollute your data even further, as the pre-tokenization won’t treat such a character as a separator (please note there is no space between “happy” and “😊” in the example below).
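A quick check with the model’s tokenizer illustrates the problem; this is a minimal sketch, and the exact output may differ between tokenizer versions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

print(tokenizer.tokenize("I am so happy😊"))
# ['i', 'am', 'so', '[UNK]'] - the emoji is not split off from the word,
# so the whole "happy😊" chunk collapses into a single unknown token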
We lost one of the meaningful tokens, and the model will certainly not be able to capture the sentiment of the text.
Using the unknown token has some further implications. [UNK] is a special token only at the tokenizer layer; the model assigns it a single input embedding, just like any other token. If we have two different emojis converted into [UNK], their embeddings will be identical, even if the meaning of the text is different.
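You can see this directly on the token IDs; a short sketch, assuming neither emoji is in the vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Two emojis with opposite sentiment end up as the very same token ID
ids_happy = tokenizer("😊", add_special_tokens=False)["input_ids"]
ids_angry = tokenizer("😡", add_special_tokens=False)["input_ids"]
print(ids_happy, ids_angry)  # e.g. [100] [100] - identical IDs, hence identical input embeddings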
Our model is not able to capture the meaning of the emojis, and it is a significant issue if we want to analyze social media data. But that’s just a silly example, of course. The same issue may happen if you work with non-English data, and the tokenizer does not recognize some characters.
One possible way of detecting whether your data is polluted is to measure the ratio of unknown tokens. It’s a very simple metric, but it may give you a hint that something is wrong with the tokenizer. Ideally, you should also collect all the problematic words and check whether they are meaningful in the context of your data.
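A rough sketch of such a check could look like this; the helper function and the sample sentences are just illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def unknown_token_ratio(texts: list[str]) -> float:
    """Share of tokens across a corpus that the tokenizer maps to [UNK]."""
    total, unknown = 0, 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total += len(tokens)
        unknown += sum(1 for token in tokens if token == tokenizer.unk_token)
    return unknown / max(total, 1)

print(unknown_token_ratio(["I am so happy😊", "Just a regular sentence"]))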
Multi-token words
Another problem arises when the tokenizer splits a word into multiple tokens that are not meaningful root forms. For example, a company name may be split into pieces, where each piece is just a sequence of a few characters with no specific meaning.
Qdrant is a vector search engine, a.k.a. vector database, but the model does not recognize the name of the company as a single token. Instead, it is split into three tokens: q, ##dran, and ##t. These pieces may occur in multiple contexts, and they do not have any specific meaning at the input token embeddings layer. Ideally, the attention mechanism should capture the relationships between these tokens and still understand the concept, but that’s not the case.
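You can verify the split with the model’s tokenizer; a minimal sketch, assuming the tokenizer shipped with all-MiniLM-L6-v2:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

print(tokenizer.tokenize("Qdrant"))
# ['q', '##dran', '##t'] - three subword pieces with no standalone meaning

The sentence-level similarities below show the consequences: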
First sentence | Second sentence | Similarity |
---|---|---|
Qdrant | vector search engine | 0.1953 |
Qdrant | vector database | 0.1627 |
Qdrant | R.K. Narayan | 0.4126 |
Qdrant | semantic search | 0.1075 |
R.K. Narayan is an Indian writer. Please do not ask me why the model thinks that Qdrant is more similar to him than to vector search engine. However, the input embedding of the ##dran token is closest to the input embeddings of bahadur and narayan.
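For reference, here is one way such pairwise similarities can be computed with plain Sentence Transformers. This is a sketch, assuming the all-MiniLM-L6-v2 model used later in this article; the exact numbers you get depend on the model and library versions:

from sentence_transformers import SentenceTransformer, util

base_model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("Qdrant", "vector search engine"),
    ("Qdrant", "vector database"),
    ("Qdrant", "R.K. Narayan"),
    ("Qdrant", "semantic search"),
]
for first, second in pairs:
    # Encode both sentences and compare them with cosine similarity
    emb_first, emb_second = base_model.encode([first, second])
    print(f"{first} | {second} | {util.cos_sim(emb_first, emb_second).item():.4f}")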
In many practical cases, we want our system to understand the meaning of some proper names. An obvious solution is to add the name of the company to the vocabulary and fine-tune the model with the new data. That process will modify the model parameters to better capture the concepts behind the new tokens and their usage in the context of the data. However, it will also affect the embeddings of the documents we have computed so far, and we will have to recompute them all. What if you have computed the embeddings for millions of documents and only a small fraction of them contain the name of the company you are interested in? You still have to recompute all the embeddings, as the model parameters have changed and the old embeddings are no longer valid.
There is a cure for that, and it is called Word Injection.
Word Injection: unfixing the model vocabulary
Word Injection works precisely at the level of the input token embeddings. If we see that a popular word is converted into an unknown token (typically [UNK]), or is split into pieces by the tokenizer, we may prefer to learn the meaning of that word in the semantic space of the input token embeddings. Moreover, we do not modify any other model parameters except for the input token embeddings, and at this layer, we only modify the input embeddings of the newly added tokens.
The assumption is that a pre-trained model already has a good semantic space defined by the input token embeddings, and we only need to extend it with the new tokens.
I have implemented the Word Injection technique for the Sentence Transformers models as a small Python library. Let’s see how it might be used in practice. We are going to extend the all-MiniLM-L6-v2 model with some emojis.
from word_injection.model import ExtendableSentenceTransformer

# Wrap the base model so its vocabulary can be extended later on
model = ExtendableSentenceTransformer("all-MiniLM-L6-v2")
Adding new tokens
ExtendableSentenceTransformer is a thin wrapper around the SentenceTransformer model. It exposes some utility methods, but most importantly, it allows us to add new tokens to the vocabulary easily.
emojis = ["🥰", "🤔", "🤣", "🙏", "🔥", "💯", "👏", "🤨", "🎉", "🙌"]

# Extend the vocabulary with each emoji as a new, randomly initialized token
for emoji in emojis:
    model.add_token(emoji)
The add_token method extends the vocabulary of the model with the new token and initializes its input token embedding randomly. Alternatively, you can initialize it to the average of the embeddings of the tokens that you think should be semantically closest to the new token in the input space:
model.add_token(
    "qdrant",
    init_from_tokens=["vector", "database", "search", "engine", "semantic"]
)
model.get_closest_input_embeddings("qdrant", k=10)
# Out: {
# 'qdrant': 1.0,
# 'engine': 0.5607988238334656,
# 'database': 0.5517233610153198,
# 'search': 0.5276899337768555,
# 'databases': 0.48338615894317627,
# 'engines': 0.4706612527370453,
# 'vector': 0.45087435841560364,
# 'searching': 0.43133625388145447,
# 'semantic': 0.42511624097824097,
# 'searches': 0.3765649199485779
# }
Training dataset
Since the ExtendableSentenceTransformer is compatible with Sentence Transformers, you can train it in a supervised or unsupervised manner, using any of the available techniques. Our model is supposed to be used for semantic search later on, so it makes sense to train it so that the cosine similarity between the embeddings of similar documents is maximized. For that purpose, we need a dataset with pairs of sentences and their similarity scores.
sentence1 | sentence2 | score |
---|---|---|
Baby Should Be Off Soon 🥰 | Baby Should Be Off Soon. I am looking forward to it! | 0.9 |
you mean ross and rachel ? yeah 🥰 | You mean Ross and Rachel? Yeah! I love them. | 0.9 |
Not going anywhere 🥰 | I’m not going anywhere. I’m happy to be here. | 0.7 |
@CryoXIV Cursed chip from 2 years ago 🤔😳 | @CryoXIV Cursed chip from 2 years ago I’m surprised. | 0.8 |
@CardinalCathboy What is in the jar 🤔 | @CardinalCathboy What is in the jar? I am curious. | 0.9 |
@EndWokeness Wow! 🤔🧐 You’re right! | @EndWokeness Wow! You’re right! I’m impressed. | 0.8 |
A full version of the dataset is available here. It was created synthetically with Large Language Models by providing them with the sentence containing emojis and asking them to generate a counterpart sentence without emojis and the similarity score between them.
The dataset can be loaded with the HuggingFace datasets library:
from datasets import Dataset
dataset = Dataset.from_csv("./data/train.csv")
split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"]
eval_dataset = split["test"]
Fine-tuning the model
Using cosine similarity as a loss function makes sense for the semantic search task, so let’s define it:
from sentence_transformers.losses import CosineSimilarityLoss
train_loss = CosineSimilarityLoss(model=model)
We also want to use it for the evaluation, so here is how we create an evaluator:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import SimilarityFunction

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
)
Now it’s time to define the training arguments. There isn’t anything special about them, as we are using the SentenceTransformer model, and the training process is the same as for any other model:
from sentence_transformers import SentenceTransformerTrainingArguments

train_args = SentenceTransformerTrainingArguments(
    output_dir="fine-tuned-all-MiniLM-L6-v2",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    warmup_ratio=0.1,
    learning_rate=10e-3,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    save_steps=1000,
    logging_steps=10,
)
Finally, we can define the trainer:
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
The ExtendableSentenceTransformer model automatically freezes all the model parameters except for the input token embeddings. However, we still need to make sure that only the newly added tokens have their embeddings modified during the training process. We can achieve that by using the ResetFixedEmbeddingsCallback, which stores the weights before the training and resets them after each batch, except for the trainable indices:
from word_injection.callback import ResetFixedEmbeddingsCallback

reset_callback = ResetFixedEmbeddingsCallback()

# Mark only the newly added emoji tokens as trainable
for emoji in emojis:
    token_id = model.token_to_id(emoji)
    reset_callback.add_trainable_index(token_id)

# Add the callback so the fixed embeddings are reset after each batch
trainer.callback_handler.add_callback(reset_callback)
That’s not the most efficient way of implementing this mechanism, but Word Injection is still in a PoC phase. Perhaps in the future, the newly added tokens will be stored in a separate matrix, and the training process will become more efficient. Currently, there is no way in PyTorch to make only some rows of a matrix trainable, so we have to reset the fixed embeddings after each batch.
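For illustration, here is roughly what such a reset amounts to in plain PyTorch. This is a simplified sketch of the general idea, not the library’s actual implementation; the toy embedding table and the dummy loss are made up for the example:

import torch

# Toy embedding table in which only the last row (a newly added token) should learn
embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=4)
trainable_ids = torch.tensor([4])

# Snapshot of the weights taken before training starts
frozen_snapshot = embeddings.weight.detach().clone()
optimizer = torch.optim.SGD(embeddings.parameters(), lr=0.1)

# One dummy optimization step that touches both a fixed and a trainable row
loss = embeddings(torch.tensor([0, 4])).sum()
loss.backward()
optimizer.step()

# Restore every row except the trainable ones, just like the callback does after each batch
with torch.no_grad():
    mask = torch.ones(embeddings.num_embeddings, dtype=torch.bool)
    mask[trainable_ids] = False
    embeddings.weight[mask] = frozen_snapshot[mask]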
Here is how you can start the training process:
trainer.train()
If you prefer watching videos, the whole process of fine-tuning the sentence transformer model with Word Injection is also available on the Qdrant YouTube channel.
The results of the fine-tuning
Obviously, the model is not going to be perfect, but it should be able to capture the meaning of the emojis and the proper names. Let’s see what the closest embeddings of the 🤣 emoji look like:
model.get_closest_input_embeddings("🤣", k=10)
# Out: {
# '🤣': 1.0,
# '🥰': 0.31942179799079895,
# '🤔': 0.3059849143028259,
# '💯': 0.2928054928779602,
# 'laughed': 0.2544054090976715,
# 'amused': 0.24455296993255615,
# 'laughs': 0.23454934358596802,
# 'funny': 0.23370203375816345,
# 'laughter': 0.22590281069278717,
# 'amusing': 0.22161859273910522
# }
We started from a random point in the semantic space, and the closest embeddings of this particular emoji seem to make sense. Of course, such a simple and relatively short training process won’t make the model perfect.
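As a quick sanity check, you can also compare an emoji-bearing sentence against a plain paraphrase at the sentence level. This is just a sketch; the sentences are made up, and it assumes the wrapper exposes the standard encode method of Sentence Transformers:

from sentence_transformers import util

emoji_sentence = "That was hilarious 🤣"
plain_sentence = "That was hilarious, I could not stop laughing."

# Both embeddings come from the fine-tuned model created earlier in this article
emb_emoji, emb_plain = model.encode([emoji_sentence, plain_sentence])
print(util.cos_sim(emb_emoji, emb_plain).item())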
Limitations
The Word Injection technique is still under evaluation. The initial semantic space defined by the input token embeddings does not change, so a general-purpose model remains general-purpose and may not capture the nuances of a specific domain, such as medicine or law. Thus, Word Injection is not a complete replacement for full fine-tuning, but rather an alternative for those cases in which partial backward compatibility is crucial.
The hypothesis is that the best results are achieved for tokens whose meaning can already be precisely expressed with the current vocabulary. That would indicate that the meaning of the newly added tokens is already captured by the semantic space of the input token embeddings, and the tokenizer simply missed them during its training. Generally, it is hard to formalize the conditions under which Word Injection is applicable, and this remains an open question.