In real life, we rarely experience a mind-blowing moment that makes us rethink everything we know. New experiences are built on top of past ones, and gathering knowledge is a gradual process, not a sudden one. When I learn a new word, I try to understand it through the concepts I already know. If I know the word “dog”, I can easily understand the word “puppy”.
Somehow, we accept that representation models, a.k.a. embedding models, behave differently. They have a tokenizer with a fixed vocabulary, and if we want to teach them a new word, we have to extend the vocabulary and fine-tune the model with the new data. After the fine-tuning, all the computed embeddings of the documents are no longer valid, as the model parameters have changed. We must recompute them all and reindex the documents in the vector search engine. That’s acceptable for thousands of documents, but what if we have millions of them? That’s a costly operation, forcing us to plan the changes ahead. It’s as if the model had to rethink everything it knows each time it learns something new.
What if I told you there is a way to keep the majority of your embeddings intact and still iteratively learn new words? Let’s discuss the Word Injection technique.
Input token embeddings
The tokenizer gets the input string, divides it into a sequence of tokens, and then maps each token to a corresponding numerical ID. Notably, changing the order of the words in a sentence does not change the set of input IDs we get; only their order differs.
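To make this concrete, here is a minimal sketch using the Hugging Face tokenizer that ships with all-MiniLM-L6-v2, the model we will extend later in this article. The exact IDs depend on the tokenizer, but the order-invariance of the ID set holds either way:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Two sentences with the same words in a different order
ids_a = tokenizer("dogs chase cats", add_special_tokens=False)["input_ids"]
ids_b = tokenizer("cats chase dogs", add_special_tokens=False)["input_ids"]

print(ids_a)                           # a sequence of token IDs
print(sorted(ids_a) == sorted(ids_b))  # True - same set of IDs, different order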
The sequence of IDs is passed to the model, which, in turn, uses them in a lookup table to get a sequence of input token embeddings. These are context-free, meaning that a particular token will always get the same embedding, no matter the order of the tokens in a sequence.
Input token embeddings are trainable model parameters learned based on the training data. The cross-token relationships are found later on, using an attention mechanism, but at the input level, the model gets the same vector for a particular token, no matter the context.
Because the input token embeddings are context-independent, we can explore this semantic space easily: there is a finite number of tokens in the vocabulary, and their corresponding embeddings are fixed. Mapping this space to 2D with t-SNE is an interesting exercise.
It shows that semantically close tokens are also close to each other in that space, so the input token embeddings capture the initial meaning of each token. The screenshot above comes from the tokens visualization app I implemented to explore the input token embeddings of the Sentence Transformers models.
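If you would like to reproduce a similar map yourself, here is a minimal sketch of the idea, assuming the Hugging Face transformers and scikit-learn packages. Projecting the whole vocabulary takes a while, so you may want to subsample it:

from sklearn.manifold import TSNE
from transformers import AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)

# The context-free lookup table of input token embeddings: (vocab_size, hidden_dim)
embedding_matrix = model.get_input_embeddings().weight.detach().numpy()
print(embedding_matrix.shape)  # (30522, 384) for this model

# Map the whole vocabulary to 2D; semantically related tokens should land close together
coords_2d = TSNE(n_components=2, metric="cosine").fit_transform(embedding_matrix)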
Practical implications of the tokenizer
There are a few issues with the tokenizer that may impact the quality of the embeddings computed by the model.
Unknown tokens
Each pre-trained embedding model has a fixed vocabulary of a specific size, and a matrix of input token embeddings whose length equals the number of tokens in that vocabulary. However, in the wild, we may encounter situations where the tokenizer does not recognize some characters, and they are converted into unknown tokens, typically [UNK]. This token has a single input token embedding assigned, and it does not capture any specific meaning.
Emoji is a good example of a token that is often converted into [UNK]. It is a popular way of expressing emotions on social media, and its sentiment is lost when we use a model that does not recognize it. Even if you think you are safe, an unknown token may pollute your data even further, as the pre-tokenization won’t treat such a character as a separator (please note there is no space between “happy” and “😊” in the example below).
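A quick check with the model’s tokenizer illustrates the problem; this is a minimal sketch, and the exact output may differ between tokenizer versions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

print(tokenizer.tokenize("I am so happy😊"))
# ['i', 'am', 'so', '[UNK]'] - the emoji is not split off from the word,
# so the whole "happy😊" chunk collapses into a single unknown token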
We lost one of the meaningful tokens, and the model will certainly not be able to capture the sentiment of the text.
Using the unknown token has some further implications. [UNK] is a special token only at the tokenizer layer; the model assigns it a single input embedding, just like any other token. If we have two different emojis converted into [UNK], their embeddings will be identical, even if the meaning of the text is different.
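You can see this directly on the token IDs; a short sketch, assuming neither emoji is in the vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Two emojis with opposite sentiment end up as the very same token ID
ids_happy = tokenizer("😊", add_special_tokens=False)["input_ids"]
ids_angry = tokenizer("😡", add_special_tokens=False)["input_ids"]
print(ids_happy, ids_angry)  # e.g. [100] [100] - identical IDs, hence identical input embeddings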
Our model is not able to capture the meaning of the emojis, and it is a significant issue if we want to analyze social media data. But that’s just a silly example, of course. The same issue may happen if you work with non-English data, and the tokenizer does not recognize some characters.
One possible way of detecting whether your data is polluted is to measure the ratio of unknown tokens. It’s a very simple metric, but it may give you a hint that something is wrong with the tokenizer. Ideally, you should also collect all the problematic words and check whether they are meaningful in the context of your data.
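A rough sketch of such a check could look like this; the helper function and the sample sentences are just illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def unknown_token_ratio(texts: list[str]) -> float:
    """Share of tokens across a corpus that the tokenizer maps to [UNK]."""
    total, unknown = 0, 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total += len(tokens)
        unknown += sum(1 for token in tokens if token == tokenizer.unk_token)
    return unknown / max(total, 1)

print(unknown_token_ratio(["I am so happy😊", "Just a regular sentence"]))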
Multi-token words
Another problem arises when the tokenizer splits a word into multiple tokens that are not meaningful root forms. For example, a company name may be split into pieces, where each piece is just a sequence of a few characters with no specific meaning.
Qdrant is a vector search engine, a.k.a. vector database, but the model does not recognize the name of the company as a single token. Instead, it is split into three tokens: q, ##dran, and ##t. These pieces may occur in multiple contexts, and they do not have any specific meaning at the input token embeddings layer. Ideally, the attention mechanism should capture the relationships between these tokens and still understand the concept, but that’s not the case.
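You can verify the split with the model’s tokenizer; a minimal sketch, assuming the tokenizer shipped with all-MiniLM-L6-v2:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

print(tokenizer.tokenize("Qdrant"))
# ['q', '##dran', '##t'] - three subword pieces with no standalone meaning

The sentence-level similarities below show the consequences: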
First sentence | Second sentence | Similarity |
---|---|---|
Qdrant | vector search engine | 0.1953 |
Qdrant | vector database | 0.1627 |
Qdrant | R.K. Narayan | 0.4126 |
Qdrant | semantic search | 0.1075 |
R.K. Narayan is an Indian writer. Please do not ask me why the model thinks that Qdrant is more similar to him than to vector search engine. However, the input embedding of the ##dran token is closest to the input embeddings of bahadur and narayan.
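For reference, here is one way such pairwise similarities can be computed with plain Sentence Transformers. This is a sketch, assuming the all-MiniLM-L6-v2 model used later in this article; the exact numbers you get depend on the model and library versions:

from sentence_transformers import SentenceTransformer, util

base_model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("Qdrant", "vector search engine"),
    ("Qdrant", "vector database"),
    ("Qdrant", "R.K. Narayan"),
    ("Qdrant", "semantic search"),
]
for first, second in pairs:
    # Encode both sentences and compare them with cosine similarity
    emb_first, emb_second = base_model.encode([first, second])
    print(f"{first} | {second} | {util.cos_sim(emb_first, emb_second).item():.4f}")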
In many practical cases, we want our system to understand the meaning of some proper names. An obvious solution is to add the name of the company to the vocabulary and fine-tune the model with the new data. That process will modify the model parameters to better capture the concepts behind the new tokens and their usage in the context of the data. However, it will also affect the embeddings of the documents we have computed so far, and we will have to recompute them all. What if you have computed the embeddings for millions of documents and only a small fraction of them contain the name of the company you are interested in? You still have to recompute all the embeddings, as the model parameters have changed and the old embeddings are no longer valid.
There is a cure for that, and it is called Word Injection.
Word Injection: unfixing the model vocabulary
Word Injection works precisely at the level of the input token embeddings. If we see that a popular word is converted into an unknown token (typically [UNK]), or is split into pieces by the tokenizer, we may prefer to learn the meaning of that word in the semantic space of the input token embeddings. Moreover, we do not modify any other model parameters except for the input token embeddings, and at this layer, we only modify the input embeddings of the newly added tokens.
The assumption is that a pre-trained model already has a good semantic space defined by the input token embeddings, and we only need to extend it with the new tokens.
I have implemented the Word Injection technique for the Sentence Transformers models as a small Python library. Let’s see how it might be used in practice. We are going to extend the all-MiniLM-L6-v2 model with some emojis.
from word_injection.model import ExtendableSentenceTransformer

# Wrap the base model so its vocabulary can be extended later on
model = ExtendableSentenceTransformer("all-MiniLM-L6-v2")
Adding new tokens
ExtendableSentenceTransformer is a thin wrapper around the SentenceTransformer model. It exposes some utility methods, but most importantly, it allows us to add new tokens to the vocabulary easily.
emojis = ["🥰", "🤔", "🤣", "🙏", "🔥", "💯", "👏", "🤨", "🎉", "🙌"]

# Extend the vocabulary with each emoji as a new, randomly initialized token
for emoji in emojis:
    model.add_token(emoji)
The add_token method extends the vocabulary of the model with the new token and initializes its input token embedding randomly. Alternatively, you can initialize it to the average of the embeddings of the tokens that you think should be semantically closest to the new token in the input space:
model.add_token(
    "qdrant",
    init_from_tokens=["vector", "database", "search", "engine", "semantic"]
)
model.get_closest_input_embeddings("qdrant", k=10)
# Out: {
# 'qdrant': 1.0,
# 'engine': 0.5607988238334656,
# 'database': 0.5517233610153198,
# 'search': 0.5276899337768555,
# 'databases': 0.48338615894317627,
# 'engines': 0.4706612527370453,
# 'vector': 0.45087435841560364,
# 'searching': 0.43133625388145447,
# 'semantic': 0.42511624097824097,
# 'searches': 0.3765649199485779
# }
Training dataset
Since the ExtendableSentenceTransformer is compatible with Sentence Transformers, you can train it in a supervised or unsupervised manner, using any of the available techniques. Our model is supposed to be used for semantic search later on, so it makes sense to train it so that the cosine similarity between the embeddings of similar documents is maximized. For that purpose, we need a dataset with pairs of sentences and their similarity scores.
sentence1 | sentence2 | score |
---|---|---|
Baby Should Be Off Soon 🥰 | Baby Should Be Off Soon. I am looking forward to it! | 0.9 |
you mean ross and rachel ? yeah 🥰 | You mean Ross and Rachel? Yeah! I love them. | 0.9 |
Not going anywhere 🥰 | I’m not going anywhere. I’m happy to be here. | 0.7 |
@CryoXIV Cursed chip from 2 years ago 🤔😳 | @CryoXIV Cursed chip from 2 years ago I’m surprised. | 0.8 |
@CardinalCathboy What is in the jar 🤔 | @CardinalCathboy What is in the jar? I am curious. | 0.9 |
@EndWokeness Wow! 🤔🧐 You’re right! | @EndWokeness Wow! You’re right! I’m impressed. | 0.8 |
A full version of the dataset is available here. It was created synthetically with Large Language Models by providing them with the sentence containing emojis and asking them to generate a counterpart sentence without emojis and the similarity score between them.
The dataset can be loaded with the HuggingFace datasets library:
from datasets import Dataset
dataset = Dataset.from_csv("./data/train.csv")
split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"]
eval_dataset = split["test"]
Fine-tuning the model
Using cosine similarity as a loss function makes sense for the semantic search task, so let’s define it:
from sentence_transformers.losses import CosineSimilarityLoss
train_loss = CosineSimilarityLoss(model=model)
We also want to use it for the evaluation, so here is how we create an evaluator:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import SimilarityFunction

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
)
Now it’s time to define the training arguments. There isn’t anything special about them, as we are using the SentenceTransformer model, and the training process is the same as for any other model:
from sentence_transformers import SentenceTransformerTrainingArguments

train_args = SentenceTransformerTrainingArguments(
    output_dir="fine-tuned-all-MiniLM-L6-v2",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    warmup_ratio=0.1,
    learning_rate=10e-3,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    save_steps=1000,
    logging_steps=10,
)
Finally, we can define the trainer:
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
The ExtendableSentenceTransformer model automatically freezes all the model parameters except for the input token embeddings. However, we still need to make sure that only the newly added tokens have their embeddings modified during the training process. We can achieve that by using the ResetFixedEmbeddingsCallback, which stores the weights before the training and resets them after each batch, except for the trainable indices:
from word_injection.callback import ResetFixedEmbeddingsCallback

reset_callback = ResetFixedEmbeddingsCallback()

# Mark only the newly added emoji tokens as trainable
for emoji in emojis:
    token_id = model.token_to_id(emoji)
    reset_callback.add_trainable_index(token_id)

# Add the callback so the fixed embeddings are reset after each batch
trainer.callback_handler.add_callback(reset_callback)
That’s not the most efficient way of implementing this mechanism, but Word Injection is still in a PoC phase. Perhaps in the future, the newly added tokens will be stored in a separate matrix, and the training process will become more efficient. Currently, there is no way in PyTorch to make only some rows of a matrix trainable, so we have to reset the fixed embeddings after each batch.
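For illustration, here is roughly what such a reset amounts to in plain PyTorch. This is a simplified sketch of the general idea, not the library’s actual implementation; the toy embedding table and the dummy loss are made up for the example:

import torch

# Toy embedding table in which only the last row (a newly added token) should learn
embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=4)
trainable_ids = torch.tensor([4])

# Snapshot of the weights taken before training starts
frozen_snapshot = embeddings.weight.detach().clone()
optimizer = torch.optim.SGD(embeddings.parameters(), lr=0.1)

# One dummy optimization step that touches both a fixed and a trainable row
loss = embeddings(torch.tensor([0, 4])).sum()
loss.backward()
optimizer.step()

# Restore every row except the trainable ones, just like the callback does after each batch
with torch.no_grad():
    mask = torch.ones(embeddings.num_embeddings, dtype=torch.bool)
    mask[trainable_ids] = False
    embeddings.weight[mask] = frozen_snapshot[mask]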
Here is how you can start the training process:
trainer.train()
If you prefer watching videos, the whole process of fine-tuning the sentence transformer model with Word Injection is also available on the Qdrant YouTube channel.
The results of the fine-tuning
Obviously, the model is not going to be perfect, but it should be able to capture the meaning of the emojis and the proper names. Let’s see what the closest embeddings of the 🤣 emoji look like:
model.get_closest_input_embeddings("🤣", k=10)
# Out: {
# '🤣': 1.0,
# '🥰': 0.31942179799079895,
# '🤔': 0.3059849143028259,
# '💯': 0.2928054928779602,
# 'laughed': 0.2544054090976715,
# 'amused': 0.24455296993255615,
# 'laughs': 0.23454934358596802,
# 'funny': 0.23370203375816345,
# 'laughter': 0.22590281069278717,
# 'amusing': 0.22161859273910522
# }
We started from a random point in the semantic space, and the closest embeddings of this particular emoji seem to make sense. Of course, such a simple and relatively short training process won’t make the model perfect.
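As a quick sanity check, you can also compare an emoji-bearing sentence against a plain paraphrase at the sentence level. This is just a sketch; the sentences are made up, and it assumes the wrapper exposes the standard encode method of Sentence Transformers:

from sentence_transformers import util

emoji_sentence = "That was hilarious 🤣"
plain_sentence = "That was hilarious, I could not stop laughing."

# Both embeddings come from the fine-tuned model created earlier in this article
emb_emoji, emb_plain = model.encode([emoji_sentence, plain_sentence])
print(util.cos_sim(emb_emoji, emb_plain).item())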
Limitations
The Word Injection technique is still under evaluation. The initial semantic space defined by the input token embeddings does not change, so a general-purpose model remains general-purpose and may not capture the nuances of a specific domain, such as medicine or law. Thus, Word Injection is not a complete replacement for full fine-tuning, but rather an alternative for those cases in which partial backward compatibility is crucial.
The hypothesis is that the best results are achieved for tokens whose meaning can already be precisely expressed with the current vocabulary. That would indicate that the meaning of the newly added tokens is already captured by the semantic space of the input token embeddings, and the tokenizer simply missed them during its training. Generally, it is hard to formalize the conditions under which Word Injection is applicable, and this remains an open question.