Question

How can you get a sentence embedding from BERT, how do sentence transformers differ, and why is this similar to metric learning for image pairs?

Answer it yourself

First formulate your answer as you would in an interview, then open the breakdown and grade yourself.

The basic ways to get a sentence vector from BERT are CLS pooling, or mean or max pooling over the token embeddings. Sentence transformers are trained with pair, triplet or contrastive objectives so that sentence-level embeddings have useful similarity geometry, much like image metric learning.

Full breakdown

A vanilla BERT encoder returns contextual token embeddings and often a CLS representation. To get a single sentence vector, common baselines are CLS pooling, mean pooling over non-padding tokens, max pooling, or a learned pooling head. CLS from vanilla BERT is not automatically a high-quality semantic embedding for nearest-neighbor search.

Sentence transformers keep the Transformer encoder and change the training objective. They are usually fine-tuned on sentence pairs, triplets or contrastive data so that semantically close texts end up close in embedding space and unrelated texts end up far apart. This makes cosine similarity meaningful for retrieval, clustering or semantic search.

The analogy to image metric learning is direct. For car-photo matching, hard positives and hard negatives plus contrastive or triplet losses teach the embedding space what should be close. For sentence transformers, text pairs and triplets play the same role. In both cases, mining hard negatives is often as important as the architecture.
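The pooling baselines mentioned above can be sketched in a few lines. This is a toy illustration in plain Python, not how any particular library implements it: the function names (`cls_pool`, `mean_pool`, `max_pool`, `cosine_similarity`) and the hand-written `token_embs` matrix are assumptions made for the example. In a real pipeline the token vectors would come from the encoder's output and the mask from the tokenizer.

```python
import math

def cls_pool(token_embs):
    """Return the vector at position 0 (the [CLS] slot in BERT-style inputs)."""
    return list(token_embs[0])

def mean_pool(token_embs, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_embs, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]

def max_pool(token_embs, attention_mask):
    """Element-wise maximum over non-padding token vectors."""
    kept = [vec for vec, keep in zip(token_embs, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [max(col) for col in zip(*kept)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy input: 4 token positions, 2 dimensions each.
# The last position is padding and must not influence mean or max pooling.
token_embs = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_cls = cls_pool(token_embs)
sentence_mean = mean_pool(token_embs, attention_mask)
sentence_max = max_pool(token_embs, attention_mask)
```

The pooling step is the same whether the encoder is vanilla BERT or a sentence transformer. What changes is how the encoder was trained, which determines whether `cosine_similarity` between two pooled vectors reflects semantic closeness.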

Common mistakes

Assuming vanilla BERT CLS is always a good sentence embedding.
Describing sentence transformers as a completely different Transformer architecture.
Forgetting hard-negative mining in metric-learning setups.
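
To see why hard-negative mining matters, here is a minimal sketch of a triplet loss with a simple "pick the closest negative" rule. It assumes Euclidean distance and a margin of 1.0. The helper names (`euclidean`, `triplet_loss`, `hardest_negative`) and the two-dimensional vectors are hypothetical stand-ins for real sentence or image embeddings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

def hardest_negative(anchor, candidates):
    """Pick the candidate negative that lies closest to the anchor."""
    return min(candidates, key=lambda c: euclidean(anchor, c))

# Hypothetical embeddings: the positive is at distance 5 from the anchor,
# one negative is far away (distance 10), the other is close (distance 5.5).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]
negatives = [[10.0, 0.0], [0.0, 5.5]]

# The far negative already satisfies the margin, so it contributes no loss.
easy_loss = triplet_loss(anchor, positive, negatives[0])

# The closest negative violates the margin and produces a training signal.
hard = hardest_negative(anchor, negatives)
hard_loss = triplet_loss(anchor, positive, hard)
```

With the far negative the loss is zero, so the model learns nothing from that triplet. Only the hard negative yields a nonzero loss. The same logic applies to car-photo triplets and sentence triplets alike.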