

Introducing Natural Language Search for Podcast Episodes

Published by Alexandre Tamborrino, Machine Learning Engineer

Beyond term-based Search

Until recently, Search at Spotify relied mostly on term matching. For example, if you type the query “electric cars climate impact”, Elasticsearch will return search results that contain each of those query words in their indexed metadata (like in the title of a podcast episode).

However, we know users don’t always type the exact words for what they want to listen to, and we have to use fuzzy matching, normalization, and even manual aliases to make up for it. While these techniques are very helpful for the user, they have limitations, as they cannot capture all the variations of expressing yourself in natural language, especially when using natural language sentences.

Coming back to the query “electric cars climate impact”, our Elasticsearch cluster does not actually retrieve anything for it… but does this mean that we don’t have any relevant content to show to the user for this query?
Enter Natural Language Search.

Natural Language Search

To enable users to find more relevant content with less effort, we started investigating a technique called Natural Language Search, also known as semantic search in the literature. In a nutshell, Natural Language Search matches a query and a textual document that are semantically correlated instead of requiring exact word matches. It matches synonyms, paraphrases, etc., and any variation of natural language that expresses the same meaning.
As a first step, we decided to apply Natural Language Search to podcast episode retrieval, as we thought that semantic matching would be most useful when searching for podcasts. Our solution is now deployed for most Spotify users. Thanks to this technique, we can retrieve relevant content for our example query:

Example of Natural Language Search for podcast episodes. It is noticeable that none of the retrieved episodes contain all of the query words in their titles (that’s why Elasticsearch was not picking them up). However, those episodes seem quite relevant to the user’s query.
To achieve this result, we leverage recent advances in deep learning / natural language processing (NLP) like self-supervised learning and Transformer neural networks. We also take advantage of vector search techniques like Approximate Nearest Neighbor (ANN) search for fast online serving. The rest of this post will go into further detail about our architecture.

Technical solution

We are using a machine learning technique called dense retrieval, which consists of training a model that produces query and episode vectors in a shared embedding space. A vector is just an array of float values. The aim is that the vector of a search query and the vector of a relevant episode end up close together in the embedding space. For queries, we use the query text as input to the model, and for episodes, we use a concatenation of textual metadata fields of the episode, such as its title, its description, its parent podcast show’s title and description, and so on.
During live search traffic, we can then use vector search techniques to efficiently retrieve the episodes whose vectors are closest to the query vector.
Shared embedding space for queries and podcast episodes.
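To make the idea concrete, here is a minimal sketch of dense retrieval with cosine similarity. The encoder, the vector dimension, and the metadata concatenation are placeholders for illustration, not our production code:

```python
import numpy as np

def encode(texts):
    """Placeholder for a Transformer sentence encoder that maps each text
    to a fixed-size float vector (e.g. 768 dimensions)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768)).astype(np.float32)

# Episodes are represented by a concatenation of textual metadata fields.
episodes = [
    "Episode title. Episode description. Show title. Show description.",
    "Another episode's concatenated metadata.",
]
episode_vectors = encode(episodes)

# At search time, the query is encoded into the same embedding space...
query_vector = encode(["electric cars climate impact"])[0]

# ...and relevance is approximated by cosine similarity.
def cosine_similarity(q, e):
    return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))

scores = [cosine_similarity(query_vector, e) for e in episode_vectors]
best = int(np.argmax(scores))
print(f"closest episode: {best}, score: {scores[best]:.3f}")
```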

Picking the right pre-trained Transformer model for our task

Transformer models like BERT are currently state of the art on the vast majority of NLP tasks. The power of BERT mainly comes from two aspects:

  • Self-supervised pre-training on a big corpus of text — the model is pre-trained on the Wikipedia and BookCorpus datasets with the main objective of predicting randomly masked words in sentences.
  • Bidirectional self-attention mechanism, which allows the model to produce high-quality contextual word embeddings.

However, the vanilla BERT model is not well suited for our use case:

  • Its self-supervised pre-training strategy is mostly focused on producing high-quality contextual word embedding, but off-the-shelf sentence representations are not very good, as shown in this paper on SBERT, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks”. 
  • It was pre-trained on English text only, whereas we wanted to support multilingual queries and episodes.

Consequently, after a few experiments, we chose as our base model the Universal Sentence Encoder CMLM model, as detailed in the paper “Universal Sentence Representation Learning with Conditional Masked Language Model”. It has the following advantages for us:

  • The newly introduced self-supervised objective Conditional Masked Language Modeling (CMLM) is designed to produce high-quality sentence embeddings directly.
  • The model is pre-trained on a huge multilingual corpus covering more than 100 languages, and is publicly available on TFHub.
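As a rough illustration, the model can be loaded from TFHub and used to embed sentences roughly as follows. The TFHub handles and the output key below are assumptions based on the published multilingual USE-CMLM model, so check tfhub.dev for the current versions:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the SentencePiece ops used by the preprocessor

# Assumed TFHub handles for the multilingual USE-CMLM model; verify on tfhub.dev.
PREPROCESS = "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2"
ENCODER = "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1"

preprocessor = hub.KerasLayer(PREPROCESS)
encoder = hub.KerasLayer(ENCODER)

sentences = tf.constant([
    "electric cars climate impact",
    "impact des voitures électriques sur le climat",
])
# "default" holds the pooled sentence embedding produced by CMLM pre-training.
embeddings = encoder(preprocessor(sentences))["default"]
print(embeddings.shape)  # e.g. (2, 768)
```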

Preparing the data

Once we have this powerful pre-trained Transformer model, we need to fine-tune it on our target task of doing Natural Language Search on Spotify’s podcast episodes. To that end, we preprocess different types of data:

  • From our past search logs, we take successful podcast searches and create (query, episode) pairs. Those successes mostly come from previous queries and their returned results through Elasticsearch.
  • Also from our past search logs, we examined user sessions to find successful attempts at search queries after an initial search came up unsuccessful. We used those query reformulations to create (query_prior_to_successful_reformulation, episode) pairs.
    With this data source, we hope to capture in our training set more “semantic” (query, episode) pairs with no exact word matching between the query and the episode metadata.
  • To further extend and diversify our training set, we generate synthetic queries from popular episode titles and descriptions, inspired by the paper “Embedding-based Zero-shot Retrieval through Query Generation”. More specifically, we fine-tuned a BART transformer model on the MS MARCO dataset, and used it to generate (synthetic_query, episode) pairs (see the sketch after this list).
  • Finally, a small curated set of “semantic” queries was manually written for popular episodes.
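The following is a minimal sketch of the query-generation step using the Hugging Face transformers library. The checkpoint path is a placeholder for a BART model fine-tuned on MS MARCO, not the model we actually used:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical checkpoint: a BART model already fine-tuned on MS MARCO
# to generate queries from passages (document expansion / query generation).
CHECKPOINT = "path/to/bart-finetuned-on-msmarco"

tokenizer = BartTokenizer.from_pretrained(CHECKPOINT)
model = BartForConditionalGeneration.from_pretrained(CHECKPOINT)

episode_text = "Episode title. Episode description. Show title. Show description."
inputs = tokenizer(episode_text, return_tensors="pt", truncation=True, max_length=512)

# Sample a few diverse synthetic queries for this episode.
outputs = model.generate(
    **inputs, do_sample=True, top_k=50, num_return_sequences=3, max_length=32
)
synthetic_queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
for q in synthetic_queries:
    print(q)  # each (q, episode) pair is added to the training set
```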

All types of preprocessed data are used for both training and evaluation, except for the curated dataset, which is used for evaluation only. Furthermore, when splitting the data, we make sure that the episodes in the evaluation set are not present in the training set (to be able to verify the generalization ability of our model to new content).
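For illustration, such an episode-disjoint split can be done with scikit-learn’s GroupShuffleSplit; this is a minimal sketch with made-up data, not our actual pipeline:

```python
from sklearn.model_selection import GroupShuffleSplit

# pairs: (query, episode_id) training examples pooled from all data sources.
pairs = [
    ("electric cars climate impact", "ep_1"),
    ("evs and the environment", "ep_1"),
    ("history of jazz", "ep_2"),
    ("sleep meditation", "ep_3"),
]
episode_ids = [episode_id for _, episode_id in pairs]

# Group by episode so that no eval-set episode ever appears in the training set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, eval_idx = next(splitter.split(pairs, groups=episode_ids))

train_pairs = [pairs[i] for i in train_idx]
eval_pairs = [pairs[i] for i in eval_idx]
assert not {e for _, e in train_pairs} & {e for _, e in eval_pairs}
```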

Training

We have our model and our data, so the next step is training. We use the pre-trained Universal Sentence Encoder CMLM model as a starting point for both the query encoder and the episode encoder. After a few experiments, we chose a siamese network setting where the weights are shared between the two encoders. Cosine similarity is used as the similarity measure between a query vector and an episode vector.
Model architecture: bi-encoder Transformer in a siamese setting.
In order to effectively train our model, we need both positive (query, episode) pairs and negative ones. However, we only mine positive pairs from our training data. How then can we generate negative pairs to teach the model when not to retrieve an episode given a query?
Inspired by multiple papers in the literature, like “Dense Passage Retrieval for Open-Domain Question Answering (DPR)” and “Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook”, we use a technique called in-batch negatives to efficiently generate random negative pairs during training: for each (query, episode) positive pair in a training batch of size B, we take the episodes from the other pairs in the batch as negatives for the query of that positive pair. As a result, we have B positive pairs and B² − B negative pairs per batch. For computational efficiency, we encode the query and episode vectors of each positive pair once, and then compute the in-batch cosine similarity matrix from those vectors:
Cosine similarity matrix: qᵢ (respectively eᵢ) represents the embedding of the i-th query (respectively episode) in a batch of size B.
In this matrix, the diagonal corresponds to the cosine similarities of the positive pairs, while the other values are the cosine similarities of the negative pairs. From there, we can apply several losses to the matrix, such as a mean squared error (MSE) loss using the identity matrix as the label. In our latest iteration, we went a step further and used techniques like in-batch hard negative mining and margin loss to substantially improve our offline metrics.
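Below is a minimal TensorFlow sketch of the in-batch negatives trick with an MSE loss against the identity matrix, as described above; it is illustrative only and not our training code:

```python
import tensorflow as tf

def in_batch_negatives_loss(query_vectors, episode_vectors):
    """query_vectors, episode_vectors: [B, D] embeddings of the B positive
    (query, episode) pairs in the batch, each encoded once."""
    q = tf.math.l2_normalize(query_vectors, axis=1)
    e = tf.math.l2_normalize(episode_vectors, axis=1)

    # [B, B] cosine similarity matrix: the diagonal holds the positive pairs,
    # the off-diagonal entries are the B^2 - B in-batch negatives.
    sim = tf.matmul(q, e, transpose_b=True)

    # MSE loss with the identity matrix as the label: positives are pushed
    # towards 1, in-batch negatives towards 0.
    labels = tf.eye(tf.shape(sim)[0])
    return tf.reduce_mean(tf.square(sim - labels))

# Example with random vectors standing in for encoder outputs (B=4, D=768).
q = tf.random.normal([4, 768])
e = tf.random.normal([4, 768])
print(in_batch_negatives_loss(q, e).numpy())
```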

Offline evaluation

In order to evaluate our model, we use two kinds of metrics:

  • In-batch metrics: Using in-batch negatives, we can efficiently compute metrics like Recall@1 and Mean Reciprocal Rank (MRR) at the batch level.
  • Full-retrieval setting metrics: In order to also evaluate our model in a more realistic setting with more candidates than in a batch, during training, we periodically compute the vectors of all episodes in our eval set and compute metrics like Recall@30 and MRR@30 using the queries from the same eval set. We also compute retrieval metrics on our curated dataset.
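As an illustration of the full-retrieval setting, here is a minimal sketch that computes Recall@K and MRR@K with a brute-force cosine search over all eval episodes (names and shapes are illustrative, not our evaluation code):

```python
import numpy as np

def retrieval_metrics(query_vecs, episode_vecs, true_episode_idx, k=30):
    """query_vecs: [N, D]; episode_vecs: [M, D] (all eval episodes);
    true_episode_idx: [N] index of the relevant episode for each query."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    e = episode_vecs / np.linalg.norm(episode_vecs, axis=1, keepdims=True)
    sims = q @ e.T                              # [N, M] cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]    # top-k episode indices per query

    recall_hits, reciprocal_ranks = [], []
    for i, relevant in enumerate(true_episode_idx):
        ranked = list(top_k[i])
        hit = relevant in ranked
        recall_hits.append(hit)
        reciprocal_ranks.append(1.0 / (ranked.index(relevant) + 1) if hit else 0.0)
    return np.mean(recall_hits), np.mean(reciprocal_ranks)  # Recall@k, MRR@k

# Example with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
recall, mrr = retrieval_metrics(rng.normal(size=(100, 64)),
                                rng.normal(size=(1000, 64)),
                                rng.integers(0, 1000, size=100))
print(f"Recall@30={recall:.3f}  MRR@30={mrr:.3f}")
```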

Integration with production

Now that we have our fine-tuned query encoder and episode encoder, we need to integrate them into our search production workflow.

Offline indexing of episode vectors

Episode vectors are pre-computed for a large set of episodes using our episode encoder in an offline pipeline. Those vectors are then indexed in the Vespa search engine, leveraging its native support for ANN search. ANN allows retrieval latency on tens of millions of indexed episodes to remain acceptable while having a minimal impact on retrieval metrics.
Furthermore, Vespa lets us define a first-phase ranking function (executed on each Vespa content node) to efficiently re-rank the top episodes retrieved by ANN using other features like episode popularity.

Online query encoding and retrieval

When a user types a query, the query vector is computed on the fly by leveraging Google Cloud Vertex AI, where the query encoder is deployed. Our main reason for using Vertex AI is its support for GPU inference, as large Transformer models like ours are usually more cost-efficient when run on GPUs, even for inference (this was confirmed by our load-testing experiments, with a 6x cost reduction factor between T4 GPUs and CPUs). Once the query vector is computed, it is used to retrieve from Vespa the top 30 “semantic podcast episodes” (using our previously described ANN search). A vector cache is also used to avoid computing the same query vector too often.
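Here is a minimal sketch of that online flow. The Vertex AI endpoint, the Vespa field and rank-profile names, and the YQL below are assumptions for illustration (the exact nearestNeighbor syntax varies across Vespa versions), not our production code:

```python
import requests
from functools import lru_cache
from google.cloud import aiplatform

# Hypothetical Vertex AI endpoint serving the GPU-backed query encoder.
ENDPOINT = aiplatform.Endpoint("projects/my-project/locations/europe-west1/endpoints/123")

@lru_cache(maxsize=100_000)        # simple stand-in for the query-vector cache
def encode_query(query: str) -> tuple:
    prediction = ENDPOINT.predict(instances=[{"text": query}])
    return tuple(prediction.predictions[0])          # the query embedding

def semantic_episode_search(query: str, top_k: int = 30):
    query_vector = list(encode_query(query))
    # Vespa ANN query: nearestNeighbor over the pre-computed episode embeddings.
    body = {
        "yql": f"select * from episode where {{targetHits:{top_k}}}"
               "nearestNeighbor(embedding, query_embedding)",
        "input.query(query_embedding)": query_vector,
        "hits": top_k,
        "ranking": "ann_with_popularity",            # hypothetical rank profile
    }
    return requests.post("http://vespa.internal:8080/search/", json=body).json()
```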

There is no silver bullet in retrieval

Although dense retrieval / Natural Language Search has very interesting properties, it often fails to perform as well as traditional IR methods on exact term matching (and is also more expensive to run on all queries). That’s why we decided to make our Natural Language Search an additional source rather than a replacement for our other retrieval sources (including our Elasticsearch cluster).
In Search at Spotify, we have a final-stage reranking model that takes the top candidates from each retrieval source and performs the final ranking shown to the user. To allow this model to better rank semantic candidates, we added the (query, episode) cosine similarity value to its input features.
Multi-source retrieval and ranking.

Conclusion and future works

Using this multi-source retrieval and ranking architecture with Natural Language Search, we successfully launched an A/B test that resulted in a significant increase in podcast engagement. Our initial version of Natural Language Search has now been rolled out to most of our users, and new model improvements are already on their way.

Our team is excited about this successful first step, and we already have several ideas for future work, ranging from model architecture improvements to better blending of dense and sparse retrieval and increased coverage. Stay tuned!
TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

Tags: Machine Learning
