Improvement of the Recommended Click Prediction Model Based on Text Similarity

To further improve the accuracy of click prediction for recommended content in a search engine, a method based on similarity features of the search content is proposed. The model is composed of multiple decision trees, and hierarchical softmax is used to convert the output into binary classification results. To capture the semantics of the user's search text, the similarity between the user's input text and the recommended content title, high-frequency related content, and recommended content tags is used to increase the accuracy of the click prediction model. Text is segmented into words with jieba, and word2vec is trained on all the words to construct a word vector model. Finally, LightGBM is used to build the prediction model. Of the 2.05 million user search records, 50,000 are taken as the validation set and the rest as the training set. Experimental results show that the accuracy of the model improves after the similarity features are added.

1. Introduction

2. Related Work

2.1. Word Segmentation

2.2. Word Vectors

2.3. Neural Network Language Model (NNLM)

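The neural network language model (NNLM) was proposed by Bengio in 2003 [9] and is now widely used in speech recognition systems [10], context analysis [11], and other areas. The principle of NNLM is to use the preceding n − 1 words to predict the n-th word. The model represents words with distributed representation word vectors and takes them as input. It has four layers: an input layer, a projection layer, a hidden layer, and an output layer (as shown in Figure 1).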

Figure 1. Neural network language model structure diagram

$f\left({w}_{t},{w}_{t-1},\cdots ,{w}_{t-n+2},{w}_{t-n+1}\right)=p\left({w}_{t}|{w}_{1}^{t-1}\right)$ (1)

$L=\frac{1}{T}\underset{t}{\sum }\mathrm{log}f\left({w}_{t},{w}_{t-1},\cdots ,{w}_{t-n+2},{w}_{t-n+1};\theta \right)+R\left(\theta \right)$ (2)
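The forward pass behind Eq. (1) and the objective in Eq. (2) can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the paper: the parameter names (`C`, `H`, `d`, `U`, `b`), the helper `init_params`, and the choice of an L2 penalty for $R(\theta)$ are all assumptions made for the example.

```python
import math
import random


def init_params(vocab_size, embed_dim, hidden_dim, n, seed=0):
    """Create small random parameters (hypothetical helper for this sketch)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    C = mat(vocab_size, embed_dim)             # word vectors, one row per word
    H = mat(hidden_dim, (n - 1) * embed_dim)   # projection -> hidden weights
    d = [0.0] * hidden_dim                     # hidden biases
    U = mat(vocab_size, hidden_dim)            # hidden -> output weights
    b = [0.0] * vocab_size                     # output biases
    return C, H, d, U, b


def nnlm_forward(context_ids, C, H, d, U, b):
    """Return p(w_t | context) over the whole vocabulary, as in Eq. (1)."""
    # Projection layer: look up and concatenate the n-1 context word vectors.
    x = [v for idx in context_ids for v in C[idx]]
    # Hidden layer: tanh(d + H x).
    hidden = [math.tanh(d_j + sum(h * xi for h, xi in zip(row, x)))
              for row, d_j in zip(H, d)]
    # Output layer: scores b + U hidden, normalised with a softmax.
    scores = [b_k + sum(u * a for u, a in zip(row, hidden))
              for row, b_k in zip(U, b)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def objective(corpus_ids, n, params, weight_decay=0.0):
    """Eq. (2): average log-probability plus a regularisation term."""
    C, H, d, U, b = params
    log_likelihood = 0.0
    count = 0
    for t in range(n - 1, len(corpus_ids)):
        context = corpus_ids[t - n + 1:t]      # the n-1 words before position t
        probs = nnlm_forward(context, C, H, d, U, b)
        log_likelihood += math.log(probs[corpus_ids[t]])
        count += 1
    # Assumption: R(theta) is an L2 penalty on the weight matrices H and U.
    penalty = -weight_decay * sum(w * w for m in (H, U) for row in m for w in row)
    return log_likelihood / count + penalty
```

With `params = init_params(5, 3, 4, 3)`, calling `nnlm_forward([0, 1], *params)` yields a probability distribution over the five vocabulary entries, and `objective` averages the log-probabilities over every position that has a full context.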

2.4. word2vec

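word2vec is a word vector training model released as open source by Google in 2013. Given a corpus, it trains words into vector form using an optimized model, and the resulting vectors can be used to compute the semantic similarity of keywords [12]. word2vec builds word embeddings with either the skip-gram or the CBOW architecture. For normalization it uses hierarchical softmax [13], which improves on the computational efficiency of the traditional softmax.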
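Once word vectors are available, a similarity score between two segmented texts can be computed in several ways. The sketch below averages the word vectors of each text and takes the cosine of the two averages. The averaging step, the function names, and the toy two-dimensional vectors are assumptions for illustration only; the vectors in practice would come from a trained word2vec model.

```python
import math


def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(query_tokens, title_tokens, vectors):
    """Similarity of two segmented texts; 0.0 when either has no known word."""
    q = average_vector(query_tokens, vectors)
    t = average_vector(title_tokens, vectors)
    if q is None or t is None:
        return 0.0
    return cosine_similarity(q, t)


# Toy vectors, invented for the example (not taken from any trained model).
toy_vectors = {
    "phone": [1.0, 0.0],
    "mobile": [0.9, 0.1],
    "fruit": [0.0, 1.0],
}
```

For instance, `text_similarity(["phone"], ["mobile"], toy_vectors)` is close to 1, while `text_similarity(["phone"], ["fruit"], toy_vectors)` is 0 because the toy vectors are orthogonal.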

CBOW and Skip-gram

Figure 2. Schematic diagram of CBOW and Skip-gram

3. Our Work

3.1. Word Segmentation Error Correction

N-gram substrings: {[广东, 工业, 大学], [工业, 大学, 是], [大学, 是, 以], [是, 以, 工], [以, 工, 为主], [工, 为主, 的], [为主, 的, 多学科], [的, 多学科, 协调], [多学科, 协调, 发展], [协调, 发展, 的], [发展, 的, 大学]}

Figure 3. Substring of n-gram
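The n-gram substrings shown above are produced by sliding a window of n tokens over the segmented sentence. A minimal sketch, assuming the sentence has already been segmented into a token list (the function name `ngram_substrings` is chosen for this example):

```python
def ngram_substrings(tokens, n=3):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


# A short fragment of the segmented example sentence.
fragment = ["是", "以", "工", "为主", "的"]
trigrams = ngram_substrings(fragment, n=3)
```

A token list shorter than n yields an empty result, so very short queries would need separate handling.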

3.2. Building the Click Prediction Model from Click-Through Rate and Word Similarity Features

Table 1. Feature

Figure 4. Feature of click rate

Figure 5. Features with added similarity

Figure 6. The Boosting model

3.3. Analysis and Comparison of Experimental Results

Table 2. Sample data

Table 3. Data field description

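TP (True Positive): predicted positive, actually positive

FP (False Positive): predicted positive, actually negative

FN (False Negative): predicted negative, actually positive

TN (True Negative): predicted negative, actually negative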

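$\mathrm{Precision}=\frac{TP}{TP+FP}$ (3)

$\mathrm{Recall}=\frac{TP}{TP+FN}$ (4)

$F1\text{-}\mathrm{Score}=2\cdot \frac{\mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ (5)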
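Eqs. (3) to (5) translate directly into code. The sketch below counts TP, FP, and FN from paired label lists and then evaluates the three formulas. The function names are chosen for this example, and returning 0.0 when a denominator is zero is an assumption rather than something the paper specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = click, 0 = no click)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Evaluate Eqs. (3), (4), and (5); zero denominators give 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 8 true positives, 2 false positives, and 8 false negatives give a precision of 0.8 and a recall of 0.5, so the F1 score falls between the two at about 0.615.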

Table 4. Results of using only CTR (Click Through Rate)

Table 5. Result after adding semantic similarity

Figure 7. Only click-through rate and similarity were used for the model comparison

4. Conclusion

