Improvement of the Recommended Click Prediction Model Based on Text Similarity

• Full-text download: PDF (1141 KB)    PP. 613-621   DOI: 10.12677/CSA.2019.93069
• Supported by a funded research project

In order to further improve the accuracy of click prediction for recommended content in a search engine, a method based on similarity features of the search content is proposed. The method is composed of multiple decision-tree models, and hierarchical softmax is used to convert the output into a binary classification result. To capture the semantics of the user's search text, the similarity between the user's input text and the recommended content's title, high-frequency related content, and tags is used to increase the accuracy of the click prediction model. Word segmentation is performed with jieba, and word2vec is used to train all words and construct a word-vector model. Finally, LightGBM is used to build the prediction model. Then 50,000 of the 2.05 million user search records are taken as the validation set and the rest as the training set. Experimental results show that the accuracy of the model improves after the similarity features are added.

1. Introduction

2. Related Work

2.1. Word Segmentation

2.2. Word Vectors

2.3. Neural Network Language Model (NNLM)

The NNLM (neural network language model) was proposed by Bengio in 2003 [9] and has since been widely applied to speech recognition systems [10] and context analysis [11]. The principle of the NNLM is to predict the current word from the n − 1 words that precede it. The network represents words as distributed-representation word vectors and takes them as input. The model consists of four layers: an input layer, a projection layer, a hidden layer, and an output layer (as shown in Figure 1).

Figure 1. Neural network language model structure diagram

$f\left(w_t, w_{t-1}, \cdots, w_{t-n+2}, w_{t-n+1}\right) = p\left(w_t \mid w_1^{t-1}\right)$ (1)

$L = \frac{1}{T} \sum_t \log f\left(w_t, w_{t-1}, \cdots, w_{t-n+2}, w_{t-n+1}; \theta\right) + R(\theta)$ (2)
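The four-layer forward pass described above can be sketched as follows. This is a minimal illustration of Equation (1), not the paper's configuration: the layer sizes, the random weights, and the context word ids are all made up.

```python
import numpy as np

# Minimal forward pass of an NNLM: predict w_t from the previous n-1 words.
# All dimensions and the random weights below are illustrative only.
rng = np.random.default_rng(0)

V, m, h, n = 10, 8, 16, 4               # vocab size, embedding dim, hidden dim, n-gram order
C = rng.normal(size=(V, m))             # projection layer: shared word-embedding matrix
H = rng.normal(size=(h, (n - 1) * m))   # input-to-hidden weights
d = np.zeros(h)                         # hidden bias
U = rng.normal(size=(V, h))             # hidden-to-output weights
W = rng.normal(size=(V, (n - 1) * m))   # direct input-to-output connection
b = np.zeros(V)                         # output bias

def nnlm_forward(context_ids):
    """Return p(w_t | w_{t-n+1}, ..., w_{t-1}) over the whole vocabulary."""
    x = C[context_ids].reshape(-1)          # concatenate the n-1 embeddings
    a = b + W @ x + U @ np.tanh(d + H @ x)  # output-layer scores
    e = np.exp(a - a.max())                 # softmax with overflow guard
    return e / e.sum()

p = nnlm_forward([1, 5, 3])                 # a 3-word context for a 4-gram model
```

The softmax at the output layer is exactly the normalization step that Section 2.4's hierarchical softmax replaces.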

2.4. word2vec

word2vec is a word-vector training tool released by Google in 2013. Given a corpus, its optimized models train words into vector form, and the semantic similarity between keywords can then be computed from these vectors [12]. word2vec relies on skip-gram or CBOW to build the word embeddings. For normalization it uses hierarchical softmax [13], which improves on the computational efficiency of the conventional softmax.
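To illustrate the efficiency point: with hierarchical softmax each word sits at a leaf of a Huffman tree, and its probability is a product of O(log V) binary decisions instead of a V-way normalization. The toy corpus counts and the scalar per-node parameters below are invented for illustration; real word2vec scores each inner node against the context vector rather than with a fixed scalar.

```python
import heapq, math, random

# Toy hierarchical softmax over a 5-word vocabulary (counts are made up).
counts = {"搜索": 8, "推荐": 5, "点击": 4, "模型": 2, "相似度": 1}

# Build a Huffman tree: repeatedly merge the two rarest subtrees.
heap = [(c, i, [w]) for i, (w, c) in enumerate(counts.items())]
heapq.heapify(heap)
paths = {w: [] for w in counts}          # word -> list of (inner-node id, branch bit)
node_id, tie = 0, len(counts)
while len(heap) > 1:
    c1, _, left = heapq.heappop(heap)
    c2, _, right = heapq.heappop(heap)
    for w in left:
        paths[w].insert(0, (node_id, 0)) # this word branches left at the new node
    for w in right:
        paths[w].insert(0, (node_id, 1)) # this word branches right
    heapq.heappush(heap, (c1 + c2, tie, left + right))
    node_id += 1
    tie += 1

random.seed(0)
theta = [random.uniform(-1, 1) for _ in range(node_id)]  # one score per inner node

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob(word):
    """p(word) = product of branch probabilities along its Huffman path."""
    p = 1.0
    for node, bit in paths[word]:
        s = sigmoid(theta[node])         # probability of branching right
        p *= s if bit == 1 else (1.0 - s)
    return p

total = sum(prob(w) for w in counts)     # equals 1 without a V-way softmax
```

Because each node splits its probability mass between two children, the leaf probabilities sum to 1 by construction, which is what lets the tree replace the full softmax.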

CBOW and Skip-gram

Figure 2. Schematic diagram of CBOW and Skip-gram

3. Our Work

3.1. Word Segmentation Error Correction

N-gram substrings: {[广东, 工业, 大学], [工业, 大学, 是], [大学, 是, 以], [是, 以, 工], [以, 工, 为主], [为主, 的, 多学科], [的, 多学科, 协调], [多学科, 协调, 发展], [协调, 发展, 的], [发展, 的, 大学]}

Figure 3. Substring of n-gram
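Trigram substrings like those in Figure 3 can be produced with a plain sliding window over the segmented sentence. The function name below is ours; the token list is the segmented example sentence from the figure.

```python
def ngrams(tokens, n=3):
    """Slide a window of n words over a segmented sentence."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# Segmented example sentence from Figure 3.
tokens = ["广东", "工业", "大学", "是", "以", "工", "为主",
          "的", "多学科", "协调", "发展", "的", "大学"]
subs = ngrams(tokens)   # 13 tokens yield 11 trigram substrings
```

A segmentation of T words yields T − n + 1 substrings of length n; scoring each substring against an n-gram language model is what exposes an unlikely (probably mis-segmented) word sequence.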

3.2. Building the Click Prediction Model from Click-Through-Rate and Word-Similarity Features

Table 1. Features

Figure 4. Click-through-rate features

Figure 5. Features with added similarity
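A common way to turn trained word vectors into text-similarity features like those in Figure 5 is to average the word vectors of each text and take the cosine of the two averages. The paper does not spell this step out, so the sketch below is an assumed construction, and the three-dimensional vectors are made up; in practice they would come from the trained word2vec model.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length word vectors into one text vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up 3-d word vectors for a two-word query and a one-word title.
query_vecs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]]
title_vecs = [[0.8, 0.2, 0.1]]
feature = cosine(mean_vector(query_vecs), mean_vector(title_vecs))
```

The same computation, repeated against the title, the high-frequency related content, and the tags, yields one similarity feature per comparison for the prediction model.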

Figure 6. The boosting model

3.3. Analysis and Comparison of Experimental Results

Table 2. Sample data

Table 3. Data field description

TP (True Positive): predicted positive, actually positive

FP (False Positive): predicted positive, actually negative

FN (False Negative): predicted negative, actually positive

TN (True Negative): predicted negative, actually negative

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (3)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (4)

$F1\text{-}Score = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (5)
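As a worked instance of Equations (3)-(5), the confusion-matrix counts below are made up, not the paper's results:

```python
# Made-up confusion-matrix counts for a binary click/no-click prediction.
TP, FP, FN, TN = 80, 20, 10, 90

precision = TP / (TP + FP)                           # Eq. (3): 80/100 = 0.8
recall = TP / (TP + FN)                              # Eq. (4): 80/90
f1 = 2 * precision * recall / (precision + recall)   # Eq. (5): harmonic mean
```

The F1-score is the harmonic mean of precision and recall, so it rewards a model only when both are high at once.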

Table 4. Results of using only CTR (Click Through Rate)

Table 5. Result after adding semantic similarity

Figure 7. Model comparison between using click-through rate only and adding similarity features

4. Conclusion

NOTES

*Corresponding author.

[1] Li, X.M., Yan, H.F. and Wang, J.M. (2012) Search Engines: Principles, Technology and Systems. (In Chinese)
[2] Yang, M.C., Lee, D.G., Park, S.Y., et al. (2015) Knowledge-Based Question Answering Using the Semantic Embedding Space. Expert Systems with Applications, 42, 9086-9104. https://doi.org/10.1016/j.eswa.2015.07.009
[3] Joachims, T. (2002) Optimizing Search Engines Using Clickthrough Data. ACM Conference on Knowledge Discovery & Data Mining, Edmonton, 23-26 July 2002, 1-21.
[4] Xing, Q., Liu, Y., Nie, J.Y., et al. (2013) Incorporating User Preferences into Click Models.
[5] Chinese Information Processing Vocabulary, Part 01: Basic Terms (GB 12200.1-90) [S]. Beijing: Standards Press of China, 1991. (In Chinese)
[6] Forney, G.D. (1973) The Viterbi Algorithm. Proceedings of the IEEE, 61, 268-278.
[7] Hinton, G.E. (1986) Learning Distributed Representations of Concepts. 8th Conference of the Cognitive Science Society, 1-11.
[8] Manning, C.D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
[9] Bengio, Y., Ducharme, R., Vincent, P., et al. (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137-1155.
[10] Lee, K., Park, C., Kim, N., et al. (2018) Accelerating Recurrent Neural Network Language Model Based Online Speech Recognition System.
[11] Deng, H., Lei, Z. and Wang, L. (2017) Global Context-Dependent Recurrent Neural Network Language Model with Sparse Feature Learning. Neural Computing & Applications, No. 6, 1-13.
[12] Shao, T., Chen, H. and Chen, W. (2018) Query Auto-Completion Based on Word2vec Semantic Similarity. Journal of Physics: Conference Series, 1004, Article ID: 012018. https://doi.org/10.1088/1742-6596/1004/1/012018
[13] Zhou, L. (2015) The Working Principle of Word2vec and Its Applications. 图书情报导刊, No. 2, 145-148. (In Chinese)
[14] Kearns, M.J. and Valiant, L.G. (1993) Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. Springer-Verlag, Berlin. https://doi.org/10.1007/3-540-56483-7_21
[15] Valiant, L. (2015) Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Common Knowledge, 21, 340. https://doi.org/10.1215/0961754X-2872666
[16] Ke, G., Meng, Q., Finley, T., et al. (2017) LightGBM: A Highly Efficient Gradient Boosting Decision Tree. 31st Conference on Neural Information Processing Systems, Long Beach, 2017, 1-11.
[17] Shi, H. (2007) Best-First Decision Tree Learning. Master's Thesis, The University of Waikato, Hamilton.