JISE

Journal of Information Science and Engineering, Vol. 35 No. 3, pp. 651-674

Sentence-Ranking-Enhanced Keywords Extraction from Chinese Patents

ZHI-HONG WANG¹ AND YI GUO¹,²,³,⁺
¹Department of Computer Science and Engineering
East China University of Science and Technology
Shanghai, 200237 P.R. China
²Business Intelligence and Visualization Research Center
National Engineering Laboratory for Big Data Distribution and Exchange Technologies
Shanghai, 200436 P.R. China
³School of Information Science and Technology
Shihezi University
Shihezi, 832003 P.R. China

Patent keywords, a high-level topical representation of patents, play an important role in many patent-oriented mining tasks, such as classification, retrieval and translation. However, few studies currently focus on keyword extraction for patents, and human-annotated gold-standard datasets are lacking, especially for Chinese patents. This paper introduces a new human-annotated Chinese patent dataset and proposes a sentence-ranking-based Term Frequency-Inverse Document Frequency (SR-based TF-IDF) algorithm for patent keyword extraction, motivated by the idea that “the keywords are in the key sentences”. In the algorithm, a sentence-ranking model built on a sentence semantic graph and heuristic rules selects the top-KS percent of sentences from each patent. Finally, the proposed algorithm is compared with TF-IDF, TextRank, word2vec-weighted TextRank and the Patent Keyword Extraction Algorithm (PKEA) on the self-built Chinese patent dataset and several standard benchmark datasets. The experimental results show that the proposed algorithm effectively improves the performance of keyword extraction from Chinese patents.

Keywords: Chinese patents, key sentences, sentence-ranking model, keywords extraction, human-annotated dataset
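
The idea that "the keywords are in the key sentences" can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: sentences are taken to be already segmented into tokens, Jaccard token overlap stands in for the sentence semantic graph and heuristic rules, term frequency is counted over the retained sentences only, and the function names and the smoothed IDF formula are hypothetical choices.

```python
import math
from collections import Counter


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style walk over a similarity graph.

    Assumption: Jaccard token overlap replaces the semantic similarity
    and heuristic rules mentioned in the abstract.
    """
    n = len(sentences)
    if n == 0:
        return []
    sets = [set(s) for s in sentences]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and sets[i] and sets[j]:
                sim[i][j] = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_ks_sentences(sentences, ks_percent):
    """Keep the top ks_percent of sentences by score, in original order."""
    scores = rank_sentences(sentences)
    keep = max(1, math.ceil(len(sentences) * ks_percent / 100))
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(best)]


def sr_tfidf_keywords(doc_sentences, corpus, ks_percent=50, top_n=5):
    """Assumed scoring: TF over key sentences only, IDF over the whole corpus."""
    key_sentences = top_ks_sentences(doc_sentences, ks_percent)
    tf = Counter(tok for sent in key_sentences for tok in sent)
    total = sum(tf.values())
    n_docs = len(corpus)
    doc_sets = [{tok for sent in doc for tok in sent} for doc in corpus]
    scored = {}
    for tok, count in tf.items():
        df = sum(1 for d in doc_sets if tok in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
        scored[tok] = (count / total) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:top_n]
```

In this sketch, `ks_percent` plays the role of the KS threshold: a sentence that shares no tokens with the rest of the document receives only the baseline score and is filtered out first, so its words never reach the TF-IDF step.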