LDA Topic Modeling with Python (with Links)
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This article introduces the concept of topic modeling and the principles behind the LDA algorithm, shows how to build a basic LDA topic model in Python, and visualizes the topics with pyLDAvis.</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/45066ea37b034cfaad11eaa7d0aeb239~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=7n0g5YOzSl342Nic%2Bbam5lcnNsI%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Image source: Kamil Polak</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Introduction</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Topic modeling involves extracting features from the terms in documents and using mathematical structures and frameworks, such as matrix factorization and singular value decomposition (SVD), to generate clusters or groups of terms that are distinguishable from one another. These word clusters in turn form topics or concepts.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">These concepts can be used to interpret the themes of a corpus, and also to establish semantic connections between words that frequently occur together across documents.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Topic modeling can be applied to:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">discovering hidden topics in a dataset;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">classifying documents into the discovered topics;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">using that classification to organize, summarize, and search documents.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Several frameworks and algorithms can be used to build topic models:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Latent semantic indexing (LSI)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Latent Dirichlet Allocation (LDA)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Non-negative matrix factorization (NMF)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In this article we focus on LDA topic modeling with Python. Specifically, we will discuss:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">what Latent Dirichlet Allocation (LDA) is;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">how the LDA algorithm works;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">how to build an LDA topic model with Python.</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">What is Latent Dirichlet Allocation (LDA)?</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Latent Dirichlet Allocation (LDA) is a generative probabilistic model. It assumes that each document is a mixture of topics, similar to the probabilistic latent semantic indexing model.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In short, the idea behind LDA is that each document can be described by a distribution over topics, and each topic can be described by a distribution over words.</span></p>
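<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To make these two distributions concrete, here is a minimal sketch of the generative story that LDA assumes, using only the standard library. The topic names, vocabularies, and probabilities are made-up illustrative values, not the output of a trained model, and <code>generate_document</code> is a hypothetical helper.</span></p>

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "space": {"orbit": 0.4, "rocket": 0.4, "moon": 0.2},
}

# Hypothetical document: a probability distribution over topics.
doc_topic_dist = {"sports": 0.7, "space": 0.3}


def generate_document(doc_topic_dist, topics, n_words, rng):
    """For each word slot, draw a topic, then draw a word from that topic."""
    topic_names = list(doc_topic_dist)
    topic_weights = list(doc_topic_dist.values())
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)
document = generate_document(doc_topic_dist, topics, n_words=8, rng=rng)
print(document)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Fitting an LDA model runs this story in reverse: given only the words, it estimates the two kinds of distributions.</span></p>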
<h1 style="color: black; text-align: left; margin-bottom: 10px;">How does the LDA algorithm work?</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">LDA has two parts:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">the words that belong to a document, which we already know;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">the words that belong to a topic, or the probability that a word belongs to a topic, which we need to compute.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Note: LDA ignores the order of words in a document. Documents are usually represented with bag-of-words features.</span></p>
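<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As a small illustration of why word order does not matter, the sketch below builds a bag of words for two made-up sentences that contain the same words in a different order. It uses <code>collections.Counter</code> from the standard library rather than any topic-modeling package.</span></p>

```python
from collections import Counter

# Two hypothetical documents with the same words in a different order.
doc_a = "the cat chased the dog".split()
doc_b = "the dog chased the cat".split()

# A bag of words keeps only how often each word occurs.
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(bow_a)
print(bow_a == bow_b)  # True: the two documents look identical to a bag-of-words model
```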
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The following steps give a very simple account of how the LDA algorithm works:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">1. For each document, randomly assign each word to one of K topics (K is chosen in advance);</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">2. For each document D, go through each word W and compute:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">P(T | D): the proportion of words in document D that are assigned to topic T;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">P(W | T): the proportion of assignments to topic T, across all documents, that come from word W.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">3. Taking all other words and their topic assignments into account, reassign word W to topic T with probability P(T | D) × P(W | T).</span></p>
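<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The three steps above can be sketched as a small sampler in plain Python. This is a teaching sketch under stated assumptions, not the inference procedure of any particular library: <code>toy_lda</code> is a hypothetical function, the corpus is made up, and the smoothing constants <code>alpha</code> and <code>beta</code> are added here only so that no probability is ever exactly zero.</span></p>

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, iterations, seed=0, alpha=0.1, beta=0.01):
    """Repeatedly reassign each word to a topic with weight P(T|D) * P(W|T)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: assign every word occurrence to a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * num_topics for _ in docs]                # words of doc d in topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is in topic t
    topic_total = [0] * num_topics                              # all words in topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word so only the *other* words are counted.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: compute P(T | D) * P(W | T) for every topic (smoothed).
                weights = []
                for t in range(num_topics):
                    p_t_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p_w_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_d * p_w_t)

                # Step 3: draw the new topic in proportion to those weights.
                t_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic


# Made-up corpus with two obvious themes.
docs = [
    "cat dog cat pet".split(),
    "dog pet cat dog".split(),
    "rocket moon orbit rocket".split(),
    "moon orbit rocket moon".split(),
]
assignments, doc_topic = toy_lda(docs, num_topics=2, iterations=50)
for counts in doc_topic:
    print(counts)  # per-document word counts for each topic
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Each printed row shows how many words of a document ended up in each topic. On a corpus this small the result depends on the random seed, so treat it as an illustration of the update rule rather than a reliable topic model.</span></p>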
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The LDA topic model is illustrated below.</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/6828a9c7d77344a58386800eea910eb3~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=LyFenuE1V2FNd8sMGyDcY9qNf4w%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Image source: Wiki</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1b32cafffbc04912967cd01a744ca149~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=kf2xHotBBByF%2FuvE%2BKi0twFvkrA%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The figure below shows how each parameter connects back to the text documents and terms. Suppose we have M documents, each document has N words, and the total number of topics we want to generate is K.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The black box in the figure represents the core algorithm, which uses the parameters mentioned above to extract K topics from the documents.</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/cd3adae69d0748819bda628448c89354~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=QGHGXDpgGvLaXuL5r2E65OK1jrU%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Image source: Christine Doig</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">How to build an LDA topic model with Python</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We will use the Latent Dirichlet Allocation (LDA) implementation from the Gensim package.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">First, we import the packages. The core packages are re, gensim, spacy, and pyLDAvis. We also use matplotlib, numpy, and pandas for data handling and visualization.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code>import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spaCy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (the module is named pyLDAvis.gensim_models in pyLDAvis 3.x)
import matplotlib.pyplot as plt
# Jupyter/IPython magic; omit this line in a plain Python script
%matplotlib inline

# Enable logging for gensim (optional)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Words such as am/is/are/of/a/the/but/… carry no information about the "topic". As a preprocessing step, we can therefore remove them from the documents.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To do this, we import the stop words from NLTK. The original stop word list can also be extended with a few extra words.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># NLTK stop words (requires the corpus: nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In this tutorial we use the 20 Newsgroups dataset, which contains about 11k newsgroup posts from 20 different topics. It is available as newsgroups.json.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Import dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()
</code></pre>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/d3beae7747bb468fba286b5ff6c9c1c7~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=cmitgrKvhxhhViB1Gm1LWnVhrWk%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Remove emails and newline characters</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Before we start topic modeling, the dataset needs to be cleaned. First, we remove email addresses, extra whitespace, and newline characters.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Convert to list
data = df.content.values.tolist()

# Remove emails (assumed pattern: drop any token containing '@')
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters (assumed pattern: collapse whitespace runs to one space)
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])
</code></pre>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/147a8f8339b94ea1a40c06cc670e9a19~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=%2B0yZjApehoWe3VAYSjzCzbd1urE%3D" style="width: 50%; margin-bottom: 20px;"></div>
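<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To see what this kind of cleanup does, the sketch below applies three substitutions to a single made-up post. The sample text and the exact regular expressions are assumptions chosen for illustration.</span></p>

```python
import re

# Hypothetical raw post with an email address, line breaks, and apostrophes.
sample = "From: alice@example.com\nSubject: What's this car?\n\n  It's a 2-door   sports car."

no_email = re.sub(r"\S*@\S*\s?", "", sample)   # drop any token containing '@'
one_line = re.sub(r"\s+", " ", no_email)       # collapse whitespace runs, including newlines
cleaned = re.sub(r"'", "", one_line)           # drop single quotes

print(cleaned)
```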
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Tokenize words and clean up the text</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Let's tokenize each sentence into a list of words, removing punctuation and unnecessary characters.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code>def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation;
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])
</code></pre>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/3ab060a6449f4ef18d8444067b341ec9~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=qc6ApA2QCOhBqQ9q7FdKCK%2F4n2g%3D" style="width: 50%; margin-bottom: 20px;"></div>
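<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For readers who want to see the idea without installing Gensim, here is a rough standard-library approximation of this tokenization step. <code>rough_tokenize</code> is a hypothetical helper, not Gensim's implementation, and the length bounds of 2 and 15 characters are an assumption about sensible defaults.</span></p>

```python
import re
import unicodedata


def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase, strip accents, and keep alphabetic tokens of moderate length."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of letters only; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", no_accents.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


tokens = rough_tokenize("Where's my café? I saw a 2-door car, really!")
print(tokens)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Single-letter fragments such as "s", "i", and "a" are dropped by the length filter, which is why short function words disappear even before stop word removal.</span></p>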
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Create bigram and trigram models</span></strong></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
</code></pre>
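<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The role of <code>min_count</code> and <code>threshold</code> can be illustrated with a count-based phrase score in plain Python. The sketch below uses the scoring formula from Mikolov et al. (2013), which, as far as I know, Gensim's default scorer is modeled on; <code>find_bigrams</code> and the toy sentences are hypothetical, and the values 1 and 1.0 are chosen only to suit such a tiny corpus.</span></p>

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Return adjacent word pairs whose co-occurrence score exceeds the threshold."""
    unigram_counts = Counter(w for s in sentences for w in s)
    bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigram_counts)

    phrases = {}
    for (a, b), n_ab in bigram_counts.items():
        # Pairs that occur often relative to their individual words score high.
        score = (n_ab - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


# Made-up tokenized sentences.
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "a", "big", "park"],
    ["the", "park", "is", "new"],
]
phrases = find_bigrams(sentences)
print(phrases)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Only ("new", "york") passes: it occurs three times, while every other pair occurs once and scores zero. Raising the threshold makes the detector stricter, which matches the "higher threshold, fewer phrases" comment in the code above.</span></p>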
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Remove stop words, make bigrams, and lemmatize</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In this step we define functions to remove stop words, form bigrams, and lemmatize, and then call them in sequence.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep the lemma of each token whose part-of-speech tag is allowed
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
</code></pre>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spaCy 'en' model, keeping only the tagger component (for efficiency)
# python3 -m spacy download en
# Note: spaCy 3.x dropped the 'en' shortcut; use 'en_core_web_sm' there
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
</code></pre>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p26-sign.toutiaoimg.com/pgc-image/b4551091bb4242a6bbae08eb205f76e8~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=Nt6wblJ%2BK8KrhPOpUeU9d8tzgvo%3D" style="width: 50%; margin-bottom: 20px;"></div>
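<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The effect of stop word removal and part-of-speech filtering can be seen on a tiny hand-made example. In the sketch below the stop word list is a short hypothetical one, the (lemma, tag) pairs are written by hand rather than produced by spaCy, and <code>filter_tokens</code> merges the two filters into one pass for brevity, whereas the tutorial code runs them as separate steps.</span></p>

```python
# Tiny hand-written stop word list (hypothetical; NLTK's list is much longer).
stop_words = {"the", "is", "a", "of", "from", "subject", "re", "edu", "use"}

# Hand-tagged (lemma, part-of-speech) pairs standing in for a tagger's output.
tagged_doc = [
    ("subject", "NOUN"),
    ("the", "DET"),
    ("engine", "NOUN"),
    ("be", "AUX"),
    ("really", "ADV"),
    ("loud", "ADJ"),
    ("in", "ADP"),
    ("car", "NOUN"),
]


def filter_tokens(tagged_doc, stop_words, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    """Keep lemmas that are not stop words and whose tag is in allowed_postags."""
    return [
        lemma
        for lemma, pos in tagged_doc
        if lemma not in stop_words and pos in allowed_postags
    ]


kept = filter_tokens(tagged_doc, stop_words)
print(kept)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Here "subject" is a noun but is still removed because it is on the stop word list, while "be" and "in" are removed because their tags are not allowed.</span></p>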
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Create the dictionary and corpus needed for topic modeling</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Gensim assigns a unique id to each word in the documents. Before training, we need to create a dictionary and a corpus as input to the model: the dictionary maps each word to its id, and the corpus stores each document as (word id, count) pairs.</span></p>
<pre style="color: black; text-align: left; margin-bottom: 10px;"><code># Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
</code></pre>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/24898723f7034e62ad4adc90e723b576~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=D%2BdxU14gWCykVFhAouYEqNDbZnA%3D" style="width: 50%; margin-bottom: 20px;"></div>
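<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The dictionary-plus-corpus structure can be imitated with the standard library to show what the model actually receives. This is a sketch on made-up documents: ids here are assigned in order of first appearance, which need not match the ids Gensim would assign.</span></p>

```python
from collections import Counter

# Hypothetical preprocessed documents.
texts = [["car", "engine", "car"], ["moon", "orbit", "engine"]]

# Dictionary: map every distinct token to an integer id.
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Corpus: each document becomes a sorted list of (token id, count) pairs.
corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in texts]

print(token2id)
print(corpus)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The first document becomes [(0, 2), (1, 1)]: word 0 ("car") appears twice and word 1 ("engine") once.</span></p>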
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;"><span style="color: black;">创立</span>主题模型</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">此刻</span><span style="color: black;">咱们</span>准备进入核心<span style="color: black;">过程</span>,<span style="color: black;">运用</span>LDA进行主题建模。让<span style="color: black;">咱们</span><span style="color: black;">起始</span><span style="color: black;">创立</span>模型。<span style="color: black;">咱们</span>将<span style="color: black;">创立</span>20个<span style="color: black;">区别</span>主题的LDA模型,其中<span style="color: black;">每一个</span>主题都是关键字的组合,<span style="color: black;">每一个</span>关键字在主题中都<span style="color: black;">拥有</span><span style="color: black;">必定</span>的权重(weightage)。</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Some of the parameters are explained below:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">num_topics: the number of topics, which must be specified in advance;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">chunksize: the number of documents used in each training chunk;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">alpha: a hyperparameter that affects the sparsity of the topics;</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">passes: the total number of training passes over the corpus.</span></p>
<span style="color: black;"># Build the LDA model</span>
<span style="color: black;">lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,</span>
<span style="color: black;">                                            id2word=id2word,</span>
<span style="color: black;">                                            num_topics=20,</span>
<span style="color: black;">                                            random_state=100,</span>
<span style="color: black;">                                            update_every=1,</span>
<span style="color: black;">                                            chunksize=100,</span>
<span style="color: black;">                                            passes=10,</span>
<span style="color: black;">                                            alpha='auto',  # must be passed as the string 'auto'</span>
<span style="color: black;">                                            per_word_topics=True)</span>
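<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To see why alpha is described as affecting sparsity, the sketch below draws topic mixtures from symmetric Dirichlet distributions using only the standard library. It is an illustration under simplifying assumptions, not what gensim does internally: with alpha='auto', gensim learns an asymmetric prior from the data. The function names and the alpha values 0.1 and 5.0 are hypothetical.</span></p>

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=20, n=2000, seed=100):
    """Average weight of the single most probable topic across n samples."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


sparse = mean_largest_share(alpha=0.1)  # small alpha: mass concentrates on few topics
dense = mean_largest_share(alpha=5.0)   # large alpha: mass spreads across topics
print(f"alpha=0.1 -> mean largest topic share: {sparse:.2f}")
print(f"alpha=5.0 -> mean largest topic share: {dense:.2f}")
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A small alpha yields documents dominated by one or two topics, while a large alpha yields documents that mix many topics more evenly.</span></p>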
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Viewing the topics in the LDA model</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We can inspect the keywords of each topic and the weight (importance) of each keyword.</span></p>
<span style="color: black;"># Print the top keywords of each topic</span>
<span style="color: black;">pprint(lda_model.print_topics())</span>
<span style="color: black;"># Apply the model to the corpus to get per-document topic assignments</span>
<span style="color: black;">doc_lda = lda_model[corpus]</span>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/097662ead81d475ebe3b8df08f148e61~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=TTdcTtvziEsE2XCxorJXpRKohsQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/947e2901064a48768220c5ae3ea84d7f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=y%2BORahQDi%2FIU%2F0wPu%2FhQtYos3ew%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/f37b901ccaf54c87b917d11036fcbb41~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=gzrcEFXEonIK5xpXdC9xicw83rg%3D" style="width: 50%; margin-bottom: 20px;"></div>
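<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">print_topics() returns each topic as a (topic id, string) pair, where the string has the form 0.016*"car" + 0.014*"power" + ... The sketch below shows one way to turn such a string into (keyword, weight) pairs. The helper parse_topic and the sample values are hypothetical and are not taken from the model trained above.</span></p>

```python
import re

# Matches one weight*"word" term of a topic string
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a print_topics()-style string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]


# Hypothetical sample in the (topic_id, string) shape returned by print_topics()
sample_topics = [
    (0, '0.016*"car" + 0.014*"power" + 0.010*"light"'),
    (1, '0.035*"window" + 0.025*"card" + 0.018*"driver"'),
]
for topic_id, topic_string in sample_topics:
    print(topic_id, parse_topic(topic_string))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With a trained model at hand, lda_model.show_topic(topic_id) returns such (word, weight) pairs directly, so string parsing is only needed when working from printed output.</span></p>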
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Computing model perplexity and coherence score</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Model perplexity measures how well a probability distribution or probabilistic model predicts a sample. Topic coherence scores a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic.</span></p>
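<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To make the idea of coherence concrete, here is a minimal sketch of a simplified UMass-style score, which rewards topics whose top words tend to occur in the same documents. Note that this is a different and much simpler measure than the c_v coherence used in this article; the function name and the toy documents are hypothetical.</span></p>

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations keeps input order, so w_j is the higher-ranked word of each pair
    for w_j, w_i in combinations(top_words, 2):
        denom = doc_freq(w_j)
        if denom == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy reference documents (hypothetical data)
documents = [
    ["car", "engine", "power"],
    ["car", "engine", "wheel"],
    ["window", "card", "driver"],
    ["car", "power", "wheel"],
]
coherent = umass_coherence(["car", "engine", "power"], documents)
mixed = umass_coherence(["car", "window", "card"], documents)
print(coherent, mixed)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The topic whose words co-occur in the toy documents scores higher than the topic that mixes unrelated words.</span></p>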
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In short, they provide a convenient way to judge how good a given topic model is.</span></p>
<span style="color: black;"># Compute perplexity</span>
<span style="color: black;"># log_perplexity returns a per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better</span>
<span style="color: black;">print('\nPerplexity: ', lda_model.log_perplexity(corpus))</span>

<span style="color: black;"># Compute coherence score</span>
<span style="color: black;">coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')</span>
<span style="color: black;">coherence_lda = coherence_model_lda.get_coherence()</span>
<span style="color: black;">print('\nCoherence Score: ', coherence_lda)</span>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/aa94e62454c742df876a96cabcff421e~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=BViH7kkbBf6%2BBMAQg9dxOtfo6rw%3D" style="width: 50%; margin-bottom: 20px;"></div>
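<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One point worth spelling out: gensim's log_perplexity returns a per-word likelihood bound (a negative number) rather than the perplexity itself, and the gensim documentation gives the corresponding perplexity as 2 raised to the negative bound. A minimal sketch of that conversion follows; the helper name and the bound values are hypothetical.</span></p>

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)


# Hypothetical bound values, for illustration only
for bound in (-8.5, -7.0):
    print(f"bound={bound:.1f} -> perplexity={perplexity_from_bound(bound):.1f}")
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A bound closer to zero therefore corresponds to a lower perplexity, so "lower is better" applies to the perplexity and not to the raw value that log_perplexity prints.</span></p>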
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Visualizing topics and keywords</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Now we can examine the generated topics and their associated keywords. A convenient way to do this is to visualize the model with pyLDAvis.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. It extracts information from a fitted Latent Dirichlet Allocation (LDA) topic model to drive an interactive, web-based visualization.</span></p>
<span style="color: black;"># Visualize the topics</span>
<span style="color: black;">pyLDAvis.enable_notebook()</span>
<span style="color: black;"># Note: in pyLDAvis 3.x this module is named pyLDAvis.gensim_models</span>
<span style="color: black;">vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)</span>
<span style="color: black;">vis</span>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/131ad271da6344298e84db8ddfa79352~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=rAJ0YQyLacl1vm4SXuafezIxZqg%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/0f3fb8e48f62477eb143e5a57b9b9cdb~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1728117795&x-signature=Ob9ZA%2BCAOoFdbbKw7HUiutEMBd4%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With that, we have successfully built a decent topic model!</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A brief explanation of the results: each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent that topic is. As a rule of thumb, a good topic model has large, non-overlapping bubbles.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We can also drag the slider in the right-hand panel to adjust the relevance parameter lambda (λ), which changes how the terms of the selected topic are ranked.</span></p>
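<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The term list in the right-hand panel of pyLDAvis is ordered by a relevance score controlled by a slider value λ. Assuming the relevance definition from the LDAvis paper by Sievert and Shirley, relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), the ranking can be sketched as follows. The function names and all probabilities are hypothetical.</span></p>

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))


def rank_terms(topic_probs, corpus_probs, lam):
    """Sort a topic's terms by decreasing relevance for a given lambda."""
    return sorted(topic_probs,
                  key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
                  reverse=True)


# Hypothetical probabilities: "people" is common everywhere,
# "carburetor" is rare overall but concentrated in this topic
topic_probs = {"people": 0.05, "carburetor": 0.02}
corpus_probs = {"people": 0.04, "carburetor": 0.001}
print(rank_terms(topic_probs, corpus_probs, lam=1.0))
print(rank_terms(topic_probs, corpus_probs, lam=0.0))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With λ = 1 the terms are ranked purely by their probability within the topic; lowering λ favors terms that are distinctive for the topic over terms that are frequent across the whole corpus.</span></p>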
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Conclusion</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Topic modeling is one of the main applications of natural language processing. The purpose of this article was to explain what topic modeling is and how to implement a Latent Dirichlet Allocation (LDA) model in practice.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To that end, we looked at how LDA works, built a basic topic model with the LDA implementation in the Gensim package, and visualized the topics with pyLDAvis.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">I hope you enjoyed this article and found it useful.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Editor: 王菁</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Proofreader: 林也霖</span></p>