Python Web Scraping Tutorial: Downloading Music from NetEase Cloud Music
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Before we start, a few quick notes:</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">I am just a Python scraping hobbyist; if this article infringes on anyone's rights, please contact me and I will take it down. The article assumes some basic Python scraping knowledge and mainly uses two standard scraping modules: requests and selenium. I recommend Google Chrome, since it makes capturing network traffic and extracting data easier.</p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Part 1: Analyzing the Web Page</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, open the web version of NetEase Cloud Music.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Then search for a song. Here I search for "空山新雨后" by 锦零.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/f04c833fb57f4e5f9cd33c65f98a8c2d~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=wmRTqZUoa9U3IPH8aB0s41PKJo4%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Now look at the page URL: whatever follows s= is the keyword we searched for.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/620805a2c2a243779883237a19fbc2b6~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=zTNEMr8LqDDCc5OOB%2FMnUbgpQmE%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Searching for a different song shows the same pattern, which confirms our guess.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/a775d722db664853af13e18cd9a49283~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=mVGqlIT6clhqiDlM06r70myz5V8%3D" style="width: 50%; margin-bottom: 20px;"></div>
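<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The URL pattern observed above can be turned into a small helper. This is only a minimal sketch: the function name build_search_url is made up for illustration, the trailing &amp;type=1 is taken from the search URL used in Part 3, and percent-encoding the keyword with urllib.parse.quote is an assumption about what the browser does with non-ASCII keywords.</p>

```python
from urllib.parse import quote

# Search page prefix as seen in the address bar; everything after "s=" is the keyword.
SEARCH_BASE = "https://music.163.com/#/search/m/?s="


def build_search_url(keyword: str) -> str:
    """Return the search page URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes spaces and non-ASCII characters as UTF-8.
    return SEARCH_BASE + quote(keyword) + "&type=1"


print(build_search_url("空山新雨后"))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Encoding the keyword keeps the URL valid when a song name contains spaces or characters such as &amp; that would otherwise be read as part of the URL syntax.</p>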
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next, let's open a song and play it to see whether we can get the URL of the audio file directly. If we can, a single requests.get call on that URL will give us the .mp3 file.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Open the first result for "空山新雨后" and you will see a link labeled "生成外链播放器" ("Generate external-link player").</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/364cb78c77c949ecbacb13e6d3776727~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=2eMWhDuHUQgWf956IZ9IIe9ukNg%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Seeing this, I got excited, as if the job were nearly done. I happily clicked it, and the result...</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/75a6123ac70a4f7899f33b876b4388d1~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=AyeEdfH8jo5J5AzVdEPT4%2ByVTG0%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Oh well. We can't give up, though, so let's inspect the page.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">But when we locate the two regions of the page most likely to contain the external link, we find nothing there.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1526a6cd61a74940894bdfd0b4cdfdd3~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=N7WHxduuPTKh3%2FOgILo3MtzUSrA%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/b7f7db8fbbd64f4bad9f9a50c0d77052~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=MAKF%2FkZvK2W3y0ID7TKGODmZzVo%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Still, as someone who lives by the motto "规格严格,功夫到家" (roughly, "strict standards, thorough effort"), I couldn't give up, so I opened the browser's network capture tool.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Following the usual routine, we switch to the XHR filter.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/ed22d469639a406d8c1ef0325c75e49a~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=QgXEsgmwhiuBGxCjzIRtWVWPzko%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">After clicking play, a large number of requests show up. Our job is to find the one whose content-type is an audio type.</p>
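<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The check described above, looking for an audio content-type among many responses, can also be expressed in code. The sketch below is an illustration only: is_audio_response is a hypothetical helper, and the header values in the loop are made-up examples of the kind shown in the Network panel, not values captured from the site.</p>

```python
def is_audio_response(content_type: str) -> bool:
    """Return True when a Content-Type header value names an audio media type."""
    # Drop parameters such as "; charset=UTF-8" and ignore letter case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("audio/")


# Made-up header values for illustration.
for value in ["application/json;charset=UTF-8", "audio/mpeg", "text/html"]:
    print(value, "->", is_audio_response(value))
```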
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Persistence pays off: after searching for a little (well, a very long) while, I found it.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/db5845fac4ed4b24aee34ad100e94483~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=H36Dk8PnaG0FLVwkp33gcBtRjUM%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/86142f342ea1447b8704611b2652858c~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=DsQx4FJRNTUTWmBuDS6ikD1EIwo%3D" style="width: 50%; margin-bottom: 20px;"></div>
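<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">So, full of hope once more, I copied the Request URL of that request.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pasting it into the address bar and visiting it gave a very satisfying result: this was the URL I had been looking for all along.</p>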
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/7c8e85b1a9c14b4d815427d3baabd415~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=NCE2eLoHDXElRmUY5yHfcr1PC3E%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here is that URL:</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://m10.music.126.net/20200715163315/a075d787d191f6729a517527d6064f59/ymusic/0552/0f0e/530f/28d03e94478dcc3e0479de4b61d224e9.mp3</p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Part 2: Writing the Scraper</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">From here on it is very simple.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The code below is about as routine as scraping gets, and anyone with the basics should be able to follow it. If something is unclear, the comments explain each step.</p>
<pre><code># Import the requests package
import requests

# Spoof the User-Agent so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36"
}

# Specify the URL
url = "https://m10.music.126.net/20200715163315/a075d787d191f6729a517527d6064f59/ymusic/0552/0f0e/530f/28d03e94478dcc3e0479de4b61d224e9.mp3"

# Request the URL with requests.get and keep the raw bytes of the response
audio_content = requests.get(url=url, headers=headers).content

# Save the bytes to a local file
with open("空山新雨后.mp3", "wb") as f:
    f.write(audio_content)
print("空山新雨后 downloaded successfully!")</code></pre>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Part 3: Going Further</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">At this point you may be wondering why the selenium module has not been used at all. And can we download any song we like directly, without hunting down a URL for each one? Of course we can!</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">When NetEase Cloud Music plays a song online, the song has an external-link address that does not change and is bound to the song's unique id. The URL of each song's audio file looks like this:</p>
<pre><code># song_id holds the song's id value as a string
url = "http://music.163.com/song/media/outer/url?id=" + song_id + ".mp3"</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Getting the id is easy too: when we open a song's page, the address bar shows the page URL, which contains the id, as in the screenshot below:</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/51ba1655690f4458ab5aeca5eb94248f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1729477209&x-signature=zDYMA9U%2FNXvl4Q3JBi%2BH5lIGqqM%3D" style="width: 50%; margin-bottom: 20px;"></div>
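<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Reading the id out of the address can be automated as well. The sketch below assumes the song page URL carries the id as an id=&lt;digits&gt; parameter, as the screenshot suggests; the helper name extract_song_id, the example URL, and the id 123456 are all made up for illustration.</p>

```python
import re
from typing import Optional


def extract_song_id(page_url: str) -> Optional[str]:
    """Return the numeric id found in a song page URL, or None if there is none."""
    # Search the whole string: the id may sit after the "#" fragment, where
    # urllib.parse would not expose it as an ordinary query parameter.
    match = re.search(r"[?&]id=(\d+)", page_url)
    return match.group(1) if match else None


# Hypothetical URL; the id is not a real song id.
print(extract_song_id("https://music.163.com/#/song?id=123456"))
```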
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">So all we need to do is replace the url in the program above with the new one.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For an even smoother experience, where the program itself searches for a song and obtains its id, we need the selenium module.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Why selenium rather than xpath or bs4?</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because the data on the search page is loaded dynamically, parsing the search page's HTML directly yields no data at all. With my current skills, the only approach I could think of was the all-purpose selenium module. The steps are roughly as follows:</p>
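<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As a rough sketch of that change, the external-link pattern can be wrapped in two small functions. Note the swap: to keep the example free of third-party packages it uses the standard library's urllib.request in place of requests, and it streams the response to disk instead of holding it in memory. The names build_outer_url and download_song, the shortened User-Agent string, and the timeout value are assumptions made for illustration.</p>

```python
import shutil
import urllib.request

# External-link pattern quoted in the article; {song_id} is the song's id value.
OUTER_URL_TEMPLATE = "http://music.163.com/song/media/outer/url?id={song_id}.mp3"
# Shortened browser-like User-Agent (an assumption; any browser string should do).
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_outer_url(song_id: str) -> str:
    """Fill a song id into the external-link URL pattern."""
    return OUTER_URL_TEMPLATE.format(song_id=song_id)


def download_song(song_id: str, path: str, timeout: float = 30.0) -> None:
    """Stream the audio behind the external link into a local file."""
    request = urllib.request.Request(
        build_outer_url(song_id), headers={"User-Agent": USER_AGENT}
    )
    # urlopen follows redirects and raises urllib.error.HTTPError on 4xx/5xx.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(path, "wb") as f:
            shutil.copyfileobj(response, f)


# Example call (hypothetical id, not executed here):
# download_song("123456", "song.mp3")
print(build_outer_url("123456"))
```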
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Configure selenium to run without a visible browser window:</p>
<pre><code>from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Import the packages:</p>
<pre><code>import requests
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from time import sleep</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Ask for a song name and build the URL of the corresponding search page:</p>
<pre><code>name = input("Enter a song name: ")
url_1 = "https://music.163.com/#/search/m/?s=" + name + "&amp;type=1"</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Fetch the HTML of the search page:</p>
<pre><code># Initialize the browser object (Selenium 4 style: the driver path is passed
# through Service and the options through options=)
browser = webdriver.Chrome(service=Service("chromedriver.exe"), options=chrome_options)
# Visit the URL
browser.get(url=url_1)
# The results sit inside an iframe, so switch into it
browser.switch_to.frame("g_iframe")
# Wait 0.5 seconds
sleep(0.5)
# Grab the page source
page_text = browser.execute_script("return document.documentElement.outerHTML")
# Quit the browser
browser.quit()</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Use the re module to match the id values, song names, and singers in the HTML:</p>
<pre><code># The capture group of ex1 was garbled in the source; \d+ (digits only) is an assumption
ex1 = r'&lt;a.*?id="(\d+)"'
ex2 = r'&lt;b.*?title="(.*?)"&gt;&lt;span class="s-fc7"&gt;'
ex3 = r'class="td w1"&gt;&lt;div.*?class="text"&gt;&lt;a.*?href=".*?"&gt;(.*?)&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;'
id_list = re.findall(ex1, page_text, re.M)[::2]
song_list = re.findall(ex2, page_text, re.M)
singer_list = re.findall(ex3, page_text, re.M)</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pack each song name, singer, and id into a tuple, collect the tuples in a list, and print them:</p>
<pre><code>li = list(zip(song_list, singer_list, id_list))
for i in range(len(li)):
    # Print one numbered entry per line
    print(str(i + 1) + "." + str(li[i]), end="\n")</code></pre>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">From the id of the song you want, build its URL, then fetch that URL with requests.get using the program above.</p>
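<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The last step, going from the printed list to one id, is left to the reader in the original. One possible way to do it is sketched below, assuming the list holds (song name, singer, id) tuples as built with zip above. The function choose_song_id and the sample entries are invented for illustration.</p>

```python
from typing import List, Tuple

Song = Tuple[str, str, str]  # (song name, singer, id)


def choose_song_id(songs: List[Song], choice: str) -> str:
    """Map the 1-based number typed by the user to the id of that entry."""
    text = choice.strip()
    if not text.isdecimal():
        raise ValueError("please enter a number")
    index = int(text) - 1
    if not 0 <= index < len(songs):
        raise ValueError("number out of range")
    return songs[index][2]


# Made-up entries for illustration only.
songs = [("Song A", "Singer A", "111"), ("Song B", "Singer B", "222")]
print(choose_song_id(songs, "2"))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In the real program, choice would come from input(), and the returned id would be filled into the external-link URL shown earlier in this part.</p>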
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Part 4: Summary</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">My knowledge is limited, and this approach of scraping through the external link has plenty of shortcomings. For example, songs that cannot be played online cannot be downloaded this way.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Still, writing a small program like this for practice really does go a long way toward improving your skills.</p>