Build your first web crawler: a simple scraper for Baidu Images
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Hi everyone, I'm Runsen (润森).</p>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">What is a web crawler</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm. (Source: Baidu Baike)</p>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">The robots protocol</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The Robots protocol (also called the crawler protocol or robot protocol) is formally the “Robots Exclusion Protocol”. Through it, a website tells search engines which pages may be crawled and which may not.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">robots.txt is a plain text file that can be created and edited with any ordinary text editor, such as Notepad on Windows. robots.txt is a convention, not a command. It is the first file a search engine looks at when it visits a site, and it tells the spider program which files on the server may be viewed. (Source: Baidu Baike)</p>
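<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As a rough illustration of how a crawler might consult these rules, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the crawler name my-crawler and the example.com URLs are all made up for the example; a real crawler would read the target site's actual robots.txt.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Allowed: no Disallow rule matches this path.
    print(can_crawl(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/index.html"))
    # Blocked: the path falls under Disallow: /private/
    print(can_crawl(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/a.html"))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Against a live site, RobotFileParser can also load the file itself through set_url() followed by read(), at the cost of one extra network request.</p>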
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Crawling Baidu Images</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Goal: crawl images from Baidu and save them to the local computer.</p>Can it be crawled?<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, is the data public? Can it be downloaded?</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/bd5b7ff8be9e437297fa6de1afa1d789~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=gBTJwQVEak1mzlq6ThNGmxR%2FqoA%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As the screenshot shows, Baidu images can be downloaded directly, which suggests they can also be crawled.</p>Crawling a single image first
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, what is an image?</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">An image is something with visible form: a collective term for drawings, photographs, rubbings and the like. In technical drawing, a figure is a basic term for a representation that uses points, lines, symbols, text and numbers to describe the geometric features, shape, position and size of an object. With the development of digital acquisition technology and signal processing theory, more and more images are stored in digital form.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next, where does the image live?</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The image is stored in a database on a cloud server.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Every image has its own URL. We send a request with the requests module and save the response to a file opened in wb+ (binary write) mode.</p>
import requests

# Fetch one image by its URL; the URL points to a PNG file.
r = requests.get('http://pic37.nipic.com/20140113/8800276_184927469000_2.png')
# r.content holds the raw bytes of the response body.
with open('demo.png', 'wb+') as f:
    f.write(r.content)
Batch crawling
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">But nobody writes code to fetch a single image; downloading it by hand is quicker. The whole point of a crawler is batch downloading, and that is what makes it a real crawler.</p>
Analyzing the site<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, get to know JSON.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript specification standardized by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes JSON an ideal data-interchange language.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">JSON borrows the syntax of JavaScript objects; it is simply a text format for storing and retrieving data</strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A JSON string</p>{
    "name": "Maoli",
    "age": 18,
    "feature": ["tall", "rich", "handsome"]
}<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A Python dictionary</p>{
    'name': 'Maoli',
    'age': 18,
    'feature': ['tall', 'rich', 'handsome']
}
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">However, a JSON string is just text, so in Python you cannot read a value from it by key directly; it first has to become a Python dictionary.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Import Python's json module and call json.loads(s) to convert JSON data into Python data (a dictionary)</strong></p>
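<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A minimal sketch of that conversion, using a JSON document shaped like the earlier example. The variable names s, person and text are only illustrative.</p>

```python
import json

# A JSON document held as plain text.
s = '{"name": "Maoli", "age": 18, "feature": ["tall", "rich", "handsome"]}'

# json.loads turns JSON text into Python objects (here, a dict).
person = json.loads(s)
print(type(person).__name__)
print(person["name"])
print(person["feature"][0])

# json.dumps goes the other way: Python object back to JSON text.
text = json.dumps(person, ensure_ascii=False)
print(text)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Once the data is a dictionary, values are read with ordinary key lookups such as person["age"], which is the same access pattern the crawler later applies to the image search response.</p>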
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Using Ajax</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Ajax, short for “Asynchronous JavaScript and XML”, is a web development technique for building interactive web applications.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The images are loaded through Ajax: when I scroll down, more images appear automatically because the page sends new requests in the background.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/bc2c874bff7843edb7049a75c90c9295~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=oHES71gDnYlJg6KR9CeHCD38UqQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Locating the image URLs</h3>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/42e9a35061264c88810b025665d6bae8~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=eNYIZue5aK8%2FzMlcNoCpkO3Dhh0%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Then find the URL of the corresponding Ajax request</h3>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/016e6a4718274acb97114c2d4902a6b3~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=w%2BoIy%2BvijMj0cTqxipmBBf3dYWc%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Request the Ajax URL, convert the JSON response into a dictionary, then look up the keys to get each image's URL</strong></p>
import requests
import json

# Send a browser-like User-Agent with the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1561022599290=', headers=headers).text
# The image entries sit under the 'data' key of the JSON response.
res = json.loads(r)['data']
for index, i in enumerate(res):
    # Skip entries that carry no hoverURL instead of crashing on them.
    url = i.get('hoverURL')
    if not url:
        continue
    print(url)
    with open('{}.jpg'.format(index), 'wb+') as f:
        f.write(requests.get(url).content)
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/23e93d6c09634468beb78f78df83877f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=tfHfQr%2BBPbDcP21OeKTElHcZgik%3D" style="width: 50%; margin-bottom: 20px;"></div>Constructing the JSON URLs to keep crawling images
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Each JSON response contains 30 images, so a single request gets us 30 images, but that is still not enough.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, compare the requests behind two different JSON responses:</p>
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&1561022599355=
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1561022599290=
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Comparing them shows that the key difference between successive requests is the pn parameter, which advances by 30 each time (30, 60, ...), matching the rn=30 images returned per request.</p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/25c646f5038547c4b931bc04f40e62e6~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=hkhq0ucePGK5ywIBa6SYD542sVE%3D" style="width: 50%; margin-bottom: 20px;"></div>
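<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The paging pattern can be sketched as a small URL builder in which only pn changes from page to page. The function name page_urls and the trimmed parameter list are assumptions made for illustration: the captured requests carry many more parameters, and it is untested whether the server accepts this shorter form.</p>

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/acjson"
PAGE_SIZE = 30  # rn=30 in the captured requests


def page_urls(num_pages, word="图片"):
    """Build one acjson URL per page; only the pn offset differs."""
    urls = []
    for page in range(num_pages):
        # Assumed minimal parameter set, not the full captured query string.
        params = {
            "tn": "resultjson_com",
            "ipn": "rj",
            "queryWord": word,
            "word": word,
            "pn": page * PAGE_SIZE,  # offset of the first image on this page
            "rn": PAGE_SIZE,         # number of images per response
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">urlencode percent-encodes the search word as UTF-8, so 图片 becomes %E5%9B%BE%E7%89%87, the same value seen in the captured URLs.</p>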
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Finally, wrap the code into functions: one acts as the producer, collecting image URLs into a list, and the other acts as the consumer, saving the images.</p>
# -*- coding:utf-8 -*-
# time :2019/6/20 17:07
# author: 毛利
import requests
import json
import os


def get_pic_url(num):
    """Producer: collect image URLs from num result pages (30 images each)."""
    pic_url = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    for i in range(num):
        # Only pn changes between pages: 0, 30, 60, ...
        page_url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1561022599290='.format(30 * i)
        r = requests.get(page_url, headers=headers).text
        res = json.loads(r)['data']
        if res:
            print(res)
            for j in res:
                try:
                    url = j['hoverURL']
                    pic_url.append(url)
                except KeyError:
                    # Some entries have no hoverURL key.
                    print('No URL for this image')
    print(len(pic_url))
    return pic_url


def down_img(num):
    """Consumer: download every collected image into the output folder."""
    pic_url = get_pic_url(num)
    path = r'D:\照片'
    # Create the folder if it does not exist yet.
    os.makedirs(path, exist_ok=True)
    for index, i in enumerate(pic_url):
        filename = os.path.join(path, str(index) + '.jpg')
        print(filename)
        with open(filename, 'wb+') as f:
            f.write(requests.get(i).content)


if __name__ == '__main__':
    num = int(input('How many pages to crawl (30 images per page): '))
    down_img(num)
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/737ee212d1d94508af42435fff9177e4~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=PofS8ZDABP%2F9y%2BzmeGQPERc30cQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/3fca1e17c4d34eeca16fe77546f8e378~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725103085&x-signature=13LWV2vZmBbHjUvKW9AsUaQKguQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
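<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The producer and consumer roles can also be kept apart more strictly, so that each can be tried without touching the network. The sketch below is one possible arrangement, not the code used above: iter_image_urls, save_images and the fetch argument are hypothetical names, and the sample data and downloader are fakes.</p>

```python
import os
import tempfile


def iter_image_urls(pages):
    """Producer: yield every non-empty hoverURL from parsed 'data' lists."""
    for data in pages:
        for item in data:
            url = item.get("hoverURL")
            if url:
                yield url


def save_images(urls, fetch, out_dir):
    """Consumer: fetch each URL and write it to out_dir as 0.jpg, 1.jpg, ..."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        filename = os.path.join(out_dir, "{}.jpg".format(index))
        with open(filename, "wb") as f:
            f.write(fetch(url))  # fetch must return the image bytes
        saved.append(filename)
    return saved


if __name__ == "__main__":
    # Fake parsed responses and a fake downloader, so this runs offline.
    fake_pages = [
        [{"hoverURL": "http://example.com/a.jpg"}, {}],
        [{"hoverURL": ""}, {"hoverURL": "http://example.com/b.jpg"}],
    ]

    def fake_fetch(url):
        return url.encode("utf-8")

    with tempfile.TemporaryDirectory() as tmp:
        print(save_images(iter_image_urls(fake_pages), fake_fetch, tmp))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For real downloads, fetch could be a small wrapper around requests, for example a function that returns requests.get(url, timeout=10).content.</p>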