nykek5i posted on 2024-8-22 17:23:29

Goodbye, itchat! Goodbye, WeChat Web!

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/sz_mmbiz_png/NOM5HN2icXzy3oUJC81lHlcGDwlibYtVXDC7sKgn1iapETEj9tN8HEJ4Qg09nWlvwTvANxQKJyHSYR4FgQ4YH0ejw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p><span style="color: black;">There is a term, </span><strong style="color: blue;"><span style="color: black;">"March crawlers"</span></strong><span style="color: black;">, for students who are about to graduate and need data for their thesis. They skim a few tutorials online, pick up a smattering of </span><span style="color: black;">requests</span><span style="color: black;"> (or even </span><span style="color: black;">urllib</span><span style="color: black;">) and regular expressions, and start writing crawlers that scrape the web furiously.</span><span style="color: black;">
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">These crawlers make almost no attempt to disguise themselves: they do not rotate IPs, do not set headers, and do not limit their request rate. Sites with anti-scraping measures block them easily, and small sites without such measures end up under real traffic pressure.</p>
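      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Of the three omissions above, the missing rate limit is probably the cheapest to fix. The following is only a minimal sketch, assuming a single-threaded crawler talking to one host. The names <code>Throttle</code>, <code>min_interval</code>, <code>clock</code> and <code>sleep</code> are hypothetical and are not part of requests or urllib.</p>

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has elapsed since the previous call.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Remember the moment the caller is released, not the moment it arrived.
        self._last = now + delay
        return delay
```

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A crawler would create one <code>Throttle</code> per target host and call <code>wait()</code> immediately before each request. This only addresses request rate; it does nothing about IPs or headers.</p>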
    </span><p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Later, after reading some article or other, they learn that they should use proxy IPs and change the User-Agent. So they do exactly that and nothing more: they set User-Agent in the headers and leave every other field unset. Point this out and they argue back with full confidence: <strong style="color: blue;">"Look, I can scrape the data just fine like this. The other header fields are useless."</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Is that really true?</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Let's run an experiment. First, open http://httpbin.org/headers in Chrome. This site echoes back the headers your client sent. The result looks like this:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmibS1SyVHSjwjBTDQNxa7Qe6GLWzXUNDAUztU27djViaOticibzLBSw5ZCQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next, request the same URL with requests without setting any headers. The result looks like this:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmFOn43G9LS8N2Pbn0dTC2MXmia4Z5Z9Iiae1PtdXdHpAygWcrp0FF9LQg/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Finally, set only a User-Agent and see what happens:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmfPFGHdPPGdpFgSjAKF5vU06p9dS8lcIdWNPZpNttAI1t3OicOtfzaNw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As you can see, with only a User-Agent set, the headers still differ from a real browser's in many places: a lot of fields are simply missing. A site only needs to check for those missing fields to tell whether a request came from a program or from a browser.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Back to the WeChat Web problem</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Many people use third-party libraries such as wxpy or itchat to control their own WeChat account from Python and automate all sorts of tasks. Not long afterwards they report that their account has been barred from logging in to WeChat Web, and they wonder whether WeChat noticed their behavior, for example sending dozens of messages within one second, or replying to several people at the same moment.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">What I want to say is: you are giving yourselves too much credit. WeChat does not need to go to that much trouble to spot you. It can simply inspect the headers.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here is the part of the wxpy source code that deals with network requests:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmPKse9sZnWgdw513vs6N0Wczc2uxwxicpRDw8ic1mLhmicAeDCXdTsPooQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">wxpy is built on top of itchat, and its login is implemented through itchat. So let's look at where itchat issues its network requests:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmaicTLKnu3HLkvm6Cx3dQiaicYlu8mlo0FmwUctsQdDo3DjcxWE4FV44Pg/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here, self.core.s is just a requests Session, as shown below:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWmrrpfHtPYm0pgBva1OFA91HuiczDUe8rDWzYTxjaTP9tXPRHxOuV8fnw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Don't use these two libraries casually</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Both libraries put only a User-Agent into the headers and leave out every other field. So the moment you log in, WeChat already knows that this account is not being used from a browser!</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">So, to everyone who used wxpy or itchat and then got barred from WeChat Web: have no doubt, these two libraries are what did you in.</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The network-request code in these two libraries looks, at a glance, like it was written by someone who had been learning web scraping for two or three days. Using them is sending your WeChat account to its death.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">It is not just these two libraries. Take the Python danmaku (bullet-comment) package that many people use. It is even more extreme: when fetching Douyu live-stream information, it calls requests on the URL directly without setting any headers at all, as shown below:</p><img src="https://mmbiz.qpic.cn/mmbiz_png/ohoo1dCmvqdZjN025JXPRAicLafElIKWm7V2zSsue1xf8lghZCNTo1MDrb7n0DiaXE5stlvDZB2OiahWWtQWJxoVQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">The consequences are serious</strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The anti-bot teams at large sites today usually separate detecting crawlers from banning them. Once there are many anti-crawler rules, false positives are unavoidable. To keep the false-positive rate as low as possible, detection assigns each request a suspicion score: whenever you do something crawler-like, points are added, more for some behaviors and fewer for others. Only when your total reaches a certain level is the ban process triggered.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because HTTP is stateless, if the site you are scraping does not require login, then rotating IPs frequently may well help (this is how Abuyun's proxy pool got polluted).</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">But for something like WeChat, which requires login, every point of suspicion is tied directly to your account. So at first you may be able to log in to WeChat Web with wxpy without trouble. Your suspicion score is not yet high enough; after all, maybe some antique browser really does send that few headers.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">But you are already on the watch list. As soon as other suspicious behavior pushes your score higher and WeChat can be 100% sure that you are logging in to WeChat Web with an automated program, banning you follows naturally.</p>
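    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The separation of scoring from banning described in this post can be sketched roughly as follows. This is an illustration only, not WeChat's or any real site's logic, which is not public. The header list, the weights, <code>BAN_THRESHOLD</code>, <code>score_request</code> and <code>AccountRisk</code> are all assumptions made up for the example.</p>

```python
from collections import defaultdict
from typing import Dict, Mapping

# Assumed for illustration: headers a mainstream browser typically sends on a
# page navigation, each with a made-up penalty applied when it is missing.
EXPECTED_HEADER_WEIGHTS: Dict[str, int] = {
    "accept": 10,
    "accept-language": 20,
    "accept-encoding": 10,
    "upgrade-insecure-requests": 10,
}
# Default User-Agent prefixes of two common Python HTTP clients.
DEFAULT_CLIENT_UA_PREFIXES = ("python-requests/", "python-urllib/")
BAN_THRESHOLD = 100  # hypothetical cut-off


def score_request(headers: Mapping[str, str]) -> int:
    """Return a suspicion score for one request, based only on its headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    score = sum(
        weight
        for name, weight in EXPECTED_HEADER_WEIGHTS.items()
        if name not in lowered
    )
    user_agent = lowered.get("user-agent", "").lower()
    if not user_agent or user_agent.startswith(DEFAULT_CLIENT_UA_PREFIXES):
        score += 50
    return score


class AccountRisk:
    """Accumulate scores per logged-in account; the ban decision is separate."""

    def __init__(self, threshold: int = BAN_THRESHOLD) -> None:
        self.threshold = threshold
        self._totals: Dict[str, int] = defaultdict(int)

    def record(self, account_id: str, headers: Mapping[str, str]) -> int:
        """Add this request's score to the account and return the new total."""
        self._totals[account_id] += score_request(headers)
        return self._totals[account_id]

    def should_ban(self, account_id: str) -> bool:
        return self._totals[account_id] >= self.threshold
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Under these made-up weights, a client that sets only a browser-like User-Agent (with Accept and Accept-Encoding filled in by the HTTP library) scores 30 per request. It passes the first few times and crosses the threshold on the fourth, which mirrors the "fine at first, banned later" pattern. Because the total is keyed by account rather than by IP, rotating IPs would not reset it.</p>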




4zhvml8 posted on 2024-10-4 21:12:06

Looking forward to the OP's next post!

4lqedz posted 4 days ago

"Third place" (the third person to reply)