esc0rp 发表于 2024-8-25 15:10:35

详解Nginx怎么样查看搜索引擎蜘蛛爬虫行径:爬行次数、爬行页面等


    <div style="color: black; text-align: left; margin-bottom: 10px;">
      <h1 style="color: black; text-align: left; margin-bottom: 10px;">概述</h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">近期</span>阿里云经常会被<span style="color: black;">各样</span>爬虫<span style="color: black;">光临</span>,有的是搜索引擎爬虫,有的不是,<span style="color: black;">一般</span><span style="color: black;">状况</span>下这些爬虫都有UserAgent,而<span style="color: black;">咱们</span><span style="color: black;">晓得</span>UserAgent是<span style="color: black;">能够</span>伪装的,UserAgent的本质是Http请求头中的一个选项设置,<span style="color: black;">经过</span>编程的方式<span style="color: black;">能够</span>给请求设置任意的UserAgent。 </p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">下面的Linux命令<span style="color: black;">能够</span>让你清楚的<span style="color: black;">晓得</span>蜘蛛的爬行<span style="color: black;">状况</span>。<span style="color: black;">咱们</span>针对nginx服务器进行分析,日志文件所在目录:</p>/usr/local/nginx/logs/access.log,access.log这个文件记录的应该是<span style="color: black;">近期</span>一天的日志<span style="color: black;">状况</span>,<span style="color: black;">首要</span>请<span style="color: black;">瞧瞧</span>日志<span style="color: black;">体积</span>,<span style="color: black;">倘若</span>很大(超过50MB)<span style="color: black;">意见</span>别用这些命令分析,<span style="color: black;">由于</span>这些命令很消耗CPU,<span style="color: black;">或</span>更新下来放到分析机上执行,<span style="color: black;">以避免</span>影响服务器性能。
      <h1 style="color: black; text-align: left; margin-bottom: 10px;">常用蜘蛛的域名</h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">常用蜘蛛的域名都和搜索引擎官网的域名<span style="color: black;">关联</span>,例如:</p>百度的蜘蛛<span style="color: black;">一般</span>是baidu.com<span style="color: black;">或</span>baidu.jp的子域名google爬虫<span style="color: black;">一般</span>是googlebot.com的子域名微软bing搜索引擎爬虫是search.msn.com的子域名搜狗蜘蛛是crawl.sogou.com的子域名<h1 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;">1、计算百度蜘蛛爬行的次数</strong></h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep Baiduspider | wc</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/81381c27451e41afa6c83d2c8df113c8~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=BaswPVQobxjBfzGcZSwMatI7aMs%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">最左面的数值<span style="color: black;">表示</span>的<span style="color: black;">便是</span>爬行次数。</p>
      <h1 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;">2、百度蜘蛛的<span style="color: black;">仔细</span>记录(Ctrl C<span style="color: black;">能够</span>终止)</strong></h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep Baiduspider</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/f6a6dd2157b0455e8f2bb4a9c02bd10a~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=uROWUzc05WeYqYVeDgTwSU6teGU%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">亦</span><span style="color: black;">能够</span>用下面的命令:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep Baiduspider | tail -n 10</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep Baiduspider | head -n 10</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/cdf2c27446b04cb68d2bb703747b9ec3~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=HCgN4nlPC9l68%2Fav1AKxmkGOmW4%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">说明:只看最后10条或最前10条</p>
      <h1 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;">3、百度蜘蛛抓取首页的<span style="color: black;">仔细</span>记录</strong></h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep Baiduspider | grep “GET / HTTP”</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/96fdfa4414bc42cb8dddc4a4bfaa16c0~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=3BR6TPC24DWctZsMTHCnf2JLBLU%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">百度蜘蛛<span style="color: black;">好似</span>对首页非常热爱<span style="color: black;">每一个</span>钟头都来<span style="color: black;">光临</span>,而谷歌和雅虎蜘蛛更<span style="color: black;">爱好</span>内页。</p>
      <h1 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;">4、百度蜘蛛派性记录时间点分布</strong></h1>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">cat access.log | grep “Baiduspider ” | awk ‘{print $4}</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/04490385dbdf4d6bac68014359094799~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=bqhbuP4KDgJGhGO4N%2BeWZDiIIO8%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <h1 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;">5、百度蜘蛛爬行页面按次数降序列表</strong></h1># cat access.log |grep "Baiduspider"|awk {print $7}|sort | uniq -c |sort -r<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1a9732538e794d7486684a4cbfd4acb2~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=CMHf95%2BpRHm5Au%2ByrFWnOdL26Hk%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">篇幅有限,关于nginx去查看搜索引擎蜘蛛爬虫的<span style="color: black;">行径</span>的内容就介绍到这了,上面的<span style="color: black;">有些</span>命令都是比较常用的,后面会分享<span style="color: black;">更加多</span>关于nginx方面内容,感兴趣的<span style="color: black;">伴侣</span><span style="color: black;">能够</span>关注下!</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/4bcdfef956b142859cf4e11a43e1cb19~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1725097663&amp;x-signature=hKHtpNqmHpRBhxUYAgF%2BKiXvVL4%3D" style="width: 50%; margin-bottom: 20px;"></div>
    </div>




j8typz 发表于 2024-9-30 03:23:24

感谢您的精彩评论,为我带来了新的思考角度。

4zhvml8 发表于 2024-11-11 21:51:16

这夸赞甜到心里,让我感觉温暖无比。

nqkk58 发表于 2024-11-14 02:22:52

谷歌外贸网站优化技术。
页: [1]
查看完整版本: 详解Nginx怎么样查看搜索引擎蜘蛛爬虫行径:爬行次数、爬行页面等