HtmlParse: An Ultra-Lightweight HTML Parsing and Scraping Tool
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">HtmlParse is an HTML document parsing tool for Windows. It quickly builds a DOM tree, which makes it easy to scrape elements from a web page. A DOM tree is the node tree of an HTML document, and each node is described by three values: a tag (Tag), attributes (Attribute), and text (Text).</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Parsing an HTML document means building a DOM tree; only once the tree has been built successfully can any scraping or analysis follow. Building the tree is a fairly complex process, because not every HTML document strictly follows the specification, so the parser needs a degree of fault tolerance. Parsing efficiency also matters: ideally the DOM tree is built in a single scan of the document rather than in repeated scans.</p>
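<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To make the single-scan, fault-tolerant idea concrete, here is a minimal Python sketch built on the standard library's html.parser. It is not HtmlParse's own implementation, which is not published here; the names Node, TreeBuilder, parse_html and find_all are hypothetical and exist only for this illustration. Each node carries the three values described above (tag, attributes, text), and the tolerance rule is deliberately simple: a closing tag closes the nearest open ancestor with that name, and a stray closing tag is ignored.</p>

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM node: a tag, its attributes, its own text, and its children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a node tree in one pass over the input (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend; void tags never hold children

    def handle_startendtag(self, tag, attrs):
        # Self-closing form (a tag written with a trailing slash):
        # add the node but do not descend into it.
        self.current.children.append(Node(tag, attrs, self.current))

    def handle_endtag(self, tag):
        # Tolerance: close the nearest open ancestor with this name and
        # ignore a stray closing tag that matches nothing.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Text is attached to the innermost element that is still open.
        self.current.text += data


def parse_html(markup):
    """Parse a markup string and return the root of the node tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Yield every descendant of node with the given tag, in document order."""
    for child in node.children:
        if child.tag == tag:
            yield child
        yield from find_all(child, tag)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With this sketch, collecting every hyperlink is a matter of iterating find_all(root, "a") and reading attrs.get("href") from each node. A production parser would apply many more recovery rules, such as implicitly closing list items and paragraphs, which this sketch omits.</p>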
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The rest of this article introduces HtmlParse.</p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;">1. Features</h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">1. Portable and self-contained: no third-party dependencies, and the file is under 150 KB;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">2. Fast parsing with a degree of tolerance for HTML syntax errors; it quickly parses an HTML document into a DOM tree;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">3. Command-line driven: different parameters retrieve the attribute values and text content of a specified tag, which is how web pages are scraped;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">4. Scraped data can be written out as JSON, so third-party programs can analyze and use it further;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">5. Script content can be extracted into a specified js file.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Download: <a style="color: black;">http://softlee.cn/HtmlParse.zip</a></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;">2. Usage</h2>
<div style="color: black; text-align: left; margin-bottom: 10px;">HtmlParse HtmlPathFile -tag TagName [-attr Attribute] [-o JsonPathFile]
</div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Parses the specified HTML document and writes the specified tag and its attributes to the specified file.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">HtmlPathFile: required. Path of the HTML document to parse. If the path contains spaces, enclose it in double quotes;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">-tag: required. Name of the HTML tag to extract;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">-attr: optional. Name of the tag attribute to extract. If omitted, all attribute values of the tag are returned;</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">-o: optional. File that receives the extracted content, saved in JSON format. If omitted, the output goes to the console. If the extracted tag is script or style, the content is saved as a js file instead.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To extract the doctype, use -tag doctype, which returns the entire doctype content. In this case any attribute given with -attr is ignored.</p>
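<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">When HtmlParse is driven from another program, the parameter rules above can be captured in a small helper. The following Python sketch is only an illustration: build_htmlparse_args and run_htmlparse are hypothetical names, the executable is assumed to be reachable as HtmlParse on the PATH, and the tool's exit codes and console encoding are not documented here, so the output is returned as raw bytes and no exit-code check is made.</p>

```python
import subprocess


def build_htmlparse_args(html_path, tag, attr=None, out_path=None,
                         exe="HtmlParse"):
    """Assemble the argument list for one HtmlParse call.

    The flags -tag, -attr and -o come from the usage text; the helper
    itself and the default executable name are assumptions.
    """
    if not html_path or not tag:
        raise ValueError("HtmlPathFile and -tag are both required")
    args = [exe, html_path, "-tag", tag]
    # The usage text says -attr is ignored for doctype, so leave it out.
    if attr and tag.lower() != "doctype":
        args += ["-attr", attr]
    if out_path:
        args += ["-o", out_path]
    return args


def run_htmlparse(html_path, tag, attr=None, out_path=None, exe="HtmlParse"):
    """Run HtmlParse and return whatever it wrote to the console, as bytes."""
    args = build_htmlparse_args(html_path, tag, attr, out_path, exe)
    # Passing a list lets subprocess quote paths that contain spaces.
    result = subprocess.run(args, capture_output=True)
    return result.stdout
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because the arguments are passed as a list, a path with spaces needs no manual double quotes; the quoting rule in the parameter description applies only when the command is typed in a shell.</p>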
<h2 style="color: black; text-align: left; margin-bottom: 10px;">3. Examples</h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">1. Extract all hyperlinks from a page</p>
<div style="color: black; text-align: left; margin-bottom: 10px;">HtmlParse c:/sina.html -tag a -attr href -o c:/sina.json
</div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Parses sina.html on drive C and writes all of its hyperlinks to sina.json. Here -tag a -attr href selects the href attribute of the hyperlink tag a.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">2. Extract all image links from a page</p>
<div style="color: black; text-align: left; margin-bottom: 10px;">HtmlParse c:/sina.html -tag img -attr src -o c:/sina.json
</div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Parses sina.html on drive C and writes all of its image links to sina.json.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">3. Extract all scripts from a page</p>
<div style="color: black; text-align: left; margin-bottom: 10px;">HtmlParse c:/sina.html -tag script -o c:/sina.js
</div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Parses sina.html on drive C and writes all of its scripts to sina.js.</p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;">4. Output</h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">If an output file is specified with -o, a JSON document is generated.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">TagName is the name of the extracted tag, such as a for hyperlinks. Its value is a JSON array in which each element is a JSON object made up of attributes and text. If -attr names the attribute to extract, AttrName is that attribute name, such as href or src. text is the text content of the tag; some tags, such as img and meta, have no text content, so the value is empty. The JSON format is as follows:</p>
<div style="color: black; text-align: left; margin-bottom: 10px;">{
"TagName":
[
{"AttrName":"AttrValue1", "text":"text1"},
{"AttrName":"AttrValue2", "text":"text2"}
]
}
</div>
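<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A third-party program can consume this format with a few lines of code. The Python sketch below assumes the file follows the layout just described (one top-level key per tag, holding an array of objects) and that it is encoded as UTF-8, which the tool's documentation does not confirm. The function names load_output and attribute_values are hypothetical.</p>

```python
import json


def load_output(path, encoding="utf-8"):
    """Read a JSON file written via -o; the encoding is an assumption."""
    with open(path, encoding=encoding) as handle:
        return json.load(handle)


def attribute_values(document, tag, attr):
    """Return the non-empty values of attr for every entry listed under tag.

    document is the parsed JSON object, e.g. {"a": [{"href": ..., "text": ...}]}.
    Entries that lack the attribute, or carry an empty value, are skipped.
    """
    entries = document.get(tag, [])
    return [entry[attr] for entry in entries if entry.get(attr)]
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For example, attribute_values(load_output("c:/sina.json"), "a", "href") would return the list of link targets from the hyperlink example, assuming that file exists and parses cleanly.</p>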
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Below is the JSON for all hyperlinks on a sina page:</p>
<div style="color: black; text-align: left; margin-bottom: 10px;">{
"a": [{
"href": "javascript:;",
"text": "设为首页"
}, {
"hre
</div>