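Notes on building a web scraper with Java HttpClient

Summary

I have long used the mechanize package for scraping. It is very capable and handles nearly everything I have needed. I am now moving my stack to Java, so for fetching pages I wanted a library as capable as mechanize.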

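Tool of choice: HttpClient

HTTP request headers

Documentation: http://hc.apache.org/httpclient-3.x/logging.html (this link covers the 3.x line; the sample output below was produced by 4.5.3)

With logging configured, HttpClient prints the request and response headers, much like mechanize does: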
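One way to turn this logging on without a logging framework, sketched under the assumption that Commons Logging uses its bundled SimpleLog backend: set the system properties below before any HttpClient class is loaded. The class and method names are made up for illustration; the property keys are the ones described in the HttpClient logging guide. SimpleLog prints lines in its own format, so the output will not match the sample in this post character for character.

```java
// Hypothetical helper: enables header-level logging for HttpClient 4.x
// through Commons Logging's SimpleLog backend.
final class HeaderLoggingSetup {

    /** Call this before the first HttpClient class is loaded. */
    static void enableHeaderLogging() {
        // Route Commons Logging to its built-in SimpleLog implementation.
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        // Prefix every log line with a timestamp.
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        // Log request and response headers.
        System.setProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.headers", "debug");
        // Keep the much noisier wire (body) logging at error level.
        System.setProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.wire", "error");
    }

    public static void main(String[] args) {
        enableHeaderLogging();
        System.out.println(System.getProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.headers"));
    }
}
```

The same keys can be passed on the command line as -D options instead of being set in code.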

DEBUG [org.apache.http.headers] http-outgoing-0 >> GET / HTTP/1.1
DEBUG [org.apache.http.headers] http-outgoing-0 >> Host: www.baidu.com
DEBUG [org.apache.http.headers] http-outgoing-0 >> Connection: Keep-Alive
DEBUG [org.apache.http.headers] http-outgoing-0 >> User-Agent: Apache-HttpClient/4.5.3 (Java/1.8.0_144)
DEBUG [org.apache.http.headers] http-outgoing-0 >> Accept-Encoding: gzip,deflate
DEBUG [org.apache.http.headers] http-outgoing-0 << HTTP/1.1 200 OK
DEBUG [org.apache.http.headers] http-outgoing-0 << Server: bfe/1.0.8.18
DEBUG [org.apache.http.headers] http-outgoing-0 << Date: Mon, 04 Dec 2017 07:29:18 GMT
DEBUG [org.apache.http.headers] http-outgoing-0 << Content-Type: text/html
DEBUG [org.apache.http.headers] http-outgoing-0 << Last-Modified: Mon, 23 Jan 2017 13:28:24 GMT
DEBUG [org.apache.http.headers] http-outgoing-0 << Transfer-Encoding: chunked
DEBUG [org.apache.http.headers] http-outgoing-0 << Connection: Keep-Alive
DEBUG [org.apache.http.headers] http-outgoing-0 << Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
DEBUG [org.apache.http.headers] http-outgoing-0 << Pragma: no-cache
DEBUG [org.apache.http.headers] http-outgoing-0 << Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
DEBUG [org.apache.http.headers] http-outgoing-0 << Content-Encoding: gzip

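Changing the User-Agent

The headers above show User-Agent: Apache-HttpClient/4.5.3 (Java/1.8.0_144). That default gives the crawler away. A serious crawler should look as much like a real browser as possible, so the User-Agent has to change. mechanize makes this easy, and so does HttpClient: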

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("http://www.baidu.com");
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");

Setting the header on every request is tedious. It is more convenient to set it once when the client is built, so that later requests pick it up automatically:

CloseableHttpClient httpclient = HttpClients.custom()
  .setUserAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0")
  .build();
HttpGet httpGet = new HttpGet("http://www.baidu.com");
CloseableHttpResponse response1 = httpclient.execute(httpGet);

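Setting the Referer

The Referer header matters because it states where a request came from. Anti-scraping checks often verify that a Referer is present, or that it has the expected value. Setting it per request is easy in mechanize; here is how to do it with HttpClient: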

CloseableHttpClient httpclient = HttpClients.custom()
  .setUserAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0")
  .build();
HttpGet httpGet = new HttpGet("http://www.baidu.com");
httpGet.addHeader("Referer", "http://www.google.com");
CloseableHttpResponse response1 = httpclient.execute(httpGet);

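Request context

A crawler usually visits a chain of pages rather than a single one. For example, to fetch a user's account balance:
1. Open the login page and sign in with a username and password
2. Go to the account home page
3. Open the balance page and parse out the balance
These three steps are sequential: each request needs the cookies set by the earlier ones, so they must share a context. In mechanize this is easy: create one agent object, and successive requests in the same process are linked automatically. What about HttpClient? HttpClient 4.x already handles this: requests executed through the same client instance share that client's cookie store.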
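The cookie sharing described above can be made explicit by handing the client a cookie store you hold a reference to. This is a minimal sketch, assuming HttpClient 4.5 on the classpath; the class and helper names are invented for illustration, and http://www.baidu.com stands in for the login, account, and balance pages of a real site (a real login would normally be an HttpPost with form fields).

```java
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Hypothetical example class: one client, one cookie store, several requests.
final class SessionSketch {

    /** Builds a client whose cookies live in the given store. */
    static CloseableHttpClient newClient(CookieStore cookieStore) {
        return HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();
    }

    /** Fetches a URL and returns the body as text (empty if there is no body). */
    static String fetch(CloseableHttpClient client, String url) throws IOException {
        try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            HttpEntity entity = response.getEntity();
            return entity == null ? "" : EntityUtils.toString(entity, "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        CookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = newClient(cookieStore)) {
            // First request: any Set-Cookie headers end up in cookieStore.
            fetch(client, "http://www.baidu.com");
            for (Cookie cookie : cookieStore.getCookies()) {
                System.out.println(cookie.getName() + "=" + cookie.getValue());
            }
            // Second request on the same client: matching cookies are sent back.
            fetch(client, "http://www.baidu.com");
        }
    }
}
```

Keeping a reference to the store also makes it possible to inspect or clear cookies between runs, which the default built-in store does not expose as directly.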

Reference: http://blog.csdn.net/column/details/httpclient.html

Proxies

Fetching pages through a proxy is a must-have technique for scraping. With HttpClient it takes only a few lines:

HttpGet httpGet = new HttpGet("http://www.baidu.com");
HttpHost proxy = new HttpHost("111.178.233.26", 8081);
RequestConfig requestConfig = RequestConfig.custom().setProxy(proxy).build();
httpGet.setConfig(requestConfig);
CloseableHttpResponse response1 = httpclient.execute(httpGet);

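Parsing pages

jsoup is a Java HTML parser that can parse a URL or an HTML string directly. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.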
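As a rough illustration of the CSS-selector style, the sketch below pulls a balance out of an HTML string. It assumes jsoup is on the classpath; the class name, the sample markup, and the span.balance selector are all invented for this example, and a real page would need its own selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class: extracts one value from a page with a CSS selector.
final class BalanceParser {

    /** Returns the text of the first span.balance element, or null if none exists. */
    static String extractBalance(String html) {
        Document doc = Jsoup.parse(html);
        Element node = doc.select("span.balance").first();
        return node == null ? null : node.text();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a downloaded balance page.
        String html = "<html><body><div id=\"account\">"
                + "<span class=\"balance\">128.50</span>"
                + "</div></body></html>";
        System.out.println(extractBalance(html)); // prints 128.50
    }
}
```

In a crawler, the HTML string would be the response body fetched with HttpClient, so the fetching and parsing steps stay separate.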
