Simulating User Login in a Crawler: Approach and Several Implementations

Preface

When crawling, we often run into pages that can only be accessed after logging in. This post looks at how a crawler can simulate a user login.

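A cookie is a small piece of text that the browser stores about the user. It holds information such as the user's ID, which the server can read and use to identify the user, for example to tell whether this is the user's first visit. The following explanation, found online, puts it clearly.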

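Browsers and web servers communicate over HTTP, and HTTP is a stateless protocol. When a user requests a page, the web server simply responds and then closes the connection to that user. As a result, the server treats every request as if it were a first visit, whether or not it is, and the drawbacks are easy to imagine. To make up for this, Netscape developed the cookie as a way to keep identifying information about a user: a means for a web server to store information on the visitor's disk through the browser. A cookie is a very small piece of plain text that the server sends to the browser.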

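Definition: a cookie is a small amount of named data stored by the web browser and associated with a particular web page or website. In practice, it is a text file tied to a site or page that holds some information about the user.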

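A cookie is a small text file that a website stores on your machine while you browse it. It can record information such as your user ID, password, the pages you visited and how long you stayed. When you return, the site reads the cookie, learns this information, and acts on it, for example by showing a welcome message or logging you in without asking for your ID and password again.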
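To make the idea concrete, here is a minimal sketch of what a cookie looks like on the wire. It uses only the JDK's java.net.HttpCookie, not HttpClient, and the header value is a made-up example of what a server might send after a login.

```java
import java.net.HttpCookie;
import java.util.List;

class SetCookieDemo {
    // Parses one Set-Cookie header line into HttpCookie objects.
    static List<HttpCookie> parse(String setCookieHeader) {
        return HttpCookie.parse(setCookieHeader);
    }

    public static void main(String[] args) {
        // Hypothetical header; real sites choose their own cookie names and attributes.
        String header = "Set-Cookie: JSESSIONID=abc123; Path=/; HttpOnly";
        for (HttpCookie cookie : parse(header)) {
            System.out.println(cookie.getName() + " = " + cookie.getValue()
                    + " (path " + cookie.getPath() + ", httpOnly " + cookie.isHttpOnly() + ")");
        }
    }
}
```

The name/value pair is what the browser sends back on later requests; the remaining attributes only tell the browser when it is allowed to send it.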

In this post we describe how to use cookies with HttpClient 4.x: how to save them, and how to use the cookies of a logged-in session to access pages.

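In HttpClient 4.x, an HttpContext holds the context of a request. Put simply, it is an object that stores information about requests. If a request is executed with an HttpContext, it carries the information saved in that context, such as the session ID. After the request completes, HttpClient also saves information returned by the server into the context, so the next request that uses the same HttpContext carries it along. BasicHttpContext keeps a Map of attributes describing a request/response exchange. When a response comes back, its details are set on the context, and the cookies from the response, including the returned session ID, end up there as well. When a second request is made with the same context, the session ID is taken from the context and sent to the server. Because the session ID is the same, the server resolves it to the same session.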
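The store-then-replay behaviour described above can be sketched with the JDK's own java.net.CookieManager, which plays a role similar to the cookie store held by an HttpContext. This is an illustration of the mechanism, not HttpClient's implementation; the class name, method name and example.com URLs are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class CookieReplayDemo {
    // Records the Set-Cookie header of one response, then reports which
    // Cookie values a later request to nextUri would carry.
    static List<String> cookiesForNextRequest(String setCookie, URI responseUri, URI nextUri) {
        try {
            CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
            // "Response" side: store what the server sent.
            manager.put(responseUri, Map.of("Set-Cookie", List.of(setCookie)));
            // "Request" side: ask which cookies apply to the next URI.
            Map<String, List<String>> headers = manager.get(nextUri, Collections.emptyMap());
            return headers.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URI login = URI.create("https://example.com/login");   // hypothetical URLs
        URI admin = URI.create("https://example.com/admin/");
        System.out.println(cookiesForNextRequest("JSESSIONID=abc123; Path=/", login, admin));
    }
}
```

A request to a different host gets no cookie back from the manager, which is the same scoping a browser applies.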

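Let's look at an example of making requests with cookies through an HttpContext.

First add Apache HttpClient, which offers many more practical APIs than the HTTP classes built into the JDK.

Dependency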

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>

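Logging In with HttpClient

Approach

Use Fiddler to capture the login request and note its parameters and headers, then build the same request with HttpClient. Once the request succeeds, take the cookies from the response headers. With that "login information" we can then access pages that are only available after logging in.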
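Before turning to the real site, the whole flow can be tried against a throwaway local server. The sketch below is only an illustration under stated assumptions: it uses the JDK's built-in com.sun.net.httpserver and java.net.http.HttpClient (Java 11+) rather than Apache HttpClient, and the /login and /admin paths, the SESSION cookie and the class name are all invented for the demo.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.CookieManager;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class LoginFlowDemo {
    // Writes a small text response with the given status code.
    static void respond(HttpExchange exchange, int status, String text) throws IOException {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    // Fake site: POST /login sets a session cookie, GET /admin needs it back.
    static HttpServer startFakeSite() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/login", exchange -> {
            exchange.getRequestBody().readAllBytes(); // consume the form body
            exchange.getResponseHeaders().add("Set-Cookie", "SESSION=demo-token; Path=/");
            respond(exchange, 200, "logged in");
        });
        server.createContext("/admin", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("SESSION=demo-token");
            respond(exchange, loggedIn ? 200 : 403, loggedIn ? "welcome" : "please log in");
        });
        server.start();
        return server;
    }

    // Optionally logs in, then returns the status code of the protected page.
    static int fetchAdmin(boolean loginFirst) {
        HttpServer server = null;
        try {
            server = startFakeSite();
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            // The CookieManager keeps cookies between requests, like the HttpContext does.
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .cookieHandler(new CookieManager())
                    .build();
            if (loginFirst) {
                HttpRequest login = HttpRequest.newBuilder(URI.create(base + "/login"))
                        .header("Content-Type", "application/x-www-form-urlencoded")
                        .POST(HttpRequest.BodyPublishers.ofString("username=xxx&password=xxxx"))
                        .build();
                client.send(login, HttpResponse.BodyHandlers.ofString());
            }
            HttpRequest admin = HttpRequest.newBuilder(URI.create(base + "/admin")).GET().build();
            return client.send(admin, HttpResponse.BodyHandlers.ofString()).statusCode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("with login:    " + fetchAdmin(true));
        System.out.println("without login: " + fetchAdmin(false));
    }
}
```

The only difference between the two runs is whether the login request was sent first through the same client, which is exactly what the cookie-carrying context provides in the HttpClient code.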

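import com.google.common.collect.Lists; // Guava, needed in addition to the dependency above
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHeaders;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.junit.Test;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

@Test
public void login() throws URISyntaxException, IOException {

    // Create an HttpClientContext to hold the cookies
    HttpClientContext httpClientContext = HttpClientContext.create();

    String url = "https://sanii.cn/login";

    // Build the login parameters (appended to the URI as query parameters)
    List<NameValuePair> nameValuePairList = Lists.newArrayList();
    nameValuePairList.add(new BasicNameValuePair("username", "xxx"));
    nameValuePairList.add(new BasicNameValuePair("password", "xxxx"));
    nameValuePairList.add(new BasicNameValuePair("remeberMe", "on"));

    URI build = new URIBuilder(url).addParameters(nameValuePairList).build();

    // Build the request headers (the header value must not repeat the header name)
    List<Header> headerList = Lists.newArrayList();
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT, "application/json, text/javascript, */*; q=0.01"));
    headerList.add(new BasicHeader(HttpHeaders.REFERER, "https://sanii.cn/admin/login"));
    headerList.add(new BasicHeader(HttpHeaders.CONNECTION, "keep-alive"));
    // No "br" here: HttpClient 4.5 does not decode Brotli-compressed responses
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"));
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT_LANGUAGE, "zh-CN,zh;q=0.9"));

    // Build the HttpClient with these default headers
    CloseableHttpClient client = HttpClients.custom().setDefaultHeaders(headerList).build();

    // Build the POST request
    HttpUriRequest request = RequestBuilder.post(build).build();
    // Execute it; cookies from the response are stored in the context
    CloseableHttpResponse httpResponse = client.execute(request, httpClientContext);
    // Read the response entity
    HttpEntity entity = httpResponse.getEntity();

    String content = EntityUtils.toString(entity);

    // Get the cookies returned by the login request
    CookieStore cookieStore = httpClientContext.getCookieStore();

    // Back-end home page: access it with the cookies obtained after logging in
    CloseableHttpClient indexClient = HttpClientBuilder.create().setDefaultCookieStore(cookieStore).build();
    HttpUriRequest index = RequestBuilder.get("https://sanii.cn/xxx/").build();
    CloseableHttpResponse indexresponse = indexClient.execute(index, httpClientContext);
    HttpEntity indexresponseEntity = indexresponse.getEntity();
    String indexResponse = EntityUtils.toString(indexresponseEntity);

}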

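Result

The response shows that the login succeeded. Now look at the cookies returned with the response.

Next we access the back end with the cookies obtained after logging in.

The test succeeded.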

Optimization


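Fetching the cookies by hand and passing them to a second client, as above, is more work than necessary. The optimized code below simply reuses the same context for the follow-up request.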

// Create an HttpClientContext to hold the cookies
HttpClientContext httpClientContext = HttpClientContext.create();

String url = "https://sanii.cn/login";
// Default client, used below for the follow-up request
CloseableHttpClient httpClient = HttpClients.createDefault();
// Build the login parameters (appended to the URI as query parameters)
List<NameValuePair> nameValuePairList = Lists.newArrayList();
nameValuePairList.add(new BasicNameValuePair("username", "xxx"));
nameValuePairList.add(new BasicNameValuePair("password", "xxxx"));
nameValuePairList.add(new BasicNameValuePair("remeberMe", "on"));

URI build = new URIBuilder(url).addParameters(nameValuePairList).build();

// Build the request headers (the header value must not repeat the header name)
List<Header> headerList = Lists.newArrayList();
headerList.add(new BasicHeader(HttpHeaders.ACCEPT, "application/json, text/javascript, */*; q=0.01"));
headerList.add(new BasicHeader(HttpHeaders.REFERER, "https://sanii.cn/admin/login"));
headerList.add(new BasicHeader(HttpHeaders.CONNECTION, "keep-alive"));
// No "br" here: HttpClient 4.5 does not decode Brotli-compressed responses
headerList.add(new BasicHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"));
headerList.add(new BasicHeader(HttpHeaders.ACCEPT_LANGUAGE, "zh-CN,zh;q=0.9"));

// Build the HttpClient with these default headers
CloseableHttpClient client = HttpClients.custom().setDefaultHeaders(headerList).build();

// Build the POST request
HttpUriRequest request = RequestBuilder.post(build).build();
// Execute it; cookies from the response are stored in the context
CloseableHttpResponse httpResponse = client.execute(request, httpClientContext);
// Read the response entity
HttpEntity entity = httpResponse.getEntity();

// Access the back-end home page; the shared context supplies the login cookies
CloseableHttpResponse indexResponse = httpClient.execute(new HttpGet("https://sanii.cn/xxx/"), httpClientContext);

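Simulating a Browser Login with Selenium

While testing, I found that some sites, such as Tencent and Sina, use more anti-crawling techniques. Even when the login has no CAPTCHA, the login parameters are encrypted, which makes replaying the captured request impractical.

In that case we can use Selenium, a browser automation testing tool, to simulate what a normal user does in the browser when logging in. An earlier post covered Selenium with Chrome and its visible UI, which is inefficient. Here we focus on headless browsers: HtmlUnit and PhantomJS.

For details, see the Selenium website's overview of the supported browsers (Selenium WebDriver), or search for an introduction on Baidu.

Logging In to QQ Zone (Qzone) with PhantomJS

This example uses Selenium and PhantomJS, with HttpClient for the HTTP requests. See my first post on Selenium for an introduction.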

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.util.EntityUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.io.IOException;
import java.util.Set;

public static void main(String[] args) throws IOException {

    // Create the driver through the factory class (see "Code Encapsulation")
    // WebDriver driver = DriverStrategyFactory.getInstance(DriverEnum.Chrome);
    WebDriver driver = DriverStrategyFactory.getInstance(DriverEnum.PhantomJS);
    driver.get("https://qzone.qq.com/");
    // switchTo().frame() returns the same driver, now focused on the login iframe
    WebDriver login_frame = driver.switchTo().frame(driver.findElement(By.id("login_frame")));
    login_frame.findElement(By.id("switcher_plogin")).click();
    login_frame.findElement(By.id("u")).sendKeys("1300100082");
    login_frame.findElement(By.id("p")).sendKeys("xxxx");
    login_frame.findElement(By.id("login_button")).click();
    // Wait for the page to finish loading, with a 3-second timeout
    (new WebDriverWait(driver, 3)).until(new ExpectedCondition<Boolean>() {
        @Override
        public Boolean apply(WebDriver d) {
            // Without this check, reading a DOM node before the page has loaded throws.
            // If the node exists we move on; otherwise the wait polls again until the timeout.
            return isLoad(d);
        }

        private Boolean isLoad(WebDriver d) {
            try {
                d.findElement(By.id("tab_menu_friend")).findElement(By.className("qz-main"));
                return true;
            } catch (Exception e) {
                return false;
            }
        }
    });
    login_frame.findElement(By.id("tab_menu_friend")).findElement(By.className("qz-main")).click();
    // Read the cookies from the browser session, then close the browser
    Set<Cookie> cookies = login_frame.manage().getCookies();
    login_frame.quit();
    // Create an HttpClientContext and copy the browser cookies into its cookie store
    HttpClientContext httpClientContext = HttpClientContext.create();
    BasicCookieStore basicCookieStore = new BasicCookieStore();
    cookies.forEach(cookie -> {
        BasicClientCookie clientCookie = new BasicClientCookie(cookie.getName(), cookie.getValue());
        clientCookie.setDomain(cookie.getDomain());
        clientCookie.setPath(cookie.getPath());
        basicCookieStore.addCookie(clientCookie);
    });
    httpClientContext.setCookieStore(basicCookieStore);

    // Alternative: set the cookie store on the client instead of the context
    // CloseableHttpClient build = HttpClientBuilder.create().setDefaultCookieStore(basicCookieStore).build();
    // HttpUriRequest request = RequestBuilder.get("https://user.qzone.qq.com/1300100082/").build();
    // HttpEntity httpEntity = build.execute(request).getEntity();

    CloseableHttpResponse execute = HttpClients.createDefault().execute(new HttpGet("https://user.qzone.qq.com/1300100082/"), httpClientContext);
    String index = EntityUtils.toString(execute.getEntity());
}
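The wait in the code above works by polling: the condition is re-evaluated at short intervals (every 500 ms by default in Selenium) until it returns true or the timeout expires. The following is a plain-Java sketch of that idea, not Selenium's implementation; the class and method names are invented for the illustration.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class PollingWait {
    // Re-checks the condition every pollInterval until it holds or the timeout elapses.
    // An exception thrown by the condition is treated as "not ready yet".
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                if (condition.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException notReadyYet) {
                // e.g. the element is not in the DOM yet; fall through and retry
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 200; // pretend the page needs 200 ms
        boolean loaded = waitUntil(() -> System.currentTimeMillis() >= readyAt,
                Duration.ofSeconds(3), Duration.ofMillis(50));
        System.out.println("loaded: " + loaded);
    }
}
```

This is also why constructing a wait object on its own has no effect: nothing is checked and no time passes until the polling loop actually runs.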

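Result

Obtaining the cookies after a successful login

The login succeeded and the cookies were retrieved. Next we use them to open the personal home page.

The personal home page loads successfully.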
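As an aside, when only a single request needs the browser's cookies, they can also be flattened into one Cookie request header instead of being copied into a cookie store. This is a minimal sketch of that alternative, not what the code above does; it assumes the cookies have already been read into name/value pairs, and the names and values shown are made up. Domain and path scoping is lost, so it only suits requests to the site the cookies came from.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CookieHeaderSketch {
    // Joins name/value pairs into the value of a Cookie request header.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(entry -> entry.getKey() + "=" + entry.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sessionId", "abc123"); // hypothetical cookie names and values
        cookies.put("token", "xyz789");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```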

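Code Encapsulation

Because different scenarios sometimes call for different drivers (HtmlUnit, PhantomJS, Chrome), I wrapped them in a simple set of utility classes.

Download: click here

It uses the Strategy and Factory design patterns.

Interface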

/**
 * Driver strategy interface: returns a different driver depending on the enum value.
 *
 * @author Administrator
 */
public interface DriverStrategy {

    /**
     * @param driverEnum the driver enum value
     * @return the driver
     */
    WebDriver getDriver(DriverEnum driverEnum);
}

The three drivers each implement this interface and create their own kind of driver (the code is not shown here).
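Since the implementations are not shown, here is a rough, self-contained sketch of the pattern. It is not the author's code: FakeDriver stands in for Selenium's WebDriver so the example runs without Selenium on the classpath, and every name in it is hypothetical.

```java
class DriverStrategySketch {
    // Stand-in for Selenium's WebDriver (assumption made for this sketch only).
    interface FakeDriver {
        String name();
    }

    enum Kind { HTML_UNIT, PHANTOM_JS, CHROME }

    // The strategy: each implementation knows how to build one kind of driver.
    interface Strategy {
        FakeDriver create();
    }

    static class HtmlUnitStrategy implements Strategy {
        // A real version would return an HtmlUnit driver; no driver binary is needed.
        public FakeDriver create() { return () -> "HtmlUnitDriver"; }
    }

    static class PhantomJsStrategy implements Strategy {
        // A real version would first point Selenium at the phantomjs executable.
        public FakeDriver create() { return () -> "PhantomJSDriver"; }
    }

    static class ChromeStrategy implements Strategy {
        // A real version would first set the "webdriver.chrome.driver" system property.
        public FakeDriver create() { return () -> "ChromeDriver"; }
    }

    // The factory: maps an enum value to the matching strategy.
    static Strategy strategyFor(Kind kind) {
        switch (kind) {
            case HTML_UNIT:
                return new HtmlUnitStrategy();
            case PHANTOM_JS:
                return new PhantomJsStrategy();
            case CHROME:
                return new ChromeStrategy();
            default:
                throw new IllegalArgumentException("Unsupported driver: " + kind);
        }
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind + " -> " + strategyFor(kind).create().name());
        }
    }
}
```

Callers depend only on the enum and the factory method, so adding a driver means adding one strategy class and one case.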

Driver Factory Class

/**
 * Driver strategy factory.
 *
 * @author Administrator
 */
public class DriverStrategyFactory {

    /**
     * Returns the driver for the given strategy.
     *
     * @param driverEnum the driver enum value
     * @return the matching driver, or null if the value is not supported
     * @see me.liao.gecco.selenium.utils.DriverEnum
     */
    public static WebDriver getInstance(DriverEnum driverEnum) {
        switch (driverEnum) {
            case HtmlUnit:
                return new HtmlUnitDriverStrategy().getDriver(driverEnum);
            case PhantomJS:
                return new PhantomJSDriverStrategy().getDriver(driverEnum);
            case Chrome:
                return new ChromeDriverStrategy().getDriver(driverEnum);
            default:
                return null;
        }
    }
}

Driver Enum

/**
 * Driver strategy enum: name, code, and description (the driver path where one is needed).
 */
public enum DriverEnum {

    HtmlUnit("HtmlUnitDriver", 1, "HtmlUnit needs no driver path"),
    PhantomJS("PhantomJSDriver", 2, "driver/phantomjs.exe"),
    Chrome("ChromeDriver", 3, "driver/chromedriver.exe");

    private String name;
    private int code;
    private String des;

    DriverEnum(String name, int code, String des) {
        this.name = name;
        this.code = code;
        this.des = des;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getCode() {
        return code;
    }

    public void setCode(int code) {
        this.code = code;
    }

    public String getDes() {
        return des;
    }

    public void setDes(String des) {
        this.des = des;
    }
}

Basic Usage

// Any of the following three
WebDriver htmlUnit = DriverStrategyFactory.getInstance(DriverEnum.HtmlUnit);
WebDriver phantomJs = DriverStrategyFactory.getInstance(DriverEnum.PhantomJS);
WebDriver chrome = DriverStrategyFactory.getInstance(DriverEnum.Chrome);

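Personal Summary of the Three Drivers

HtmlUnit

Suited to pages whose data is rendered directly (not loaded asynchronously via Ajax)
The page must not redirect too many times
No driver binary required
Headless

PhantomJS

A middle ground that handles the vast majority of scenarios
Requires a driver binary
The page must not redirect too many times, though it copes better than HtmlUnit
Headless

Chrome

The most stable and the most capable
Inefficient (every page has to finish loading, which means rendering CSS and loading JS, and that is slow)
Requires a driver binary
Has a UI

Pitfall:

On one site I crawled, clicking the search button triggered 7 consecutive redirects, and each request was very slow, for a total load time of about 20 seconds. Neither HtmlUnit nor PhantomJS could reach the final redirect target in that case. Choose your driver carefully.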
