妹子图官方网址:https://www.mzitu.com/
刚接触到 BeautifulSoup,所以拿来试下效果,起伏跌宕出来效果。
具体思路?官网首页链接--> 获取分页面链接--> 通过分页面获取图片链接
看下步骤:
一、分析下页面
1.1 先确保访问正常:
头部信息:
url = "https://www.mzitu.com"
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
"Referer": "https://www.mzitu.com/101553"
}
def load_page(url):
try:
res = requests.get(url,headers=headers)
if res.status_code == 200:
print('页面请求完毕')
return res.text
except:
print('网络访问错误')
1.2 把当前页面所有 page 的 url 相关信息抓取出来
把如上页面可以抓取到当前页面所有女神合集 page。
def get_page(url):
html = requests.get(url,headers=headers)
soup = BeautifulSoup(html.text,'lxml')
#获取首页所有妹子页面
all_url = soup.find("ul",{"id":"pins"}).find_all("a")
# print(all_url)
1.3 获取详细页面 url
count = 1
for href in all_url:
count=count+1
# print(href)
if count %2 != 0:
href1 = href['href'] #查找匹配出分页面中的page链接
因为通过 for 循环得到的 href 链接是一样的,所以只取一个:取奇偶
结果如下:
1.4 得到页面链接后,即可抓页面下图片的 url。创建目录
for href2 in href:
res2 = requests.get(href1,headers=headers)
soup2 = BeautifulSoup(res2.text,'lxml')
# pict_url = soup2.find("div",{"class":"main-image"}).find("img")['src'] #图片链接
# print(pict_url)
next_pic = soup2.find_all("span")[9]
max_url = next_pic.get_text()
# print(max_url)
name = soup2.find("div",{"class":"main-image"}).find("img")['alt'] #分页面名称
os.mkdir(name)
os.chdir(name)
通过如下图的当前 page 中图片最后一张对应的 span 标签为第 9 个。
1.5 下载图片
标题获取:
图片的对应链接如下:
图片链接获取如:
for i in range(1,int(max_url)+1):
next_url = href1+'/'+str(i)
res3 = requests.get(next_url,headers=headers)
soup3 = BeautifulSoup(res3.text,'lxml')
pic_address = soup3.find("div",{"class":"main-image"}).find('img')['src']
title = soup3.find('h2')
name1 = title.get_text()
img = requests.get(pic_address,headers=headers)
with open(name1+'.jpg','wb') as f:
f.write(img.content)
大功告成:
完整代码:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Time : 2019/8/20 15:39
# @Author : cuijianzhe
# @File : meizitu.py
# @Software: PyCharm
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import time
import os
url = "https://www.mzitu.com"
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
"Referer": "https://www.mzitu.com/101553"
}
def load_page(url):
try:
res = requests.get(url,headers=headers)
if res.status_code == 200:
print('页面请求完毕')
return res.text
except:
print('网络访问错误')
#获取整个页面
def get_page(url):
html = requests.get(url,headers=headers)
soup = BeautifulSoup(html.text,'lxml')
#获取首页所有妹子页面
all_url = soup.find("ul",{"id":"pins"}).find_all("a")
# print(all_url)
count = 1
for href in all_url:
count=count+1
# print(href)
if count %2 != 0:
href1 = href['href'] #查找匹配出分页面中的page链接
# print(href1)
for href2 in href:
res2 = requests.get(href1,headers=headers)
soup2 = BeautifulSoup(res2.text,'lxml')
# pict_url = soup2.find("div",{"class":"main-image"}).find("img")['src'] #图片链接
# print(pict_url)
next_pic = soup2.find_all("span")[9]
max_url = next_pic.get_text()
name = soup2.find("div",{"class":"main-image"}).find("img")['alt']
os.mkdir(name) #第一张图名称作为目录
os.chdir(name)
for i in range(1,int(max_url)+1):
next_url = href1+'/'+str(i)
res3 = requests.get(next_url,headers=headers)
soup3 = BeautifulSoup(res3.text,'lxml')
pic_address = soup3.find("div",{"class":"main-image"}).find('img')['src']
title = soup3.find('h2')
name1 = title.get_text()
img = requests.get(pic_address,headers=headers)
with open(name1+'.jpg','wb') as f:
f.write(img.content)
if __name__ == '__main__':
load_page(url)
get_page(url)
欢迎来到这里!
我们正在构建一个小众社区,大家在这里相互信任,以平等 • 自由 • 奔放的价值观进行分享交流。最终,希望大家能够找到与自己志同道合的伙伴,共同成长。
注册 关于