
mr_nobody/py_crawler

This repository has not declared an open-source license file (LICENSE). Before using the code, check the project description and its upstream dependencies.
crawler.py 1.86 KB
mr_nobody committed on 2017-03-30 18:01
#!/usr/bin/python3
# coding: utf-8
"""
__title__ = "my crawler"
__author__ = "Hans"
__mtime__ = "2017/3/30 0030"
# code is far away from bugs with the god animal protecting
I love animals. They taste delicious.
┏┓ ┏┓
┏┛┻━━━┛┻┓
┃ ☃ ┃
┃ ┳┛ ┗┳ ┃
┃ ┻ ┃
┗━┓ ┏━┛
┃ ┗━━━┓
┃ 神兽保佑 ┣┓
┃ 永无BUG! ┏┛
┗┓┓┏━┳┓┏┛
┃┫┫ ┃┫┫
┗┻┛ ┗┻┛
"""
from bs4 import BeautifulSoup
import socket
import urllib.parse

site = 'www.quanjing.com'
sock = socket.socket()
# sock.setblocking(False)
sock.connect((site, 80))
# Minimal HTTP/1.0 GET request; HTTP/1.0 closes the connection after the
# response, so the recv() loop below can read until the socket drains.
request = 'GET / HTTP/1.0\r\nHost: {}\r\n\r\n'.format(site)
# sendall() keeps writing until the whole request is sent (send() may not)
sock.sendall(request.encode('utf-8'))
response = b''
while True:
    chunk = sock.recv(4096)
    # print(type(chunk))  # debug: each chunk is a bytes object
    if not chunk:  # empty bytes: the server closed the connection
        break
    response += chunk
# Split the raw response into headers and body at the first blank line
header, body = response.split(b'\r\n\r\n', 1)
# print(body.decode())
# with open('index.html', 'wb') as f:
#     f.write(response)
soup = BeautifulSoup(body, 'lxml')
# print(soup.prettify())
# print(''.join(str(img) for img in soup.find_all('img')))
# print(str(soup.find_all('img')[0]))
imgs = soup.find_all('img')
print(len(imgs))
# with open('image.html', 'w') as f:
#     for img in imgs:
#         print(type(img))
#         f.write(str(img))
#         f.write('<br><hr>')
#         f.write('\r\n')
for img in imgs:
    # undo any percent-encoding in the src URL before printing
    print(urllib.parse.unquote(img['src']))
links = soup.find_all('a')
print('-' * 100)
for link in links:
    try:
        print(urllib.parse.unquote(link['href']))
    except KeyError:  # some <a> tags carry no href attribute
        pass
    # url = urllib.parse.unquote(link['href'])
    # p_url = urllib.parse.urlparse(url)
    # print(p_url)
# with open('image.html', 'w') as f:
#     for link in links:
#         try:
#             f.write(str(link))
#             f.write('\r\n')
#         except KeyError:
#             pass
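For comparison, the same listing can be produced with the standard library's urllib.request in place of the raw socket. This is a minimal sketch, not part of the repository; only the URL and the 'lxml' parser are taken from the script above.

#!/usr/bin/python3
# Sketch: same page listing via urllib.request instead of a raw socket.
from urllib.request import urlopen
from urllib.parse import unquote
from bs4 import BeautifulSoup

# urlopen handles the request line, Host header, and connection teardown
with urlopen('http://www.quanjing.com/') as resp:
    body = resp.read()

soup = BeautifulSoup(body, 'lxml')
for img in soup.find_all('img'):
    if img.has_attr('src'):
        print(unquote(img['src']))
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(unquote(link['href']))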