A Simple Python Crawler
1. Crawling an entire page
import urllib.request

# Request header; User-Agent describes the browser type
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}

# Request object asking for the content of a site
request = urllib.request.Request('http://www.sina.com', headers=header)

# Responses from the site (without and with the custom header)
response1 = urllib.request.urlopen('http://www.sina.com')
response2 = urllib.request.urlopen(request)

# Read the response body as a byte stream
html = response1.read()

# Write the byte stream to a file
f = open('./4.txt', 'wb')
f.write(html)
f.close()
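If the page is needed as text rather than raw bytes, the byte stream can be decoded first. A minimal sketch, assuming the page is UTF-8 encoded (the actual site may declare a different charset):

import urllib.request

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                        '(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}
request = urllib.request.Request('http://www.sina.com', headers=header)

# with closes the connection automatically when the block ends
with urllib.request.urlopen(request) as response:
    raw = response.read()                          # bytes
    text = raw.decode('utf-8', errors='replace')   # assume UTF-8; replace undecodable bytes

# save as text rather than bytes
with open('./4.txt', 'w', encoding='utf-8') as f:
    f.write(text)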
2. Crawling Douban short comments
The load(url) function fetches and returns the content of the page at url.
import urllib.request

def load(url):
    req = urllib.request.Request(url)
    res = urllib.request.urlopen(req)
    html = res.read()
    return html
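Note that some sites, Douban included, may refuse requests that carry no browser-like User-Agent. A variant of load that reuses the header from section 1 might look like this (load_with_header is only an illustrative name, not part of the original script):

import urllib.request

def load_with_header(url):
    # Same as load(), but sends a browser-like User-Agent,
    # which some sites require before returning the page
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}
    req = urllib.request.Request(url, headers=header)
    res = urllib.request.urlopen(req)
    return res.read()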
The write(html, t) function saves the html content to the file t.
def write(html, t):
    f = open(t, 'wb')
    f.write(html)
    f.close()
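Used together, load and write fetch one page and save it to disk, for example:

# Example usage: fetch the first comment page and save it locally
html = load('https://movie.douban.com/subject/24773958/comments?start=0&limit=20&sort=new_score&status=P&percent_type=')
write(html, '1.html')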
The spider(url, begin, end) function crawls the comments for the given range of pages.
Here, https://movie.douban.com/subject/24773958/comments?start=0&limit=20&sort=new_score&status=P&percent_type= is the first page of short comments.
def spider(url, begin, end):
    # The url argument is kept for the command-line interface below;
    # the comments URL itself is built in place
    for i in range(begin, end + 1):
        # Each page holds 20 comments, so page i starts at offset 20 * (i - 1)
        start = 20 * (i - 1)
        the_url = 'https://movie.douban.com/subject/24773958/comments?start=' + str(start) + '&limit=20&sort=new_score&status=P&percent_type='
        html = load(the_url)
        t = str(i) + '.html'
        write(html, t)
        print('Page %d saved' % i)
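The start parameter is an offset in comments rather than a page number: page 1 corresponds to start=0, page 2 to start=20, and so on, which is why the loop multiplies by 20. If you would rather not concatenate the query string by hand, urllib.parse.urlencode can build it; the helper below is only a sketch using the same parameters as the URL above:

import urllib.parse

def comment_url(page):
    # Page 1 -> start=0, page 2 -> start=20, and so on
    params = {
        'start': 20 * (page - 1),
        'limit': 20,
        'sort': 'new_score',
        'status': 'P',
        'percent_type': '',
    }
    return 'https://movie.douban.com/subject/24773958/comments?' + urllib.parse.urlencode(params)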
Extracting the comment text with a regular expression
Each comment sits in a <p class=""> element, for example:

<p class="">
    结局简直丧出天际!灭霸竟然有内心戏!
    全程下来美队和钢铁侠也没见上一面,
    我还以为在世界末日前必然要重修旧好了!
</p>

so the pattern '<p class="">(.*?)</p>' captures the comment text:

import urllib.request
import re

def load(url):
    req = urllib.request.Request(url)
    res = urllib.request.urlopen(req)
    html = res.read()
    html = html.decode('utf-8')
    # re.S lets .*? match across line breaks, since a comment spans several lines
    pattern = re.compile('<p class="">(.*?)</p>', re.S)
    comments = pattern.findall(html)
    return comments
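The returned list can then be written to a plain-text file instead of keeping the raw HTML. A small sketch (comments.txt is an arbitrary filename, and the page is fetched with the load version above):

# Fetch one page and save the extracted comments, one per line
comments = load('https://movie.douban.com/subject/24773958/comments?start=0&limit=20&sort=new_score&status=P&percent_type=')
with open('comments.txt', 'w', encoding='utf-8') as f:
    for c in comments:
        f.write(c.strip() + '\n')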
Running from the command line
# When the script reads or prints Chinese text, add the encoding declaration on the next line
# -*- coding: utf-8 -*-
if __name__ == '__main__':
    url = input('Enter the URL: ')
    begin = int(input('Enter the start page: '))
    end = int(input('Enter the end page: '))
    spider(url, begin, end)
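For reference, the steps above can also be combined into a single routine that crawls the requested pages and appends every extracted comment to one file. This is only a sketch; spider_comments and all_comments.txt are illustrative names, not part of the original script:

import re
import urllib.request

def spider_comments(begin, end, out='all_comments.txt'):
    # Crawl pages begin..end of the short comments and append the
    # extracted comment text to a single file
    pattern = re.compile('<p class="">(.*?)</p>', re.S)
    with open(out, 'w', encoding='utf-8') as f:
        for i in range(begin, end + 1):
            start = 20 * (i - 1)
            url = ('https://movie.douban.com/subject/24773958/comments?start=' + str(start) +
                   '&limit=20&sort=new_score&status=P&percent_type=')
            html = urllib.request.urlopen(urllib.request.Request(url)).read().decode('utf-8')
            for comment in pattern.findall(html):
                f.write(comment.strip() + '\n')
            print('Page %d done' % i)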