Notes and additions on using urllib. When crawling, try to make your requests imitate a real browser as closely as possible; if you do not, the target server can identify you as a crawler and shut you out.
Basic usage of urllib.request

```python
import urllib.request

# urlopen(url, data=None, timeout=...) sends a request and
# returns a response object (http.client.HTTPResponse for HTTP URLs)
response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)               # HTTP status code
print(response.getheader("Server"))  # read a single response header by name
```
Making a GET request

```python
response = urllib.request.urlopen("http://www.baidu.com/get")
print(response.read().decode('utf-8'))
```
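Beyond `read()`, the response object exposes the status code, the final URL, and individual headers. A small sketch wrapped in a hypothetical helper named `describe` (the name is my own, not from the original):

```python
import urllib.request

def describe(url, timeout=5):
    """Print basic metadata about a response (hypothetical helper)."""
    response = urllib.request.urlopen(url, timeout=timeout)
    print(response.status)                     # e.g. 200
    print(response.geturl())                   # final URL after any redirects
    print(response.getheader("Content-Type"))  # one header, looked up by name
    return response.status

# describe("http://httpbin.org/get")
```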
Making a POST request (the response is read the same way as for GET)

```python
import urllib.parse

# form data must be urlencoded and then converted to bytes
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode('utf-8'))
```
Aborting a request to a page that cannot be opened (timeout handling; 0.01 s is far too short for real use and is only an illustration). This is useful when crawling many pages.

```python
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    print("time out!")
```
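Since the text notes that timeout handling is useful when crawling multiple pages, the pattern can be sketched as a helper that returns `None` for unreachable pages instead of crashing the whole loop. The function name `fetch` and the URL list are my own illustration, not from the original:

```python
import urllib.error
import urllib.request

def fetch(url, timeout=5):
    """Fetch one page and return its text, or None if it cannot be reached in time."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        return response.read().decode("utf-8")
    except urllib.error.URLError:
        return None

# hypothetical URL list; None entries mark pages that timed out or failed
urls = ["http://httpbin.org/get", "http://httpbin.org/status/200"]
# pages = [fetch(u, timeout=3) for u in urls]
```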
A complete browser-imitation request (using douban.com as an example)

```python
import urllib.parse
import urllib.request

url = "https://www.douban.com"
# copy a real browser's User-Agent so the site does not reject the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.57"}
data = bytes(urllib.parse.urlencode({'name': 'eric'}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
```
How to obtain the request headers
Press F12 in the browser to open the developer tools; the headers sent with each request can be found under the Network panel, as the figure below showed (figure not preserved):
I have only just started writing articles, so please bear with me, and do point out any mistakes.