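urllib is Python 3's package for working with URLs. If you have used Python 2, you will know that it shipped two separate modules, urllib and urllib2. You may have wondered why both existed: they are not interchangeable, and urllib2 is not an upgraded version of urllib, which is why Python 2 code typically used the two together.
Python 3's urllib package merges the functionality of Python 2's urllib and urllib2 modules.
The urllib package contains these modules: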
- urllib.request
- urllib.error
- urllib.parse
- urllib.robotparser
The official urllib.request documentation recommends the third-party Requests package for a higher-level HTTP client interface.
urllib.request
The urllib.request module is mainly used to open and download web pages. An example:

    # python3
    Python 3.5.1+ (default, Jun 10 2016, 09:03:40)
    [GCC 5.4.0 20160603] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    >>> import urllib
    >>> from urllib import request
    >>> response = request.urlopen('https://www.github.com')  # an HTTPResponse object
    >>> response.geturl()
    'https://github.com/'
    >>> header = response.info()
    >>> print(header)
    Server: GitHub.com
    Date: Sat, 02 Jul 2016 07:06:57 GMT
    Content-Type: text/html; charset=utf-8
    Transfer-Encoding: chunked
    Connection: close
    Status: 200 OK
    Cache-Control: no-cache
    Vary: X-PJAX
    X-UA-Compatible: IE=Edge,chrome=1
    Set-Cookie: _octo=GH1.1.425353898.1467443217; domain=.github.com; path=/; expires=Mon, 02 Jul 2018 07:06:57 -0000
    Set-Cookie: logged_in=no; domain=.github.com; path=/; expires=Wed, 02 Jul 2036 07:06:57 -0000; secure; HttpOnly
    Set-Cookie: _gh_sess=eyJzZXNzaW9uX2lkIjoiM2EzOWFjZDdhNDZiZTVlY2JkOGU5ZGM0YjhhYTlmZDciLCJfY3NyZl90b2tlbiI6IlV4ci9rT25NZTFTUUhVZlE5cm01b0Jrb0hNQmxSSkFjOXBoZndNOUJ6ams9In0%3D--d543ddc9f2d0e6519952ca10c54afd9cd362f77d; path=/; secure; HttpOnly
    X-Request-Id: 817aa77dded0174eaa69bf5eb787c4bc
    X-Runtime: 0.010360
    Content-Security-Policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; child-src render.githubusercontent.com; connect-src 'self' uploads.github.com status.github.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com wss://live.github.com; font-src assets-cdn.github.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: assets-cdn.github.com identicons.github.com www.google-analytics.com collector.githubapp.com *.gravatar.com *.wp.com *.githubusercontent.com; media-src 'none'; object-src assets-cdn.github.com; plugin-types application/x-shockwave-flash; script-src assets-cdn.github.com; style-src 'unsafe-inline' assets-cdn.github.com
    Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
    Public-Key-Pins: max-age=5184000; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="RRM1dGqnDFsCJXBTHky16vi1obOlCgFFn/yOhI/y+ho="; pin-sha256="k2v657xBsOVe1PQRwOsHsw3bsGT2VzIqz5K+59sNQws="; pin-sha256="K87oWBWM9UZfyddvDfoxL+8lpNyoUB2ptGtn0fv6G2Q="; pin-sha256="IQBnNBEiFuhj+8x6X8XLgh01V9Ic5/V3IRQLNFFc7v4="; pin-sha256="iie1VXtL7HzAMF+/PVPR9xzT80kQxdZeJ+zduCB3uj0="; pin-sha256="LvRiGEjRqfzurezaWuj8Wie2gyHMrW5Q06LspMnox7A="; includeSubDomains
    X-Content-Type-Options: nosniff
    X-Frame-Options: deny
    X-XSS-Protection: 1; mode=block
    Vary: Accept-Encoding
    X-Served-By: 9310b4b2914df40821f404edb55d7eb6
    X-GitHub-Request-Id: 1BBD4480:65A9:5AE1533:57776810
    >>> response.getcode()
    200
    >>> html_data = response.read()
    >>> print(html_data)
    b'\n\n\n<!DOCTYPE html>\n<html lang="en" class="">\n <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# object: http://ogp.me/ns/object# article: http://ogp.me/ns/article# profile: http://ogp.me/ns/profile#">\n.....................................

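A request like this can also fail, and the exceptions live in the urllib.error module listed earlier. The following is a minimal sketch of how they might be handled; the helper name fetch and the 10-second timeout are assumptions made for illustration, not part of the original example.

```python
from urllib import error, request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', exc.code, exc.reason)
    except error.URLError as exc:
        # No usable response: DNS failure, refused connection, missing file...
        print('Failed to reach the resource:', exc.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the more general handler would swallow it.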
urlopen sends a GET request by default; to send a POST request, pass the data argument.
Downloading a file, for example an image from this blog:
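As a rough illustration of the data argument, the sketch below builds a POST request without sending it. The endpoint https://example.com/login and the form fields are made up for the example.

```python
from urllib import parse, request

# Hypothetical endpoint and form fields, used only for illustration.
url = 'https://example.com/login'
form = {'user': 'alice', 'lang': 'python 3'}

# urlencode returns a str, but data must be bytes, so encode it explicitly.
data = parse.urlencode(form).encode('ascii')

req = request.Request(url, data=data)
print(req.get_method())  # POST, because data is not None
print(data)              # b'user=alice&lang=python+3'

# Nothing has been sent yet. To submit the form:
# with request.urlopen(req) as response:
#     body = response.read()
```

Passing the same bytes directly as request.urlopen(url, data=data) has the same effect; building a Request first just makes it possible to inspect the request before it goes out.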
#1

    from urllib import request

    url = 'http://blog.topspeedsnail.com/wp-content/uploads/2016/06/S60625-093856.jpg'
    response = request.urlopen(url)
    data = response.read()  # the whole image, as bytes

    # Write in binary mode so the bytes are stored unchanged.
    with open('pic.jpg', 'wb') as f:
        f.write(data)

#2 Alternatively, use the urlretrieve function

    from urllib import request

    url = 'http://blog.topspeedsnail.com/wp-content/uploads/2016/06/S60625-093856.jpg'
    # With no filename, urlretrieve saves to a temporary file and returns its path.
    myfile, header = request.urlretrieve(url)

    # Copy the temporary file to its final name.
    with open('pic.jpg', 'wb') as f:
        with open(myfile, 'rb') as tmp:
            f.write(tmp.read())

Or:

    from urllib import request

    url = 'http://blog.topspeedsnail.com/wp-content/uploads/2016/06/S60625-093856.jpg'
    # Pass a filename to save directly to that path.
    request.urlretrieve(url, 'pic.jpg')

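Calling response.read() loads the whole body into memory, which is fine for a small image but wasteful for large files. Here is a minimal sketch of a chunked download; the helper name download and the 64 KiB chunk size are arbitrary choices for illustration.

```python
import shutil
from urllib import request


def download(url, filename, chunk_size=64 * 1024):
    """Stream url to filename in chunks; return the number of bytes written."""
    with request.urlopen(url) as response, open(filename, 'wb') as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()


# Usage, with the image URL from the examples above:
# download('http://blog.topspeedsnail.com/wp-content/uploads/2016/06/S60625-093856.jpg', 'pic.jpg')
```

The official documentation lists urlretrieve among the legacy functions that may be deprecated at some point, so a helper along these lines is one possible substitute.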
Setting the User-Agent:

    from urllib import request

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/601.6.17'
    url = 'https://www.github.com/'
    headers = {'User-Agent': user_agent}

    # Wrap the URL and headers in a Request, then open it as usual.
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)

    with open('github.html', 'wb') as f:
        f.write(response.read())

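To check which headers a Request will carry before sending it, the object can be inspected directly. This is a small sketch that makes no network call; the shortened User-Agent string is a placeholder.

```python
from urllib import request

# Shortened, illustrative User-Agent string.
req = request.Request('https://www.github.com/',
                      headers={'User-Agent': 'Mozilla/5.0 (demo)'})

# Request normalizes header names with str.capitalize(), so the stored
# key is 'User-agent' rather than 'User-Agent'.
print(req.header_items())            # [('User-agent', 'Mozilla/5.0 (demo)')]
print(req.get_header('User-agent'))  # Mozilla/5.0 (demo)
print(req.get_header('User-Agent'))  # None: the lookup is case-sensitive

# Headers can also be added after the Request has been created.
req.add_header('Accept-Language', 'en')
```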
urllib.parse
urllib.parse parses URL strings; use it to split a URL into its components or to put one back together. A small example:

    >>> import urllib
    >>> from urllib import parse
    >>> res = parse.urlparse('https://www.google.com/search?q=Python&newwindow=1&biw=1380&bih=714&noj=1&tbas=0&source=lnt&sa=X&ved=0ahUKEwj4_uGb8tPNAhUDT48KHQtbDSw4HhCnBQgV')
    >>> print(res)
    ParseResult(scheme='https', netloc='www.google.com', path='/search', params='', query='q=Python&newwindow=1&biw=1380&bih=714&noj=1&tbas=0&source=lnt&sa=X&ved=0ahUKEwj4_uGb8tPNAhUDT48KHQtbDSw4HhCnBQgV', fragment='')
    >>> res.netloc
    'www.google.com'
    >>> res.geturl()

The URL above is a Google search for the keyword Python.
To search for the keyword "abc" instead:
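The query component of a parsed URL is still one string. As a sketch of how it might be broken down further, parse_qs and parse_qsl turn it into Python data structures; the URL below is a shortened version of the one above.

```python
from urllib import parse

# A shortened version of the search URL shown above.
url = 'https://www.google.com/search?q=Python&newwindow=1&biw=1380'
res = parse.urlparse(url)

# parse_qs maps each key to a list, because a key may appear more than once.
params = parse.parse_qs(res.query)
print(params)  # {'q': ['Python'], 'newwindow': ['1'], 'biw': ['1380']}

# parse_qsl keeps the original order as a list of (key, value) pairs.
pairs = parse.parse_qsl(res.query)
print(pairs)   # [('q', 'Python'), ('newwindow', '1'), ('biw', '1380')]
```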

    >>> import urllib
    >>> from urllib import request
    >>> from urllib import parse
    >>> data = parse.urlencode({'q': 'abc'})
    >>> print(data)
    q=abc
    >>> url = 'https://www.google.com/search/'
    >>> full_url = url + '?' + data
    >>> print(full_url)
    https://www.google.com/search/?q=abc

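urlencode matters most when values contain characters that are not safe in a URL. The sketch below, using made-up search terms, shows the escaping and one way to swap a new query into an existing URL without joining strings by hand.

```python
from urllib import parse

# urlencode percent-encodes unsafe characters; spaces become '+'.
query = parse.urlencode({'q': 'python urllib', 'tag': 'a&b'})
print(query)     # q=python+urllib&tag=a%26b

# ParseResult is a named tuple, so _replace returns a copy with a new query.
base = parse.urlparse('https://www.google.com/search?q=Python')
full_url = base._replace(query=query).geturl()
print(full_url)  # https://www.google.com/search?q=python+urllib&tag=a%26b
```

Without the escaping, the literal & inside 'a&b' would be read as a separator between two parameters.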
urllib.robotparser
The robotparser module handles just one kind of file, robots.txt. An example:

    >>> import urllib
    >>> from urllib import robotparser
    >>> robot = robotparser.RobotFileParser()
    >>> robot.set_url('http://blog.topspeedsnail.com/robots.txt')
    >>> robot.read()
    >>> robot.can_fetch('*', 'http://blog.topspeedsnail.com/')
    True
    >>> robot.can_fetch('*', 'http://blog.topspeedsnail.com/wp-admin/')
    False

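The same checks can be tried without any network access by feeding rules straight to the parser. In this sketch the rules are hypothetical, chosen to reproduce the results above; they are not the blog's actual robots.txt.

```python
from urllib import robotparser

# Hypothetical rules, written inline so the example needs no network access.
rules = """\
User-agent: *
Disallow: /wp-admin/
"""

robot = robotparser.RobotFileParser()
# parse() takes the file's lines directly, instead of set_url() plus read().
robot.parse(rules.splitlines())

print(robot.can_fetch('*', 'http://blog.topspeedsnail.com/'))           # True
print(robot.can_fetch('*', 'http://blog.topspeedsnail.com/wp-admin/'))  # False
```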