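HOWTO Fetch Internet Resources Using The urllib Package

Author: Michael Foord

Introduction

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL; for example, "ftp" is the URL scheme of "ftp://python.club.tw/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.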
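As a small illustration of what "URL scheme" means, the sketch below uses urllib.parse.urlsplit to pull the scheme out of a few URLs. The helper name url_scheme and the file: URL are assumptions made for this example only.

```python
from urllib.parse import urlsplit


def url_scheme(url):
    """Return the scheme of a URL: the string before the first ':'."""
    return urlsplit(url).scheme


for url in ("ftp://python.club.tw/",
            "https://python.club.tw/",
            "file:///tmp/example.txt"):
    print(url, "->", url_scheme(url))
```

No network access is involved here; the URLs are only parsed, never opened.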

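For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and is not intended to be easy to read. This HOWTO aims to illustrate using urllib, with enough detail about HTTP to help you through. It is not intended to replace the urllib.request docs, but is supplementary to them.

Fetching URLs

The simplest way to use urllib.request is as follows: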

import urllib.request
with urllib.request.urlopen('https://python.club.tw/') as response:
   html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary location, you can do so via the shutil.copyfileobj() and tempfile.NamedTemporaryFile() functions:

import shutil
import tempfile
import urllib.request

with urllib.request.urlopen('https://python.club.tw/') as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(response, tmp_file)

with open(tmp_file.name) as html:
    pass

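Many uses of urllib will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses: the client makes requests and servers send responses. urllib.request mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response: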

import urllib.request

req = urllib.request.Request('https://python.club.tw/')
with urllib.request.urlopen(req) as response:
   the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

req = urllib.request.Request('ftp://example.com/')

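In the case of HTTP, there are two extra things that Request objects allow you to do: first, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server; this information is sent as HTTP "headers". Let's look at each of these in turn.

Data

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib.parse library.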

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.parse.urlencode(values)
data = data.encode('ascii') # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
   the_page = response.read()

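Note that other encodings are sometimes required (e.g. for file upload from HTML forms; see HTML Specification, Form Submission for more details).

If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows: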

>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values)  # The order may differ from below.  
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
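The difference between the two ways of sending data can be checked without contacting a server. The following sketch builds one Request of each kind and asks it which HTTP method it would use; the variable names are illustrative, and the URL is the placeholder already used above.

```python
import urllib.parse
import urllib.request

base_url = "http://www.example.com/example.cgi"
values = {"name": "Somebody Here", "language": "Python"}
encoded = urllib.parse.urlencode(values)

# GET: the encoded values travel in the URL after '?'.
get_req = urllib.request.Request(base_url + "?" + encoded)

# POST: the same values travel in the request body, as bytes.
post_req = urllib.request.Request(base_url, data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)

# parse_qs reverses urlencode, decoding '+' back into a space.
decoded = urllib.parse.parse_qs(encoded)
print(decoded)
```

Nothing is sent until the Request is passed to urlopen, so this is a convenient way to inspect a request before making it.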

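Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites [1] dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4].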

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python' }
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
   the_page = response.read()
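Headers can also be added after the Request has been created, and inspected before anything is sent. The sketch below assumes a made-up client name, ExampleClient/1.0, and reuses the placeholder URL from above.

```python
import urllib.request

# 'ExampleClient/1.0' and the URL are placeholders for illustration.
req = urllib.request.Request(
    "http://www.someserver.com/cgi-bin/register.cgi",
    headers={"User-Agent": "ExampleClient/1.0"},
)
# Headers can also be added after the Request has been created.
req.add_header("Accept", "text/html")

# Request normalises names with str.capitalize(), so 'User-Agent'
# is stored as 'User-agent'.
for name, value in req.header_items():
    print(name, ":", value)
```

Because of that normalisation, look headers up with the capitalised form, for example req.get_header("User-agent").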

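The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.

Handling Exceptions

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

The exception classes are exported from the urllib.error module.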
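The relationship between these exceptions can be confirmed interactively. The sketch below also shows one way a built-in exception can appear: a string with no URL scheme (the value "no-scheme-here" is made up for this example) is rejected with ValueError before any network access.

```python
import urllib.error
import urllib.request

# HTTPError derives from URLError, which in turn derives from OSError.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(issubclass(urllib.error.URLError, OSError))

# A string without a URL scheme is rejected with a built-in ValueError
# before any network access is attempted.
caught = None
try:
    urllib.request.Request("no-scheme-here")
except ValueError as exc:
    caught = exc
    print("ValueError:", exc)
```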

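URLError

Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.

For example: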

>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
... except urllib.error.URLError as e:
...     print(e.reason)      
...
(4, 'getaddrinfo failed')

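HTTPError

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

http.server.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience: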

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
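Rather than copying the table, a program can look codes up in the live dictionary. This is a minimal sketch; describe_status is a hypothetical helper, and the dictionary shipped with a given Python version may contain more codes than the listing above.

```python
from http.server import BaseHTTPRequestHandler


def describe_status(code):
    """Return 'code short-message' for a known HTTP status code."""
    short_message, _long_message = BaseHTTPRequestHandler.responses[code]
    return f"{code} {short_message}"


for code in (200, 404, 503):
    print(describe_status(code))
```

An unknown code raises KeyError, so callers that handle arbitrary servers may prefer responses.get(code).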

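When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has the read, geturl, and info methods as returned by the urllib.response module: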

>>> req = urllib.request.Request('https://python.club.tw/fish.html')
>>> try:
...     urllib.request.urlopen(req)
... except urllib.error.HTTPError as e:
...     print(e.code)
...     print(e.read())  
...
404
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
  ...
  <title>Page Not Found</title>\n
  ...

Wrapping it Up

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.

Number 1

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass

Note

The except HTTPError clause must come first, otherwise except URLError will also catch an HTTPError.
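The reason for that ordering is that every HTTPError is also a URLError. The sketch below demonstrates this with hand-built exception objects standing in for what urlopen would raise; classify is a hypothetical helper, and the URL and messages are made up.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def classify(exc):
    """Mirror the except-clause order: test for HTTPError first."""
    if isinstance(exc, HTTPError):
        return f"server error {exc.code}"
    if isinstance(exc, URLError):
        return f"unreachable: {exc.reason}"
    return "other"


# Hand-built exceptions standing in for what urlopen would raise.
http_error = HTTPError("http://www.example.com/missing", 404, "Not Found",
                       Message(), io.BytesIO(b"error page"))
url_error = URLError("getaddrinfo failed")

print(classify(http_error))
print(classify(url_error))
# True: this is why the more specific check has to come first.
print(isinstance(http_error, URLError))
```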

Number 2

from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError also exposes a 'reason' attribute, so test for 'code' first.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    # everything is fine
    pass

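info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl(), and is defined in the urllib.response module.

geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an http.client.HTTPMessage instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.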
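Both methods can be tried without a web server by opening a data: URL, which carries its content inside the URL itself. This is only a sketch of the response interface: no redirect is involved, and for a data: URL the headers object is synthesised by urllib and may not be an http.client.HTTPMessage.

```python
import urllib.request

# A data: URL carries its own content, so no server is contacted.
url = "data:text/plain,hello"
with urllib.request.urlopen(url) as response:
    final_url = response.geturl()
    headers = response.info()
    body = response.read()

print(final_url)
print(headers["Content-type"], headers["Content-length"])
print(body)
```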

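Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib.request.OpenerDirector). Normally we have been using the default opener, via urlopen, but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.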
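The two ways of creating an opener can be compared side by side. The sketch below is illustrative only: it uses DataHandler and a data: URL so that it runs without a network connection, and HTTPCookieProcessor merely as an example of an extra handler.

```python
import urllib.request

# Approach 1: start from an empty OpenerDirector and add handlers by hand.
manual_opener = urllib.request.OpenerDirector()
manual_opener.add_handler(urllib.request.DataHandler())

# Approach 2: let build_opener supply its default handlers and add ours.
built_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())

# Both openers expose open(); neither has been installed globally.
with manual_opener.open("data:,hi") as response:
    manual_body = response.read()
with built_opener.open("data:,hi") as response:
    built_body = response.read()

print(manual_body, built_body)
print(len(manual_opener.handlers), "handler(s) versus", len(built_opener.handlers))
```

The hand-built opener knows only about data: URLs, whereas the one from build_opener also carries the default handlers for the usual schemes.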

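Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a "realm". The header looks like: WWW-Authenticate: SCHEME realm="REALM".

For example:

WWW-Authenticate: Basic realm="cPanel Users"

The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is "basic authentication". In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.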
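To make the retry step concrete, here is a sketch of the header a client sends for the Basic scheme, built by hand. The credentials and URL are hypothetical, and this is not how HTTPBasicAuthHandler is normally used; the handler produces the header for you.

```python
import base64
import urllib.request

# Hypothetical credentials and URL, for illustration only.
username, password = "joe", "secret"

# Basic authentication sends base64("username:password") in the
# Authorization header; this is an encoding, not encryption.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

req = urllib.request.Request("http://example.com/foo/")
req.add_header("Authorization", f"Basic {token}")

print(req.get_header("Authorization"))
```

Because the token is trivially reversible, Basic authentication is generally only appropriate over an encrypted connection.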

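The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.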

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

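Note

In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations: ProxyHandler (if a proxy setting such as an http_proxy environment variable is set), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, DataHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number), e.g. "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component; for example "joe:password@example.com" is not correct.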
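The matching of "deeper" URLs can be checked directly on the password manager, without any network access. In this sketch the credentials and the realm name "Any Realm" are made up for illustration.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 'joe' and 'secret' are made-up credentials for illustration.
password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

# A deeper URL under the registered prefix matches, whatever the realm.
deeper = password_mgr.find_user_password(
    "Any Realm", "http://example.com/foo/bar.html")
# A URL outside the prefix does not match.
outside = password_mgr.find_user_password(
    "Any Realm", "http://example.com/other/")

print(deeper)
print(outside)
```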

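Proxies

urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that's a good thing, but there are occasions when it may not be helpful [5]. One way to bypass proxy detection is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler: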

>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)
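Before disabling proxies it can help to see what urllib has detected. The following sketch prints the detected settings and then builds a proxy-free opener that is used directly rather than installed globally; the data: URL is only there so the example runs offline.

```python
import urllib.request

# Proxy settings urllib detected from the environment; may be empty.
detected = urllib.request.getproxies()
print("Detected proxies:", detected)

# An explicit empty mapping means "use no proxies at all".
no_proxy_handler = urllib.request.ProxyHandler({})
direct_opener = urllib.request.build_opener(no_proxy_handler)

# Use the opener directly instead of installing it globally.
with direct_opener.open("data:,ok") as response:
    body = response.read()
print(body)
```

Calling direct_opener.open leaves the default opener used by urlopen untouched, which keeps the change local to the code that needs it.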

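Note

Currently urllib.request does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe [6].

Note

HTTP_PROXY will be ignored if a variable REQUEST_METHOD is set; see the documentation on getproxies().

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib uses the http.client library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. urlopen accepts a timeout argument for an individual call; you can also set the default timeout globally for all sockets using: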

import socket
import urllib.request

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
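In current Python versions urlopen also takes a timeout argument, which limits a single call without changing global state. The sketch below assumes a hypothetical helper named fetch and uses a data: URL so that it runs without a network connection; it also shows how to restore the global default after changing it.

```python
import socket
import urllib.request


def fetch(url, seconds):
    """Fetch url, giving up if a blocking operation exceeds seconds."""
    with urllib.request.urlopen(url, timeout=seconds) as response:
        return response.read()


# A data: URL keeps this sketch runnable without a network connection.
body = fetch("data:,timeout-demo", 10)
print(body)

# The global default can be inspected, and restored after changing it.
previous = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
current = socket.getdefaulttimeout()
socket.setdefaulttimeout(previous)
print(current, previous)
```

A per-call timeout is usually the safer choice in larger programs, because socket.setdefaulttimeout affects every socket created afterwards, including those opened by other libraries.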

Footnotes

This document was reviewed and revised by John Lee.