
Effortless Data Collection: Using a Python Crawler to Automatically Gather Cookies, URLs, and Other Web Page Data

2023.03.21 Hunan

How to use a Python crawler and Selenium to collect cookies and URL data from web pages


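Sometimes you need specific information from web pages, and you need to collect a large volume of it in a short time. Searching for the data by hand every time is very inefficient: it takes a lot of time and labor, and the work is tedious. So how can this job be automated?

Python's libraries are well suited to the task, and the whole job can be handed over to Python.

With just a few lines of code, we can send an HTTP request and read the cookies the site sets.

We will also use Selenium to collect data from a website.

Starting with Cookies

To get cookies from a website, we can use Python's requests package.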

import requests

session = requests.Session()  # a Session stores cookies across requests
response = session.get('http://google.com')
print(session.cookies.get_dict())

This produces output like the following:

{'1P_JAR': '2023-03-15-10', 'AEC': 'ARSKqsKcTPjv1-XKnWKF53IUL7c9KaIfeMugVnut9UOnkVNzviBxoe9S-gA', 'NID': '511=OA7Bf8IvPENQvvH6pCLBPeKvB3-8omEfAii1a3DGoTAngBeOxm9LkMCm2iQOy921P0GPoMjZW4xAmKqrI-OIf3JLoVJX-j5RrFCfDWDteAtsZ_pqubflcqo71mnrM8vdDBZLkqj-rYBO2KfiSdl6n1pWgFiDVPkFY1fwaQQEBqI'}

The response headers also expose details such as each cookie's expiry time, domain, and path (see the Set-Cookie entry):

response.headers

{'Date': 'Wed, 15 Mar 2023 10:50:00 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '7634', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2023-03-15-10; expires=Fri, 14-Apr-2023 10:50:00 GMT; path=/; domain=.google.com; Secure, AEC=ARSKqsKcTPjv1-XKnWKF53IUL7c9KaIfeMugVnut9UOnkVNzviBxoe9S-gA; expires=Mon, 11-Sep-2023 10:50:00 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax, NID=511=OA7Bf8IvPENQvvH6pCLBPeKvB3-8omEfAii1a3DGoTAngBeOxm9LkMCm2iQOy921P0GPoMjZW4xAmKqrI-OIf3JLoVJX-j5RrFCfDWDteAtsZ_pqubflcqo71mnrM8vdDBZLkqj-rYBO2KfiSdl6n1pWgFiDVPkFY1fwaQQEBqI; expires=Thu, 14-Sep-2023 10:50:00 GMT; path=/; domain=.google.com; HttpOnly'}

You can explore further output by trying other attributes and methods of the response and session objects.
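As a rough illustration of how the expiry time, domain, and path can be read out of a Set-Cookie value, here is a minimal sketch using only the standard library's http.cookies module. The helper name parse_set_cookie is made up for this example, and it assumes it is given ONE cookie's Set-Cookie value. The comma-joined string that requests shows for several cookies would need to be split first, which this sketch does not attempt.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Parse ONE Set-Cookie header value into {name: {attribute: value}}.

    Hypothetical helper for illustration; not part of requests.
    """
    jar = SimpleCookie()
    jar.load(header_value)
    parsed = {}
    for name, morsel in jar.items():
        parsed[name] = {
            "value": morsel.value,
            "domain": morsel["domain"],
            "path": morsel["path"],
            "expires": morsel["expires"],
            "secure": bool(morsel["secure"]),  # flag attribute, True when present
        }
    return parsed


# The first cookie taken from a Set-Cookie header like the one Google returns
example = ("1P_JAR=2023-03-15-10; expires=Fri, 14-Apr-2023 10:50:00 GMT; "
           "path=/; domain=.google.com; Secure")
print(parse_set_cookie(example))
```

With requests itself, iterating over session.cookies gives cookie objects that carry the same attributes, so parsing the raw header is mainly useful when only the header text is available.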

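You can also use the browser-cookie3 package, which loads the cookies already stored by a web browser on your machine.

Install it with this command:

pip install browser-cookie3

Now we can fetch cookies with the following snippet.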

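import re
import browser_cookie3
import requests

url = 'https://bitbucket.org/'  # assumed; the original snippet leaves url undefined
def get_title(html):  # assumed helper: return the text of the page's <title> tag
    return re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()

cj = browser_cookie3.chrome(domain_name='www.bitbucket.com')  # cookies Chrome stores for this domain
r = requests.get(url, cookies=cj)
get_title(r.text)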

This gives the following result:

'richardpenman / home &mdash; Bitbucket'

You can also get more information about each cookie, such as its expiry time, value, and description.
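A minimal sketch of reading those per-cookie details follows. It assumes the jar behaves like a standard http.cookiejar.CookieJar, which is the type browser_cookie3 returns. The function names describe_cookies and make_demo_jar, and the cookie in the demo jar, are invented for this example so that it runs without a browser.

```python
from datetime import datetime, timezone
from http.cookiejar import Cookie, CookieJar


def describe_cookies(jar):
    """Return name, value, domain, path, secure flag, and expiry for each cookie."""
    rows = []
    for cookie in jar:
        expires = None
        if cookie.expires is not None:  # session cookies carry no expiry time
            expires = datetime.fromtimestamp(
                cookie.expires, tz=timezone.utc
            ).isoformat()
        rows.append({
            "name": cookie.name,
            "value": cookie.value,
            "domain": cookie.domain,
            "path": cookie.path,
            "secure": cookie.secure,
            "expires": expires,
        })
    return rows


def make_demo_jar():
    """Build a jar holding one made-up cookie (hypothetical data)."""
    jar = CookieJar()
    jar.set_cookie(Cookie(
        version=0, name="session_id", value="abc123", port=None,
        port_specified=False, domain=".example.com", domain_specified=True,
        domain_initial_dot=True, path="/", path_specified=True, secure=True,
        expires=1700000000, discard=False, comment=None, comment_url=None,
        rest={},
    ))
    return jar


for row in describe_cookies(make_demo_jar()):
    print(row)
```

In real use, the jar returned by browser_cookie3.chrome(...) could be passed to describe_cookies in place of the demo jar.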

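Collecting Data with Selenium

We can use Selenium to collect data from virtually any website. Selenium is a large open-source project made up of a range of tools and libraries that support browser automation.

Installing Selenium

pip3 install selenium

Now we will use Selenium to fetch data. First, import the following Python packages: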

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

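Here we use the Chrome browser to fetch the data. The program first launches the browser in the background (headless mode) and then opens the URL of the site we want to collect data from.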

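from selenium.webdriver.chrome.service import Service

driver_exe = 'chromedriver'  # path to the ChromeDriver executable
options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
options.add_argument('--start-maximized')  # open the browser in maximized mode
options.add_argument('--no-sandbox')  # bypass the OS security model
options.add_argument('--disable-dev-shm-usage')  # overcome limited resource problems
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=driver_exe), options=options)
driver.get("https://www.google.com")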

Now we can use Selenium's locator strategies to fetch data by tag name, class name, link text, and so on.

elements = driver.find_elements(By.TAG_NAME, 'a')

This returns every element with the tag 'a'.

From there you can run your own logic: compare the results with other data, or process them further just as you would any other Python data. Here is an example.

target_url = 'https://example.com/'  # placeholder: the URL you want to compare against

for element in elements:
    url_elem = element.get_attribute('href')  # get the link URL
    if url_elem == target_url:
        pass  # business logic goes here
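A strict == comparison like the one in the loop above misses links that differ only in letter case, a trailing slash, or a #fragment, and get_attribute('href') returns None for anchors without an href. The sketch below shows one possible way to make the comparison more tolerant using urllib.parse. The helper names normalize and find_matching_url, and the example URLs, are made up for illustration, and the normalization rules are an assumption you may want to adjust.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparable tuple.

    Lowercases scheme and host, drops the fragment, strips a trailing slash.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def find_matching_url(hrefs, target_url):
    """Return the first href equivalent to target_url, or None if none match."""
    wanted = normalize(target_url)
    for href in hrefs:
        if not href:  # skip anchors that have no href attribute
            continue
        if normalize(href) == wanted:
            return href
    return None


# Hypothetical href values, as they might come from element.get_attribute('href')
hrefs = [None, "https://example.com/about/", "https://EXAMPLE.com/contact#team"]
print(find_matching_url(hrefs, "https://example.com/contact"))
```

With Selenium, the hrefs list could be built as [e.get_attribute('href') for e in elements] before calling find_matching_url.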

Putting It All Together

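import logging

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

logging.basicConfig(level=logging.DEBUG)  # make the debug message below visible

target_url = 'https://example.com/'  # placeholder: the URL you want to compare against

driver_exe = 'chromedriver'  # path to the ChromeDriver executable
options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
options.add_argument('--start-maximized')  # open the browser in maximized mode
options.add_argument('--no-sandbox')  # bypass the OS security model
options.add_argument('--disable-dev-shm-usage')  # overcome limited resource problems
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=driver_exe), options=options)

driver.get("https://www.google.com")
elem = driver.find_elements(By.TAG_NAME, 'a')

for ele in elem:
    url_elem = ele.get_attribute('href')  # None when the anchor has no href
    if url_elem == target_url:
        logging.debug('URL is found')
        break

driver.quit()  # close the browser and end the driver session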

