[Data Analysis] Data Crawling Practice (1)

2024. 8. 14. 00:16 · Data Science

Setting up the environment in PyCharm

Installing selenium

pip install selenium
pip install --upgrade pip

Installing chromedriver on a Mac with Homebrew

brew install chromedriver

BeautifulSoup

1. Importing the libraries

# import the requests and BeautifulSoup libraries
import requests
from bs4 import BeautifulSoup

2. Reading robots.txt

(e.g.) Danawa site: https://danawa.com/robots.txt → append /robots.txt to the site URL

robots.txt : not legally binding, so even when it is updated, crawlers may not reflect the changes
For every User-agent, /user_report/ and /elec/Management are disallowed
HMSE_Robot is allowed nothing, and bingbot is given a crawl delay
Each of the user agents below is given a crawl delay of 1

User-agent: Mediapartners-Google
User-agent: Googlebot
User-agent: Googlebot-image
User-agent: NaverBot
User-agent: Yeti
User-agent: Daumoa
User-agent: Twitterbot
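
The rules described above can also be checked from code instead of by eye. The following is a minimal sketch using Python's standard urllib.robotparser. The SAMPLE_ROBOTS text is a hypothetical, shortened file written to mirror the rules summarized above; it is not the actual content of Danawa's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the rules summarized above
# (NOT the real Danawa file).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /user_report/
Disallow: /elec/Management

User-agent: HMSE_Robot
Disallow: /

User-agent: Googlebot
User-agent: Yeti
Crawl-delay: 1
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(SAMPLE_ROBOTS.splitlines())

# A generic crawler may fetch the front page but not the disallowed path.
generic_front_page = parser.can_fetch("*", "https://danawa.com/")
generic_user_report = parser.can_fetch("*", "https://danawa.com/user_report/")

# HMSE_Robot is disallowed everywhere.
hmse_front_page = parser.can_fetch("HMSE_Robot", "https://danawa.com/")

# Yeti belongs to the group that carries a crawl delay.
yeti_delay = parser.crawl_delay("Yeti")

print(generic_front_page, generic_user_report, hmse_front_page, yeti_delay)
```

To check a live site instead, set_url() followed by read() can replace the parse() call; that variant needs network access.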

3. How to check what kind of tag an element is

chrome - Developer Tools - Elements → click the element-picker icon, then hover over the tag you want

1. Without a header

# without a header
# fetch the page through the requests library
res = requests.get('https://danawa.com/')

# parse with BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')
print(soup)

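requests.get('...') : fetches the page through the requests library
soup = BeautifulSoup(res.content, 'html.parser')
- first argument: the .content of the variable returned by requests
- second argument: which parser to use → we are parsing HTML here, so we pass html.parser
print(soup) result → Error → the Danawa site has blocked the request
- a plain requests call parsed with BeautifulSoup looks far more robot-like than Selenium → it is easy to tell apart from a real user, so many sites block it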

# extract every element with a p tag
soup.select("p")

# extract every span element whose class is title (use "." for a class)
soup.select("span.title")

# select_one extracts only the first match
# get_text() returns only the text
soup.select_one("span.title").get_text()

# extract every a element whose class is text-elps2 (use "." for a class)
soup.select("a.text-elps2")

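select : extraction → the result comes back in list form
- so an item can be picked by index, the same way as with a list
- (ex) soup.select("span.title")[-1] : extracts the last item
select_one : extracts only the first match
get_text() : extracts only the text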
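
The selectors summarized above can be tried without touching any website. This is a small sketch on a hand-written HTML string; the markup and its text are made up for illustration and only reuse the class names from the examples above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div>
  <p>first paragraph</p>
  <span class="title">Title A</span>
  <span class="title">Title B</span>
  <a class="text-elps2" href="/item/1">Item link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like result, so len() and indexing work.
titles = soup.select("span.title")

# select_one() returns only the first match; get_text() strips the tags.
first_title = soup.select_one("span.title").get_text()

# Negative indexing picks the last match, as with any list.
last_title = soup.select("span.title")[-1].get_text()

# A list comprehension collects the text of every matching link.
link_texts = [a.get_text() for a in soup.select("a.text-elps2")]

print(len(titles), first_title, last_title, link_texts)
```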

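2. With a header

Passing a header so the request looks like a person rather than a bot can make extraction succeed in more cases than sending no header, but requests with BeautifulSoup is still easy to detect, so extraction often fails even with a header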

# import the requests and BeautifulSoup libraries
import requests
from bs4 import BeautifulSoup

# set the header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}

# fetch the page through the requests library (with the header)
res = requests.get('https://prod.danawa.com/list/?cate=112758&15main_11_02=', headers=headers)

# parse with BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

headers = { "User-Agent":"..."}
- setting a header → supplies the kind of information a real browser sends (ex) "I'm on Windows and I'm a Chrome user", which is what the string above claims
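
To see what the header actually changes, the outgoing request can be built without being sent. The sketch below uses Session.prepare_request from requests, so no network call is made; the variable names are chosen here for illustration.

```python
import requests

URL = "https://prod.danawa.com/list/?cate=112758&15main_11_02="

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/95.0.4638.69 Safari/537.36"
}

session = requests.Session()

# prepare_request() only builds the request; nothing is sent.
default_req = session.prepare_request(requests.Request("GET", URL))
custom_req = session.prepare_request(
    requests.Request("GET", URL, headers=headers)
)

# Without a header, requests identifies itself as "python-requests/<version>".
default_ua = default_req.headers["User-Agent"]
# With the header, the browser-like string replaces it.
custom_ua = custom_req.headers["User-Agent"]

print(default_ua)
print(custom_ua)
```

The default value names the library outright, which is one reason a header-less request is easy for a site to classify as a bot.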

# extract every a element whose class is prod_name
soup.select("a.prod_name")

Selenium

[PyCharm] ChromeDriver basic setup - loading the selenium driver

import selenium
print(selenium.__version__)

from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')  # use /tmp instead of /dev/shm for shared memory
chrome_options.add_argument('lang=ko_KR')  # Korean
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')

# load the selenium driver
driver = webdriver.Chrome(options=chrome_options)

# pass the link
driver.get("https://prod.danawa.com/list/?cate=112758&15main_11_02=")

# fetch elements with CSS_SELECTOR (find_element returns only one)
elements = driver.find_elements(By.CSS_SELECTOR, "a.prod_name")

# for every element in elements
for elem in elements:
    # print only the text of elem
    print(elem.text)

# quit the driver
driver.quit()

driver.get("https://...") : loads the page through the driver
elements = driver.find_elements(By.CSS_SELECTOR, "a.prod_name")
- fetches elements using CSS_SELECTOR
- find_elements returns every match
- find_element returns only one
- [Method 2] right-click - Copy - Copy selector : gives you the tag path already written in CSS_SELECTOR syntax

elements = driver.find_elements(By.CSS_SELECTOR, "#mainAdReader > ul > li:nth-child(1) > div > div.prod_info > p.prod_name > a")

.text : extracts only the text, which is then printed

# fetch elements with By.NAME (find_element returns only one)
elements = driver.find_elements(By.NAME, "productName")

By.NAME : fetches elements by their name attribute

# fetch every element with a p tag using CSS_SELECTOR (find_element returns only one)
elements = driver.find_elements(By.CSS_SELECTOR, "p")

# fetch elements with CSS_SELECTOR (find_element returns only one)
elements = driver.find_elements(By.CSS_SELECTOR, "a.text-elps2")

Selenium - combining with BeautifulSoup

When fetching a lot of data: BeautifulSoup is faster at parsing, so loading the page with Selenium and handing the parsing to BeautifulSoup can make the whole job somewhat faster

from selenium import webdriver
from bs4 import BeautifulSoup

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')  # use /tmp instead of /dev/shm for shared memory
chrome_options.add_argument('lang=ko_KR')  # Korean
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')

driver = webdriver.Chrome(options=chrome_options)

driver.get("https://prod.danawa.com/list/?cate=112758&15main_11_02=")

# get the page source
html_source = driver.page_source

# parse with BeautifulSoup
soup = BeautifulSoup(html_source, 'html.parser')

# find the elements we want
items = soup.select('a[name="productName"]')

for item in items:
    print(item.text)

driver.quit()
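
One way to organize the combination above is to keep the parsing step in its own function, so it can be checked without launching a browser. This is only a sketch: extract_product_names is a hypothetical helper name, and sample_html is made-up markup that reuses the name="productName" attribute from the selector above, not a copy of the real Danawa page.

```python
from bs4 import BeautifulSoup


def extract_product_names(html_source):
    """Return the text of every <a name="productName"> element (hypothetical helper)."""
    soup = BeautifulSoup(html_source, "html.parser")
    # get_text(strip=True) trims the whitespace around each name.
    return [a.get_text(strip=True) for a in soup.select('a[name="productName"]')]


# Made-up markup standing in for driver.page_source.
sample_html = """
<ul>
  <li><p class="prod_name"><a name="productName" href="#"> Sample product 1 </a></p></li>
  <li><p class="prod_name"><a name="productName" href="#">Sample product 2</a></p></li>
  <li><a href="#">unrelated link</a></li>
</ul>
"""

names = extract_product_names(sample_html)
print(names)
```

In the Selenium script, the same function would be called as extract_product_names(driver.page_source) before driver.quit().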

Attribution · Non-commercial · No derivatives
