How to crawl the web

김숭늉이 2024. 1. 28. 14:36


A summary of how to crawl

1. Installing the Python packages for crawling

pip install requests  
pip install beautifulsoup4

To crawl the web you first need two third-party Python packages, requests and BeautifulSoup, so install both.

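❓What if the package versions used at an actual company differ from the ones I have installed?
That is exactly why venv virtual environments exist: they let each project keep its own set of libraries.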
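
As a rough sketch of that idea, Python's standard-library venv module can create a per-project environment (the same thing is usually done from the command line with `python -m venv .venv`). The helper name `create_project_env` and the `.venv` directory name are assumptions chosen for illustration.

```python
import venv
from pathlib import Path


def create_project_env(project_dir, env_name=".venv", with_pip=True):
    """Create an isolated virtual environment inside project_dir.

    The helper name and the ".venv" default are illustrative choices.
    """
    env_path = Path(project_dir) / env_name
    # with_pip=True bootstraps pip into the new environment so that
    # packages can be installed there without touching other projects
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

After the environment exists, activate it (`source .venv/bin/activate` on macOS/Linux, `.venv\Scripts\activate` on Windows) and run the two pip install commands inside it.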

2. Basic requests setup (API style)

import requests

# Send a GET request to the target URL
url = 'put the URL here'
r = requests.get(url)

# Parse the response body as JSON and assign the result to a variable
rjson = r.json()

# rjson now holds the JSON data converted to Python data types

print(rjson)

#############################
# (write your own code here)
#############################
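
To make "converted to Python data types" concrete, here is a small sketch that performs the same kind of conversion with the standard-library json module. The JSON text is invented for illustration and does not come from a real API.

```python
import json

# Made-up JSON text standing in for a response body (not from a real API)
body = '{"city": "Seoul", "readings": [12, 30.5], "ok": true, "note": null}'

parsed = json.loads(body)

print(type(parsed).__name__)              # dict
print(type(parsed["readings"]).__name__)  # list
print(parsed["ok"], parsed["note"])       # True None
```

JSON objects become dict, arrays become list, true/false become True/False, and null becomes None, so the result can be indexed and looped over like any other Python data.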

2-1) Crawling a fine-dust check API page (example)

import requests  # the requests library must be installed

r = requests.get('http://spartacodingclub.shop/sparta_api/seoulair')
rjson = r.json()

gus = rjson['RealtimeCityAir']['row']

for gu in gus:
    if gu['IDEX_MVL'] < 60:  # print only the entries whose fine-dust value is below 60
        print(gu['MSRSTE_NM'], gu['IDEX_MVL'])
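
The filtering step can also be pulled into a small function so it can be tried without calling the API. This is only a sketch: the function name `districts_below` and the sample payload with its numbers are made up for illustration; only the key names come from the example above.

```python
def districts_below(payload, threshold=60):
    """Return (name, value) pairs for rows whose IDEX_MVL is below threshold."""
    rows = payload['RealtimeCityAir']['row']
    return [
        (row['MSRSTE_NM'], row['IDEX_MVL'])
        for row in rows
        if row['IDEX_MVL'] < threshold
    ]


# Invented sample shaped like the API response; the numbers are not real measurements
sample = {
    'RealtimeCityAir': {
        'row': [
            {'MSRSTE_NM': '중구', 'IDEX_MVL': 45},
            {'MSRSTE_NM': '종로구', 'IDEX_MVL': 60},
            {'MSRSTE_NM': '용산구', 'IDEX_MVL': 72},
        ]
    }
}

print(districts_below(sample))  # [('중구', 45)]
```

Note that a value of exactly 60 is excluded because the comparison is `<`; use `<=` if "60 or below" is what you want.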

3. Basic requests and bs4 setup (crawling a web page)

import requests
from bs4 import BeautifulSoup

# Fetch the target URL and receive its HTML
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('put the URL here', headers=headers)

# Use the BeautifulSoup library to turn the HTML into a form that is easy to search.
# The soup variable now holds the parsed, easy-to-query HTML.
# From here, write code to extract the parts you need.

soup = BeautifulSoup(data.text, 'html.parser')

#############################
# (write your own code here)
#############################
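
The two calls used most often on a soup object are select_one and select. The sketch below tries them on a small HTML string that is invented purely for the demonstration, so it runs without any network request.

```python
from bs4 import BeautifulSoup

# Invented HTML used only to demonstrate the calls
html = """
<ul id="menu">
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first matching tag, or None if nothing matches
first_link = soup.select_one('#menu > li.item > a')
print(first_link.text)     # First
print(first_link['href'])  # /a

# select returns a list of every matching tag
texts = [a.text for a in soup.select('#menu > li.item > a')]
print(texts)               # ['First', 'Second']

print(soup.select_one('#missing'))  # None
```

Because select_one can return None, check the result before reading `.text` or an attribute from it.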

3-1) Crawling news titles from Naver (example)

import requests
from bs4 import BeautifulSoup



headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=105', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

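# CSS selectors copied from the browser for the first two headlines:
#main_content > div > div._persist > div.section_headline > ul > li:nth-child(1) > div.sh_text > a
#main_content > div > div._persist > div.section_headline > ul > li:nth-child(2) > div.sh_text > a
# They differ only in li:nth-child(n), so select every li under the shared ul.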

newses = soup.select('#main_content > div > div._persist > div.section_headline > ul > li')

for news in newses:
    a_tag = news.select_one('div.sh_text > a')
    if a_tag is not None:
        news_title = a_tag.text
        print(news_title)  # print the title text, not the whole <a> tag
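
Selectors copied from a browser depend on the page's markup at the time they were copied, so it can help to wrap the extraction in a function and try it on a saved snippet first. A minimal sketch, assuming a hypothetical helper `extract_headlines` and an invented HTML fragment that only imitates the structure in the selector above (it is not Naver's real markup):

```python
from bs4 import BeautifulSoup

SELECTOR = '#main_content > div > div._persist > div.section_headline > ul > li'


def extract_headlines(html):
    """Return (title, link) pairs for each headline found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for item in soup.select(SELECTOR):
        a_tag = item.select_one('div.sh_text > a')
        if a_tag is None:
            continue  # this li has no headline link
        # get_text(strip=True) trims surrounding whitespace;
        # get('href') returns None instead of raising if the attribute is missing
        results.append((a_tag.get_text(strip=True), a_tag.get('href')))
    return results


# Invented fragment for illustration only
sample_html = """
<div id="main_content"><div><div class="_persist">
  <div class="section_headline"><ul>
    <li><div class="sh_text"><a href="https://example.com/1"> Title one </a></div></li>
    <li><div class="sh_thumb"></div></li>
  </ul></div>
</div></div></div>
"""

print(extract_headlines(sample_html))            # [('Title one', 'https://example.com/1')]
print(extract_headlines('<p>changed page</p>'))  # []
```

An empty list is a hint that the selector no longer matches the page, in which case copy a fresh selector from the browser's developer tools.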
