Scraping (Crawling) News Articles and Sending Them by Email

노루룽 2020. 10. 13. 21:36

Scraping (Crawling) Articles

Right-click the part you want to crawl > Inspect > Copy > Copy selector

Every website has a different HTML structure, so all you really need to know is how to get at the information you want 👌

 

from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver must be on PATH; recent Selenium 4 releases no longer
# accept the driver path as a positional argument
driver = webdriver.Chrome()

url = "https://search.naver.com/search.naver?where=news&sm=tab_jum&query=추석"

driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Each <li> under the news section is one article
articles = soup.select('#main_pack > div.news.mynews.section._prs_nws > ul > li')

for article in articles:
    anchor = article.select_one('dl > dt > a')
    title = anchor.text
    link = anchor['href']  # named link so the search url above is not overwritten
    # Keep the first token of the source text and drop the '언론사' ("press") label
    comp = article.select_one('span._sp_each_source').text.split(" ")[0].replace('언론사', '')
    print(title, link, comp)

driver.quit()

This crawls the Naver news results page for the search term '추석' (Chuseok).

The list of news articles is made up of ul > li elements, so I fetched them with select.

In most cases you just need to understand the HTML structure, work out the right selectors, and crawl from there.
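
To make the selector idea concrete, here is a minimal sketch that runs select and select_one against a small hand-written HTML snippet instead of a live page. The markup, the example.com links, and the extract_links helper are all made up for illustration; a real site needs its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a list of news results
SAMPLE_HTML = """
<div id="main_pack">
  <ul>
    <li><dl><dt><a href="https://example.com/a">First headline</a></dt></dl></li>
    <li><dl><dt><a href="https://example.com/b">Second headline</a></dt></dl></li>
    <li><dl><dt>No link in this item</dt></dl></li>
  </ul>
</div>
"""


def extract_links(html):
    """Return (title, href) pairs for every list item that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # select() returns a list of every element matching the CSS selector
    for item in soup.select("#main_pack > ul > li"):
        # select_one() returns the first match, or None when nothing matches
        anchor = item.select_one("dl > dt > a")
        if anchor is None:
            continue  # skip items without the expected structure
        results.append((anchor.text.strip(), anchor["href"]))
    return results


print(extract_links(SAMPLE_HTML))
```

Checking for None before reading .text keeps one oddly shaped list item from stopping the whole crawl with an AttributeError.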

For the publisher, I used split to drop the unneeded trailing text, and also removed the '언론사' ("press") label.
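
The publisher clean-up is plain string handling, so it can be tried without a browser. A minimal sketch, assuming the raw text is a name followed by the '언론사' label and then extra tokens; the sample strings and the clean_publisher helper name are illustrative, not copied from the actual page.

```python
def clean_publisher(raw):
    """Keep the first space-separated token and strip the '언론사' label."""
    first_token = raw.strip().split(" ")[0]
    return first_token.replace("언론사", "")


# Hypothetical raw strings shaped like the scraped source text
print(clean_publisher("연합뉴스언론사 선정 2시간 전"))
print(clean_publisher("KBS 1일 전"))
```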

 

Saving to an Excel File

openpyxl is a package that lets you read and write Excel files from Python.

Install the package via File > Settings > Python Interpreter > openpyxl > Install Package (the PyCharm menu path; running pip install openpyxl in a terminal does the same).

from openpyxl import Workbook

wb = Workbook()
ws1 = wb.active
ws1.title = "articles"  # sheet title
ws1.append(["Title", "Link", "Newspaper"])  # header row

wb.save(filename='articles.xlsx')

This creates an Excel file named articles.xlsx, and you can see the title, link, and newspaper headers added in the first row.
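
One way to confirm what was written, without opening Excel, is to load the file back with openpyxl and read its rows. The sketch below does this in a temporary directory so it leaves no file behind; the write_and_read helper and the sample rows are illustrative.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def write_and_read(rows, path):
    """Write rows to a new workbook at path, then load it and return its rows."""
    wb = Workbook()
    ws = wb.active
    ws.title = "articles"
    for row in rows:
        ws.append(row)  # each append adds one row below the last
    wb.save(path)

    loaded = load_workbook(path)
    sheet = loaded["articles"]
    # values_only=True yields plain cell values instead of Cell objects
    result = [list(r) for r in sheet.iter_rows(values_only=True)]
    loaded.close()
    return result


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "articles.xlsx")
    sample_rows = [
        ["Title", "Link", "Newspaper"],
        ["Sample headline", "https://example.com", "Sample Daily"],
    ]
    print(write_and_read(sample_rows, sample_path))
```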

⚠ Caution!! Always close the Excel file before running the Python program again!! Otherwise the save can fail because the file is locked.
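
If you would rather not lose a run when the file is still open, one option is to catch the error and save under a different name. This is only a sketch: it assumes the failure shows up as PermissionError (which is what Windows typically raises for a locked file), and save_with_fallback is a hypothetical helper, not part of openpyxl.

```python
import os


def save_with_fallback(save, path, max_tries=5):
    """Call save(path); on PermissionError retry with a numbered file name.

    `save` is any callable that takes a file path, e.g. openpyxl's wb.save.
    Returns the path that was actually written.
    """
    base, ext = os.path.splitext(path)
    candidate = path
    for attempt in range(1, max_tries + 1):
        try:
            save(candidate)
            return candidate
        except PermissionError:
            # The target is probably locked by another program; try a new name
            candidate = f"{base}_{attempt}{ext}"
    raise PermissionError(f"could not save {path} after {max_tries} tries")
```

With a workbook wb, it would be called as save_with_fallback(wb.save, 'articles.xlsx'), and the return value tells you which file name was used.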

 

Now, let's save the crawled news into an Excel file.

from bs4 import BeautifulSoup
from selenium import webdriver
from openpyxl import Workbook

# chromedriver must be on PATH; recent Selenium 4 releases no longer
# accept the driver path as a positional argument
driver = webdriver.Chrome()

url = "https://search.naver.com/search.naver?where=news&sm=tab_jum&query=추석"

driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

articles = soup.select('#main_pack > div.news.mynews.section._prs_nws > ul > li')

wb = Workbook()
ws1 = wb.active
ws1.title = "articles"  # sheet title
ws1.append(["Title", "Link", "Newspaper"])  # header row

for article in articles:
    anchor = article.select_one('dl > dt > a')
    title = anchor.text
    link = anchor['href']  # named link so the search url above is not overwritten
    # Keep the first token of the source text and drop the '언론사' ("press") label
    comp = article.select_one('span._sp_each_source').text.split(" ")[0].replace('언론사', '')

    ws1.append([title, link, comp])  # one row per article

driver.quit()
wb.save(filename='articles.xlsx')

 

Sending Email with Python (SMTP)

I'll use an SMTP server to send the crawled Excel file by email.

If you use Gmail (gmail.com), two settings had to be changed as of this writing.

 

1. Allow less secure app access

myaccount.google.com/lesssecureapps

 


It has to be switched to 'On'.

 

2. Turn off 2-Step Verification.

(I wasn't using it in the first place, so I skipped this step.)

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email import encoders

# Sender information
me = "sender's email address"
my_password = "email account password"

# Log in (SMTP_SSL connects on port 465 by default)
s = smtplib.SMTP_SSL('smtp.gmail.com')
s.login(me, my_password)

# Recipient information
you = "recipient's email address"

# Basic message headers; 'mixed' is the multipart type for a body plus attachments
msg = MIMEMultipart('mixed')
msg['Subject'] = "email subject"
msg['From'] = me
msg['To'] = you

# Write the message body
content = "email body text"
part2 = MIMEText(content, 'plain')
msg.attach(part2)

# Attach the Excel file as a base64-encoded binary part
part = MIMEBase('application', "octet-stream")
with open("articles.xlsx", 'rb') as file:
    part.set_payload(file.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', "attachment", filename="추석기사.xlsx")  # attachment file name
msg.attach(part)

# Send the mail and close the connection
s.sendmail(me, you, msg.as_string())
s.quit()

If you want to send the mail to several people,

make a list such as emails = ["mail1", "mail2"...] and loop over it with a for statement!
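
The for loop over recipients could look roughly like the sketch below. Note that it uses the standard library's email.message.EmailMessage rather than the MIME classes from the script above, simply because it is shorter; build_message and send_to_all are hypothetical helper names, and server is assumed to be an already logged-in smtplib.SMTP_SSL object.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, attachment, filename):
    """Build one message with a plain-text body and a binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    # Adding an attachment turns the message into multipart/mixed
    msg.add_attachment(
        attachment,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg


def send_to_all(server, sender, recipients, subject, body, attachment, filename):
    """Send a separate copy to each recipient; returns how many were sent."""
    sent = 0
    for recipient in recipients:
        msg = build_message(sender, recipient, subject, body, attachment, filename)
        server.send_message(msg)  # smtplib.SMTP and SMTP_SSL both provide this
        sent += 1
    return sent
```

To use it, read articles.xlsx once in 'rb' mode, log in as before, and call send_to_all(s, me, emails, ...) with the bytes you read. Building a fresh message per recipient means each person only sees their own address in the To header.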