[Daily Coding] 005 - Web scraping with Puppeteer (fetching information from the web) by wonsama

hive-101145 · @wonsama · Jun 2 '20

![](https://steemitimages.com/1280x0/https://cdn.steemitimages.com/DQme7wpJFnH24PeV6xEhbEEoQsFUBYCEixN1MR1KxBxam2q/dailycoding.jpg)

# Source

> [Basic Web Scraping Using JavaScript with Node.js + Puppeteer](https://www.freecodecamp.org/news/web-scraping-using-nodejs-puppeteer/)

In this post we learn how to scrape information from web pages by automating a browser with JavaScript. The tool we use for this is Puppeteer (the name means someone who works a puppet).

# Required tools (libraries)

* Node.js - https://nodejs.org/en/
* Puppeteer - https://github.com/puppeteer/puppeteer

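# Approach

> Web scraping generally breaks down into the following two steps.

* Fetch the data with an HTTP request
* Parse the HTML DOM structure and extract the important data

In this example we practice fetching a book's title and price from a website.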
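
The two steps above can be sketched without any third-party library. This is only an illustration and not the approach the rest of this post takes: `fetchHtml` and `extractTitleAndPrice` are hypothetical names, the global `fetch` assumes Node.js 18 or later, the sample markup is a simplified stand-in for the real page, and the regular expressions are tuned to that markup rather than being a general HTML parser.

```javascript
// Step 1: fetch the raw HTML over HTTP.
// Assumption: Node.js 18+, where `fetch` is available as a global.
async function fetchHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Step 2: pull the interesting values out of the markup.
// Regular expressions are brittle against real-world HTML; they are used
// here only to keep the sketch dependency-free.
function extractTitleAndPrice(html) {
  const title = /<h1[^>]*>([^<]*)<\/h1>/.exec(html);
  const price = /class="price_color"[^>]*>([^<]*)</.exec(html);
  return {
    title: title ? title[1].trim() : null,
    price: price ? price[1].trim() : null,
  };
}

// Simplified stand-in for the book page, so the example runs offline.
const sampleHtml =
  '<h1>A Light in the Attic</h1><p class="price_color">£51.77</p>';
console.log(extractTitleAndPrice(sampleHtml));
// To try it against a live page: fetchHtml(url).then(extractTitleAndPrice)
```

Puppeteer covers both steps at once: the browser performs the request and renders the page, and the DOM is queried directly instead of matching text.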

# Installation

> Assuming Node.js is already installed, install Puppeteer.

```
$ mkdir scraper
$ cd scraper
$ npm i puppeteer --save
```

# Preparation

> Here is a rough skeleton of the work.

* Load the library
* Define the scrape function
* Post-process the result once scraping finishes

```
const puppeteer = require('puppeteer');

let scrape = async () => {
  // The actual scraping starts here.
  // Return a value
};

scrape().then((value) => {
  console.log(value); // Success!
});
```

# Step 1

> Open a web browser, navigate to a specific page, and then read information from that page.

```
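let scrape = async () => {
  // Open a visible (non-headless) browser window and a new tab.
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  // Wait until the page has rendered. The original page.waitFor(1000) was
  // deprecated and removed in later Puppeteer releases.
  await page.waitForSelector('h1');
  // Scrape: step 2 replaces this placeholder with the real extraction.
  const result = null;
  await browser.close();
  return result;
};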
```

# Step 2 - Scraping

> After waiting for the page to finish rendering, inspect the DOM to locate the information you want, then extract the title and the price.

![](https://cdn.steemitimages.com/DQmV1a77cYkkHhNhpbeN9mrSA9HVE21pL8zCjxUsfRNjSmJ/image.png)

```
// The callback runs inside the page, so `document` is the page's DOM.
const result = await page.evaluate(() => {
  let title = document.querySelector('h1').innerText;
  let price = document.querySelector('.price_color').innerText;
  return {title, price};
});
```
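
The price comes back as display text such as `'£51.77'`, so any arithmetic needs a conversion first. A minimal sketch of that post-processing step follows; `parsePrice` is a hypothetical helper that is not part of Puppeteer, and it assumes a currency symbol followed by a plain decimal number with no thousands separators.

```javascript
// Hypothetical helper: split a price string like '£51.77' into its parts.
// Returns null when the text does not look like "<symbol><number>".
function parsePrice(text) {
  const match = /^\s*([^\d\s.,-]*)\s*(\d+(?:\.\d+)?)\s*$/.exec(text);
  if (!match) {
    return null;
  }
  return {currency: match[1], amount: Number(match[2])};
}

console.log(parsePrice('£51.77')); // { currency: '£', amount: 51.77 }
```

Doing the conversion in Node.js rather than inside `page.evaluate` keeps the in-page callback small and lets the helper be tested without a browser.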

# Final code

```
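const puppeteer = require('puppeteer');

let scrape = async () => {
  // Open a visible (non-headless) browser window and a new tab.
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  // Wait for the price element. The original page.waitFor(1000) was
  // deprecated and removed in later Puppeteer releases.
  await page.waitForSelector('.price_color');
  // The callback runs inside the page, so `document` is the page's DOM.
  const result = await page.evaluate(() => {
    let title = document.querySelector('h1').innerText;
    let price = document.querySelector('.price_color').innerText;
    return {title, price};
  });
  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});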
```
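
If `page.goto` or `page.evaluate` throws, the code above never reaches `browser.close()`, which can leave a browser process running. One possible way to guard against that is a small wrapper built on `try`/`finally`. The sketch below is an assumption of mine, not something from the source article: `withBrowser` is a hypothetical helper, and it takes the launch function as a parameter so the example can run with a stand-in object instead of a real browser.

```javascript
// Hypothetical helper, not part of Puppeteer.
// `launch` is any function resolving to an object with an async close(),
// for example () => puppeteer.launch({headless: false}).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs whether task succeeded or threw
  }
}

// Demonstration with a stand-in object, so no real browser is needed.
const fakeBrowser = {
  closed: false,
  async close() {
    this.closed = true;
  },
};

withBrowser(async () => fakeBrowser, async () => {
  throw new Error('navigation failed');
}).catch((error) => {
  // The error still reaches the caller, and the browser was closed first.
  console.log(error.message, fakeBrowser.closed);
});
```

With the real library, the body of `scrape` would move into the `task` callback, which receives the launched browser as its argument.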

# Running it

> Running the script (saved here as scrape.js) prints the book's title and price, as follows.

```
$ node scrape.js
{ title: 'A Light in the Attic', price: '£51.77' }
```

# References

* [List of free web scraping tools](https://www.scrapingdog.com/blog/top-10-best-data-scraping-tools-&-web-scraper)
* [The 10 Best web scraping proxy services](https://www.scrapingdog.com/blog/top-10-residential-proxy-providers-2020)
* [Puppeteer Documentation](https://github.com/GoogleChrome/puppeteer)
* [Scrapingdog Documentation](https://www.scrapingdog.com/documentation.html)
* [Guide to web scraping](https://www.scrapingdog.com/blog/ultimate-guide-to-web-scraping.html)

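# Closing remarks

> Taking information published on the web and reselling it may be illegal, so be very careful.

With web scraping you can collect many kinds of information and reprocess it into the derived information you want. For example, gathering numeric data from countries around the world and presenting it together in a chart would be very useful :)

I hope you enjoyed another great day of coding.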
