
How to Use BeautifulSoup 4 in Python (Python 3.x)

by 해피비(HappyB, Happy plan B) 2021. 5. 2.

Hello, this is 행부장.
This post covers how to use BeautifulSoup 4 in Python (Python 3.x).

Source: see the attachments below.

1. pip install bs4
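
After installing, a quick check like the one below can confirm that the package imports and parses. This is only a minimal sketch: the tiny HTML string is made up for illustration. As far as I know, the `bs4` package on PyPI is a thin wrapper that pulls in `beautifulsoup4`, so `pip install beautifulsoup4` should work as well.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version to confirm the import works
print(bs4.__version__)

# Parse a made-up one-line document with the built-in parser
soup = BeautifulSoup('<p>ok</p>', 'html.parser')
print(soup.p.string)
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.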

2. Using the functions
###################################################################
1) find('tag') - returns only the first matching element

from bs4 import BeautifulSoup
myHtml01 = '''
<html>
	<head>
		<title> My HTML </title>
	</head>
	<body>
		<H1> 안녕, 세상아! Hello, World!</H1>
		<p align="center"> center </p>
		<p align="right"> right </p>
		<p align="left"> left </p>		
	</body>
</html> '''
	
soup = BeautifulSoup(myHtml01 , 'html.parser')
soup.find('p')
>>> <p align="center"> center </p>
	
soup.find('p', align='center')
>>> <p align="center"> center </p>
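
Two things are worth knowing once find() has returned a tag: how to read its attributes, and what happens when nothing matches. The sketch below uses a shortened, made-up HTML string (not the myHtml01 above) and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<body>
    <p align="center"> center </p>
    <p align="right"> right </p>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

# Attributes of a tag can be read like dictionary entries
first_p = soup.find('p')
print(first_p['align'])        # center
print(first_p.get('class'))    # None, because the tag has no class attribute

# Keyword arguments filter on attribute values
right_p = soup.find('p', align='right')
print(right_p.get_text(strip=True))   # right

# find() returns None when there is no match, so check before using the result
missing = soup.find('table')
if missing is None:
    print('no <table> found')
```

Using `.get()` instead of `tag['name']` avoids a KeyError when the attribute may be absent.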

###################################################################	
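2) find_all('tag') - returns all matching elements as a list
   find_all('tag')[number] - returns the element at that index of the list
   find_all(['tag1', 'tag2']) - returns elements matching any of the given tags (OR)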
	
soup.find_all('p')
>>> [<p align="center"> center </p>,
<p align="right"> right </p>,
<p align="left"> left </p>]
	
soup.find_all('p')[0]
>>> <p align="center"> center </p>
	
soup.find_all('p')[2]
>>> <p align="left"> left </p>
	
soup.find_all(['p','h1'])
>>> [<h1> 안녕, 세상아! Hello, World!</h1>,
	<p align="center"> center </p>,
	<p align="right"> right </p>,
	<p align="left"> left </p>]
    
data=soup.find_all('p')

for i in data :
	print (i.string)

>>> center 
	right 
	left

for i in data :
	print (i.get_text())
	
>>> center 
	right 
	left	
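
In the loops above, .string and get_text() print the same thing because each <p> contains only text. They behave differently once a tag has child tags. The following is a small sketch with made-up HTML to show that difference, plus the limit argument of find_all().

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <p> plain text </p>
    <p> text with <b>bold</b> part </p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all('p')

for p in paragraphs:
    # .string is None when the tag has more than one child;
    # get_text() joins the text of all descendants instead
    print(repr(p.string), '|', repr(p.get_text(' ', strip=True)))

# limit stops the search after the given number of matches
only_first = soup.find_all('p', limit=1)
print(len(only_first))   # 1
```

For scraping, get_text() is usually the safer choice when the markup inside a tag is not known in advance.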

###################################################################	
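3) select('parent tag > 2nd-level tag > 3rd-level tag') - returns a list of the elements matching that path
   select('parent tag.class > 2nd-level tag.class') - returns a list of the elements matching that path
   select('#id') - returns a list of the element(s) with that id
   select('#id > tag.class') - returns a list of the matching child elements
   select('tag[attribute]') - returns a list of the tags that have that attribute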
   
myHtml02 = '''
<html> 
    <head> 
        <title> 잡지, 책 정보 </title>
    </head> 
    <body> 
        <h1> 잡지, 책 목록 </h1>
			<div>
				<p id='magazine1' class='magazines' title='잡지1'> 잡지1 
					<span class = 'price'> 7000원 </span> 
					<span class = 'count'> 10개 </span> 
					<span class = 'store'> 잡지1 </span>
					<a href="www.magazine.xxx">www.magazine1.xxx</a> 	
                </p>
            </div> 
			
            <div> 
				<p id='magazine2' class='magazines' title='잡지2'> 잡지2
					<span class = 'price'> 8500원 </span> 
					<span class = 'count'> 20개 </span> 
					<span class = 'store'> 잡지2</span>
					<a href="www.magazine.xxx">www.magazine2.xxx</a>
				</p> 
            </div>
			
            <div> 
			 <p id='book1' class='books' title='책1'> 책1
                <span class = 'price'> 13500원 </span> 
                <span class = 'count'> 30개 </span> 
                <span class = 'store'> 서점1</span>
				<a href="www.book.xxx">www.book1.xxx</a>
             </p> 
            </div>
			
			<div> 
			 <p id='book2' class='books' title='책2'> 책2
                <span class = 'price'> 18000원 </span> 
                <span class = 'count'> 10개 </span> 
                <span class = 'store'> 서점2</span>
				<a href="www.book.xxx">www.book2.xxx</a>
             </p> 
            </div> 
    </body> 
</html> '''   

soup02 = BeautifulSoup(myHtml02 , 'html.parser')

soup02.select('div > p > span') 
>>>[<span class="price"> 7000원 </span>,
 <span class="count"> 10개 </span>,
 <span class="store"> 잡지1 </span>,
 <span class="price"> 8500원 </span>,
 <span class="count"> 20개 </span>,
 <span class="store"> 잡지2</span>,
 <span class="price"> 13500원 </span>,
 <span class="count"> 30개 </span>,
 <span class="store"> 서점1</span>,
 <span class="price"> 18000원 </span>,
 <span class="count"> 10개 </span>,
 <span class="store"> 서점2</span>]

soup02.select('div > p > span')[3] 
>>><span class="price"> 8500원 </span>

soup02.select('#book1') 
>>>[<p class="books" id="book1" title="책1"> 책1
                 <span class="price"> 13500원 </span>
 <span class="count"> 30개 </span>
 <span class="store"> 서점1</span>
 <a href="www.book.xxx">www.book1.xxx</a>
 </p>]

soup02.select('#book1 > span.store')
>>>[<span class="store"> 서점1</span>]

soup02.select('a[href]')
>>>[<a href="www.magazine.xxx">www.magazine1.xxx</a>,
 <a href="www.magazine.xxx">www.magazine2.xxx</a>,
 <a href="www.book.xxx">www.book1.xxx</a>,
 <a href="www.book.xxx">www.book2.xxx</a>]

soup02.select('a[href]')[1]
>>><a href="www.magazine.xxx">www.magazine2.xxx</a>

soup02.select('#book2 > a[href]')
>>>[<a href="www.book.xxx">www.book2.xxx</a>]
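
The select() calls above return lists of tags. In practice the next step is usually to pull values out of them. Here is a rough sketch of one way to do that, using a trimmed, English-language variant of myHtml02 (the prices are written without the currency unit so they can be converted to int, which is my assumption, not part of the original data). select_one() is the counterpart of select() that returns only the first match, or None.

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <p id="book1" class="books" title="Book 1"> Book 1
        <span class="price"> 13500 </span>
        <a href="www.book.xxx">www.book1.xxx</a>
    </p>
</div>
<div>
    <p id="book2" class="books" title="Book 2"> Book 2
        <span class="price"> 18000 </span>
        <a href="www.book.xxx">www.book2.xxx</a>
    </p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first match (or None) instead of a list
first_price = soup.select_one('p.books > span.price')
print(first_price.get_text(strip=True))   # 13500

# Collect one dict per book from the selected <p> tags
rows = []
for p in soup.select('p.books'):
    rows.append({
        'id': p['id'],
        'title': p['title'],
        'price': int(p.select_one('span.price').get_text(strip=True)),
        'href': p.select_one('a[href]')['href'],
    })
print(rows)

# A selector with no match gives None from select_one and [] from select
print(soup.select_one('#nope'), soup.select('#nope'))
```

Calling select_one() on each <p> keeps the lookup scoped to that book, which avoids mixing up values when the page has many similar blocks.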

The Jupyter notebook file is attached as well.

 

MyWork.ipynb
0.01MB

 

FAA298E4-E91A-48A2-9FA7-977601145146.png
2.24MB

 

Thank you.

 
