Python Web Scraping Study Notes: A Detailed Guide to the BeautifulSoup4 Library

Date: 2024-04-30 21:23:25  Source: web

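Contents
Usage example
Common objects – Tag
Common objects – NavigableString
Common objects – BeautifulSoup
Common objects – Comment
Traversing the document tree
Tags containing multiple strings
.stripped_strings: removing whitespace
Searching the document tree – find and find_all
The select method (various lookups)
Getting content
Summary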

Usage example

from bs4 import BeautifulSoup

# Create a Beautiful Soup object, using lxml as the parser.
# `html` is a string holding the HTML source of the page to parse.
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

Output: [screenshot of the prettified HTML in the original post]
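The `html` variable used above is not defined anywhere in these notes. As a minimal self-contained sketch, the following assumes a small made-up document (`sample_html` is hypothetical, not the page the notes parse) and uses the standard library's `html.parser` backend instead of lxml so that it runs without extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, for illustration only.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello, <b>soup</b>!</p>
<a href="/about" class="link">About</a>
</body></html>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parse tree as an indented string.
pretty = soup.prettify()
print(pretty)
```

The choice of parser can change how malformed HTML gets repaired, so results on real pages may differ slightly between `html.parser` and `lxml`.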

Common objects – Tag

A Tag object corresponds to a single tag in the HTML document.

Building on the example above, add:

from bs4 import BeautifulSoup

# Create a Beautiful Soup object, using lxml as the parser.
soup = BeautifulSoup(html, "lxml")
# print(soup.prettify())

print(soup.title)    # None, because this document has no <title> tag
print(soup.head)     # None, because this document has no <head> tag
print(soup.a)        # <a class="fill-dec" href="//my.csdn.net" target="_blank">编辑自我介绍，让更多人了解你<span class="write-icon"></span></a>
print(type(soup.p))  # <class 'bs4.element.Tag'>
print(soup.p)

The output of print(soup.p): [screenshot in the original post]

Likewise, building on the code above, add:

print(soup.name)  # [document]: the soup object itself is special, and its name is [document]


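# For ordinary tags, name is the name of the tag itself.
# This needs a document that has a <head> tag; when soup.head is None
# (as in the example above), this line raises AttributeError.
print(soup.head.name)  # head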

print(soup.p.attrs)  # all attributes of the p tag, returned as a dict


print(soup.p['class'])  # the class attribute of the p tag; a list, because class can hold several values

# Attributes (and content) can be modified in place.
soup.p['class'] = "newClass"
print(soup.p)

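The Tag operations above can be tried end to end with the sketch below. It assumes a made-up one-paragraph document (`sample_html` is hypothetical) and uses `html.parser` rather than lxml so that it needs no extra install:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, for illustration only.
sample_html = (
    "<html><head><title>Demo</title></head>"
    '<body><p class="intro lead" id="first">Hello</p></body></html>'
)
soup = BeautifulSoup(sample_html, "html.parser")

p = soup.p                # first <p> tag in the document
print(type(p))            # <class 'bs4.element.Tag'>
print(p.name)             # p
print(p.attrs)            # dict of every attribute on the tag
print(p["class"])         # ['intro', 'lead'] (class is multi-valued)
print(p.get("href"))      # None: .get() is safe for missing attributes
print(soup.table)         # None: no <table> tag in this document

p["class"] = "newClass"   # attributes are assigned like dict entries
print(p)
```

Accessing a missing tag as an attribute (`soup.table`) gives `None` rather than an error, which is why chained access such as `soup.table.name` fails with `AttributeError`.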

Common objects – NavigableString

Building on the previous code, add:

print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

Output: [screenshot in the original post]
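A NavigableString is the text held inside a tag. The sketch below assumes a one-line snippet matching the output quoted above (the snippet itself is an assumption; the original markup is not shown in these notes) and uses `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed markup, chosen to match the output quoted in the notes.
soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", "html.parser")

text = soup.p.string                      # <p> has one child, so .string reaches its text
print(text)                               # The Dormouse's story
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True: NavigableString subclasses str

plain = str(text)                         # plain str with no link back to the tree

# .string is None when a tag has more than one child.
multi = BeautifulSoup("<p>one <b>two</b></p>", "html.parser")
print(multi.p.string)                     # None
print(multi.p.get_text())                 # one two
```

Converting to a plain `str` is worthwhile when the text is kept after the soup is discarded, since a NavigableString keeps a reference to the whole parse tree.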

Common objects – BeautifulSoup

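The BeautifulSoup object represents the parsed document as a whole. For most purposes it can be treated as a Tag object: it supports most of the methods for traversing and searching the document tree. Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no tag name and no attributes of its own. It is still sometimes useful to inspect its .name, so it is given the special name "[document]".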

print(soup.name)  # prints [document]
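A short sketch of what "can be treated as a Tag" means in practice, assuming a tiny made-up document and the `html.parser` backend:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, for illustration only.
soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

print(soup.name)              # [document]
print(isinstance(soup, Tag))  # True: BeautifulSoup is a subclass of Tag
print(soup.find("p").name)    # p: search methods work on the soup object directly
print(soup.p.name)            # p: so does attribute-style navigation
```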

Common objects – Comment

A Comment object holds the content of a comment in the document. It is a special type of NavigableString.

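# The comment inside <b> was lost from the original listing; this one follows
# the example in the Beautiful Soup documentation. With an empty <b></b>,
# soup.b.string would be None rather than a Comment.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
print(type(comment))  # <class 'bs4.element.Comment'>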
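Because Comment is a subclass of NavigableString, comment text can be mistaken for visible text when scraping. A minimal sketch of telling the two apart, assuming a made-up paragraph and the `html.parser` backend:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup, for illustration only.
markup = "<p><!-- internal note -->Visible text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])  # [' internal note ']

# Keep only real text nodes: check for Comment explicitly, since
# isinstance(node, NavigableString) is also True for comments.
visible = [
    str(node)
    for node in soup.p.children
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(visible)                     # ['Visible text']
```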