
When AI Meets Web Scraping: ScrapeGraphAI + LLM for Unprecedented Web Scraping Efficiency!

Posted by zonemu, 2 months ago (08-12), in Technical Articles


Original article by Aitrainee, AI进修生

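ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files. Just say which information you want to extract, and ScrapeGraphAI will do it for you!

In today's data-driven world, web scraping has become an essential tool for collecting information from the vast internet. However, traditional web scraping tools often struggle to keep up with the dynamic nature of websites, so developers have to maintain and update them constantly.

Enter ScrapeGraphAI, a Python library that combines large language models (LLMs) with direct graph logic to create flexible and adaptable web scraping pipelines.

ScrapeGraphAI is an open-source project designed to cope with today's constantly changing web. Here is what makes it stand out: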

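Direct graph logic: a graph-based approach is used to build the scraping pipeline dynamically, so that data is retrieved according to the prompt the user defines.

Versatile models and APIs: ScrapeGraphAI supports a range of models and APIs, including OpenAI's GPT, local models run through Docker, Groq, Azure, and others, so users can pick the option that best fits their scraping needs.

Flexibility and adaptability: traditional web scraping tools usually rely on fixed patterns or manual configuration to extract data from a page. ScrapeGraphAI is powered by LLMs and can adapt to changes in a website's structure, which reduces the need for ongoing developer intervention.

Easy installation: with a simple pip install command, users can set up ScrapeGraphAI quickly and start scraping data from websites, documents, and XML files.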
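To make the idea of a graph-based pipeline more concrete, here is a minimal sketch in plain Python. It is not ScrapeGraphAI's implementation: the node names (`fetch_node`, `parse_node`, `answer_node`) and the `run_pipeline` helper are hypothetical, and the "LLM" step is replaced by a simple keyword filter so that the example runs offline. It only illustrates how nodes can pass a shared state along a chain.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for a fetch step: the "page" is already provided as a string.
    return {**state, "document": state["source"]}


def parse_node(state: State) -> State:
    # Split the document into non-empty, stripped lines ("chunks").
    document = str(state["document"])
    chunks = [line.strip() for line in document.splitlines() if line.strip()]
    return {**state, "chunks": chunks}


def answer_node(state: State) -> State:
    # Stand-in for the LLM step: keep the chunks that mention the prompt keyword.
    keyword = str(state["prompt"]).lower()
    answer = [chunk for chunk in state["chunks"] if keyword in chunk.lower()]
    return {**state, "answer": answer}


def run_pipeline(nodes: List[Node], state: State) -> State:
    # A linear graph: each node receives the state produced by the previous one.
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline(
    [fetch_node, parse_node, answer_node],
    {"prompt": "project", "source": "Project A: pendulum\nAbout me\nProject B: scraper"},
)
print(final_state["answer"])
```

In a real pipeline, the last node would send the chunks and the prompt to a language model instead of filtering by keyword.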

ScrapeGraphAI: You Only Scrape Once

Quick install

The reference page for Scrapegraph-ai is available on PyPI, the official Python package index.

pip install scrapegraphai

You will also need to install Playwright for JavaScript-based scraping:

playwright install

Note: installing the library in a virtual environment is recommended, to avoid conflicts with other libraries.
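After installing, a quick sanity check can confirm that the interpreter is running inside a virtual environment and that both packages are visible to it. This is a minimal sketch using only the standard library. The helper names `installed_version` and `in_virtualenv` are made up for illustration, and the check assumes the distributions are named `scrapegraphai` and `playwright`, as in the pip commands.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    # Return the installed version of a distribution, or None if it is missing.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment, not the base install.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
for name in ("scrapegraphai", "playwright"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```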

Demo

Official Streamlit demo:

https://scrapegraph-ai-demo.streamlit.app/

Try it directly online with Google Colab:

https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing

Follow the steps at the link below to set up your OpenAI API key:

https://scrapegraph-ai.readthedocs.io/en/latest/index.html

Documentation

The documentation for ScrapeGraphAI can be found here:

https://scrapegraph-ai.readthedocs.io/en/latest/

Also check out the Docusaurus documentation:

https://scrapegraph-doc.onrender.com/

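Usage

You can use the SmartScraper class (imported as SmartScraperGraph in the examples below) to extract information from a website using a prompt.

The SmartScraper class is a direct graph implementation that uses the most common nodes of a web scraping pipeline. See the documentation for more information.

Case 1: Extracting information using Ollama

Remember to download the model separately on Ollama!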

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set the Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set the Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # a string containing already-downloaded HTML code can also be used
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
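Because the Ollama models have to be downloaded separately, it can help to check up front which models named in `graph_config` are missing locally. The sketch below rests on an assumption about Ollama's REST API that the article does not state: that `GET /api/tags` on the server returns JSON of the form `{"models": [{"name": "mistral:latest"}, ...]}`. The helper names are hypothetical, and the demo uses a hand-written sample payload instead of calling a live server.

```python
import json
from urllib.request import urlopen


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    # Assumed endpoint: lists the models available on the local Ollama server.
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def local_model_names(tags_payload: dict) -> set:
    # Strip the ":tag" suffix, e.g. "mistral:latest" becomes "mistral".
    return {m["name"].split(":")[0] for m in tags_payload.get("models", [])}


def missing_models(graph_config: dict, tags_payload: dict) -> list:
    # Collect the "ollama/..." models named in the config that are not pulled yet.
    wanted = []
    for section in ("llm", "embeddings"):
        model = graph_config.get(section, {}).get("model", "")
        if model.startswith("ollama/"):
            wanted.append(model.split("/", 1)[1])
    available = local_model_names(tags_payload)
    return [name for name in wanted if name.split(":")[0] not in available]


sample_config = {
    "llm": {"model": "ollama/mistral"},
    "embeddings": {"model": "ollama/nomic-embed-text"},
}
sample_payload = {"models": [{"name": "mistral:latest"}]}  # hand-written example
missing = missing_models(sample_config, sample_payload)
print(missing)
```

Against a running server, `missing_models(graph_config, fetch_tags())` would perform the same check with live data.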

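Case 2: Extracting information using Docker

Note: before using a local model, remember to create the docker container!

    docker-compose up -d
    docker exec -it ollama ollama pull stablelm-zephyr

You can use any model available on Ollama, or your own model, instead of stablelm-zephyr. Whichever model you pull, the "model" value in the configuration has to name it; the example below uses ollama/mistral, so that model must be pulled as well.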

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000,  # optionally set an arbitrary context length
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # a string containing already-downloaded HTML code can also be used
    source="https://perinim.github.io/projects",  
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 3: Extracting information using an OpenAI model

from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # a string containing already-downloaded HTML code can also be used
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
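Hardcoding the API key as in the example above is fine for a quick test, but reading it from the environment keeps it out of source control. The following is a minimal sketch: the helper `build_openai_config` and the environment variable name `OPENAI_API_KEY` are assumptions made for illustration, not something the library requires. The resulting dictionary has the same shape as the `graph_config` used above.

```python
import os
from typing import Mapping, Optional


def build_openai_config(
    model: str = "gpt-3.5-turbo",
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    # Read the key from the given mapping, or from the process environment.
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
    return {"llm": {"api_key": api_key, "model": model}}


# A fake environment is passed here so the example runs without a real key;
# in real use, call build_openai_config() with no arguments.
graph_config = build_openai_config(env={"OPENAI_API_KEY": "sk-example"})
print(graph_config["llm"]["model"])
```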

Case 4: Extracting information using Groq

import os

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description and the author.",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 5: Extracting information using Azure

import os

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

llm_model_instance = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, 
    event_end_date, event_end_time, location, event_mode, event_category, 
    third_party_redirect, no_of_days, 
    time_in_hours, hosted_or_attending, refreshments_type, 
    registration_available, registration_link""",
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 6: Extracting information using Gemini

from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": GOOGLE_APIKEY,
        "model": "gemini-pro",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

In each of the cases above, the output is a dictionary containing the extracted information, for example:

{
    'titles': [
        'Rotary Pendulum RL'
        ],
    'descriptions': [
        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
        ]
}
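A result shaped like the dictionary above can be turned into one record per item for further processing. This is only a sketch: the keys `titles` and `descriptions` are taken from the sample output, the actual keys depend on the prompt and on the model, and the helper `to_records` is hypothetical.

```python
import json

# Sample result copied from the example output above.
result = {
    "titles": ["Rotary Pendulum RL"],
    "descriptions": [
        "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms"
    ],
}


def to_records(result: dict) -> list:
    # Pair each title with its description; missing keys yield an empty list.
    titles = result.get("titles", [])
    descriptions = result.get("descriptions", [])
    return [{"title": t, "description": d} for t, d in zip(titles, descriptions)]


records = to_records(result)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Since `zip` stops at the shorter list, a title without a matching description is dropped silently, so it is worth comparing the list lengths when the model output looks incomplete.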

Reference links

Github: https://github.com/VinciGit00/Scrapegraph-ai?tab=readme-ov-file

Colab Notebook: https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing
