Sitemap
SitemapLoader extends WebBaseLoader. It loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a Document.
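To make the first step concrete, here is a minimal, standard-library sketch of pulling page URLs out of a sitemap. It is an illustration only and not SitemapLoader's own implementation: the inlined sitemap content, the `example.com` URLs, and the `extract_page_urls` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, inlined so the example needs no network access.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

# Namespace used by the sitemap protocol; element lookups must be qualified with it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_page_urls(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in the <loc> elements of a sitemap (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


page_urls = extract_page_urls(SITEMAP_XML)
print(page_urls)
```

Each URL found this way corresponds to one page that the loader would go on to scrape and return as a Document.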
The scraping is done concurrently. Concurrent requests are limited to a reasonable rate, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the scraped server, or you don't care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
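The general idea of capping concurrent requests can be sketched with `asyncio.Semaphore` from the standard library. This is a hedged illustration of the technique, not the loader's actual code: `fetch_page`, `fetch_all`, the `example.com` URLs, and the simulated delay are hypothetical, and the sketch treats the limit as a cap on in-flight requests rather than a strict per-second rate.

```python
import asyncio

MAX_CONCURRENT = 2  # assumption: mirrors the default limit of 2 described above


async def fetch_page(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Simulate fetching one page while holding a slot in the semaphore."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        stats["active"] -= 1
        return f"content of {url}"


async def fetch_all(urls: list[str], limit: int = MAX_CONCURRENT):
    """Fetch all URLs concurrently, never exceeding `limit` in-flight requests."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    pages = await asyncio.gather(*(fetch_page(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]


urls = [f"https://example.com/page/{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls))
print(len(pages), peak)
```

Raising `limit` lets more requests run at once, which finishes sooner but puts more load on the server, which is the trade-off described above. Note that `asyncio.run` is meant for a plain script; inside a notebook an event loop is already running.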
Overview
Integration details
| Class | Package | Local | Serializable | JS support | 
|---|---|---|---|---|
| SitemapLoader | langchain_community | ✅ | ❌ | ✅ |
Loader features
| Source | Document Lazy Loading | Native Async Support | 
|---|---|---|
| SitemapLoader | ✅ | ❌ |
Setup
To access the SitemapLoader document loader, you'll need to install the langchain-community integration package.
Credentials
No credentials are needed to run this.
If you want automated, best-in-class tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
Installation
Install the langchain-community package.
%pip install -qU langchain-community
Fix notebook asyncio bug
import nest_asyncio
nest_asyncio.apply()
Initialization
Now we can instantiate our loader object and load documents:
from langchain_community.document_loaders.sitemap import SitemapLoader
sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")