Web crawling
The toolkit provide an API for importing the content of a website. The SERMAS toolkit automatically scans the entire website and creates the emdeddings of the content of each page, then stores them in a vector database.
A website can be imported by adding the link to the app.yaml
file under the rag.websites
section, for example:
rag:
websites:
- url: {website base URL}
filterPaths: [] # list of sub paths to exclude
- ...
Then import the application adding the -iw
flag, for example:
sermas-cli app admin import -iw /apps/myapp
The crawler will scrap the pages found using the sitemap. The sitemap will be searched on these subpaths:
'/sitemap.xml',
'/sitemap_index.xml'
'/sitemapindex.xml'
'/sitemap.php'
'/sitemap.txt'