Web Crawl Node

Create a Web Crawl Node that allows Lamatic to ingest and store data from a website so it can be accessed by downstream nodes.

Specific requirements:
1) Accept a URL, multiple URLs, a sitemap.xml, or a URL-pattern regex as a parameter that directs the crawl (see the configuration sketch after this list).

  • Consider what would be needed to accept login credentials as a parameter so that content behind a login can be ingested (future).

2) Support storing important metadata alongside the visible content, for instance the hreflang tag, which indicates the language and geography of the target audience (see the metadata-capture sketch after this list).

  • Consider a default metadata capture config that can be optionally reduced or expanded (future).

3) Generate status and error messaging:

  • Accept parameters that configure status notifications: which notifications to send, and to whom (see the configuration sketch after this list).
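
A minimal sketch, assuming a TypeScript-style configuration object, of how the Node's parameters could be shaped: the crawl targets and future login credentials from (1) and the notification settings from (3). None of these names are an existing Lamatic API; they are illustrative only.

```typescript
// Hypothetical configuration shape for the Web Crawl Node.
// These names are assumptions, not an existing Lamatic API; they only
// illustrate the parameters described in requirements 1 and 3.

type CrawlTarget =
  | { kind: "url"; url: string }                            // a single page
  | { kind: "urls"; urls: string[] }                        // an explicit list of pages
  | { kind: "sitemap"; sitemapUrl: string }                 // every URL in a sitemap.xml
  | { kind: "pattern"; seedUrl: string; pattern: string };  // follow links matching a regex

interface NotificationConfig {
  // Which crawl events should trigger a notification.
  events: Array<"started" | "completed" | "failed" | "page_error">;
  // Where to send them (webhook and/or email recipients).
  webhookUrl?: string;
  emails?: string[];
}

interface WebCrawlNodeConfig {
  target: CrawlTarget;
  // Future: credentials for crawling content behind a login.
  auth?: { username: string; password: string };
  notifications?: NotificationConfig;
}

// Example usage:
const config: WebCrawlNodeConfig = {
  target: { kind: "sitemap", sitemapUrl: "https://example.com/sitemap.xml" },
  notifications: { events: ["completed", "failed"], emails: ["ops@example.com"] },
};
```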
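
And for (2), a sketch of how per-page metadata such as hreflang could be captured alongside the visible content. It assumes a parsed DOM is available (e.g. from a headless browser); the CrawledPage shape and the chosen default tags are assumptions, not Lamatic behavior.

```typescript
// Illustrative sketch: capture hreflang (and a few other common tags)
// alongside the visible text of a crawled page. CrawledPage is an
// assumed shape, not an existing Lamatic type.

interface CrawledPage {
  url: string;
  content: string;                              // visible text content
  metadata: Record<string, string | string[]>;
}

function extractMetadata(doc: Document, url: string): CrawledPage {
  const metadata: Record<string, string | string[]> = {};

  // hreflang alternates indicate the language/geography of the target audience.
  const hreflangs = Array.from(
    doc.querySelectorAll('link[rel="alternate"][hreflang]')
  ).map((link) => link.getAttribute("hreflang")!);
  if (hreflangs.length) metadata["hreflang"] = hreflangs;

  // A default capture set that could later be reduced or expanded per the config.
  const title = doc.querySelector("title")?.textContent;
  if (title) metadata["title"] = title;
  const description = doc
    .querySelector('meta[name="description"]')
    ?.getAttribute("content");
  if (description) metadata["description"] = description;
  const canonical = doc
    .querySelector('link[rel="canonical"]')
    ?.getAttribute("href");
  if (canonical) metadata["canonical"] = canonical;

  return { url, content: doc.body?.textContent?.trim() ?? "", metadata };
}
```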

Thoughts/Questions:

  • How is the Node told where to store the crawled content?

  • How should the visibility of this store be determined? In other words, is the store available only to this workflow, to other workflows within the Project, or to others within the Organization?

  • Should the Node support an option to simultaneously vectorize the content as part of the crawl (to reduce storage needs)?

  • Can/should we use the <lastmod> element in the sitemap.xml to trigger this Node (see the sketch after these questions)?

  • Can/should the Node capture Image, Audio, and/or Video content?

  • How should notifications & error handling work (regarding the status of the crawl)?
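
On the <lastmod> question, a rough sketch of how a trigger could compare each URL's <lastmod> in the sitemap against the time of the last crawl and re-crawl only what changed. The function name is hypothetical, and the regex-based parsing is a brevity shortcut; a real implementation would use an XML parser.

```typescript
// Sketch: re-crawl only URLs whose sitemap <lastmod> is newer than the
// last crawl. Uses a naive regex over the sitemap XML for brevity.

async function urlsModifiedSince(
  sitemapUrl: string,
  lastCrawl: Date
): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  const changed: string[] = [];

  // Match each <url> entry and pull out its <loc> and optional <lastmod>.
  for (const entry of xml.matchAll(/<url>([\s\S]*?)<\/url>/g)) {
    const loc = entry[1].match(/<loc>(.*?)<\/loc>/)?.[1];
    const lastmod = entry[1].match(/<lastmod>(.*?)<\/lastmod>/)?.[1];
    if (!loc) continue;
    // No <lastmod> means we can't tell; err on the side of re-crawling.
    if (!lastmod || new Date(lastmod) > lastCrawl) changed.push(loc);
  }
  return changed;
}
```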

Status: Planned
Board: 💡 Feature Requests
Date: About 2 months ago
Author: cwhiteman
