Web Crawl Node

Create a Web Crawl Node that allows Lamatic to ingest and store data from a website so it can be accessed by downstream nodes.

Specific requirements:
1) Accept a URL, multiple URLs, a sitemap.xml, or a URL-pattern regex as a parameter that directs the crawl (see the configuration sketch after this list).

  • Consider what would be needed to accept login credentials as a parameter so that content behind a login can be ingested (future).

2) Support storing important metadata alongside the visible content, for instance the hreflang tag, which indicates the language and geography of the target audience (see the metadata-capture sketch after this list).

  • Consider a default metadata capture config that can be optionally reduced or expanded (future).

3) Generate status and error messaging:

  • Accept parameters that configure status notifications: which notifications to send, and to whom (see the configuration sketch after this list).
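
A minimal sketch, assuming a TypeScript-style configuration object, of how the Node's parameters could be shaped: the crawl targets and future login credentials from (1) and the notification settings from (3). None of these names are an existing Lamatic API; they are illustrative only.

```typescript
// Hypothetical configuration shape for the Web Crawl Node.
// These names are assumptions, not an existing Lamatic API; they only
// illustrate the parameters described in requirements 1 and 3.

type CrawlTarget =
  | { kind: "url"; url: string }                            // a single page
  | { kind: "urls"; urls: string[] }                        // an explicit list of pages
  | { kind: "sitemap"; sitemapUrl: string }                 // every URL in a sitemap.xml
  | { kind: "pattern"; seedUrl: string; pattern: string };  // follow links matching a regex

interface NotificationConfig {
  // Which crawl events should trigger a notification.
  events: Array<"started" | "completed" | "failed" | "page_error">;
  // Where to send them (webhook and/or email recipients).
  webhookUrl?: string;
  emails?: string[];
}

interface WebCrawlNodeConfig {
  target: CrawlTarget;
  // Future: credentials for crawling content behind a login.
  auth?: { username: string; password: string };
  notifications?: NotificationConfig;
}

// Example usage:
const config: WebCrawlNodeConfig = {
  target: { kind: "sitemap", sitemapUrl: "https://example.com/sitemap.xml" },
  notifications: { events: ["completed", "failed"], emails: ["ops@example.com"] },
};
```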
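
And for (2), a sketch of how per-page metadata such as hreflang could be captured alongside the visible content. It assumes a parsed DOM is available (e.g. from a headless browser); the CrawledPage shape and the chosen default tags are assumptions, not Lamatic behavior.

```typescript
// Illustrative sketch: capture hreflang (and a few other common tags)
// alongside the visible text of a crawled page. CrawledPage is an
// assumed shape, not an existing Lamatic type.

interface CrawledPage {
  url: string;
  content: string;                              // visible text content
  metadata: Record<string, string | string[]>;
}

function extractMetadata(doc: Document, url: string): CrawledPage {
  const metadata: Record<string, string | string[]> = {};

  // hreflang alternates indicate the language/geography of the target audience.
  const hreflangs = Array.from(
    doc.querySelectorAll('link[rel="alternate"][hreflang]')
  ).map((link) => link.getAttribute("hreflang")!);
  if (hreflangs.length) metadata["hreflang"] = hreflangs;

  // A default capture set that could later be reduced or expanded per the config.
  const title = doc.querySelector("title")?.textContent;
  if (title) metadata["title"] = title;
  const description = doc
    .querySelector('meta[name="description"]')
    ?.getAttribute("content");
  if (description) metadata["description"] = description;
  const canonical = doc
    .querySelector('link[rel="canonical"]')
    ?.getAttribute("href");
  if (canonical) metadata["canonical"] = canonical;

  return { url, content: doc.body?.textContent?.trim() ?? "", metadata };
}
```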

Thoughts/Questions:

  • How is the Node told where to store the crawled content?

  • How should the visibility of this store be determined? In other words, is the store available only to this workflow, to other workflows within the Project, or to others within the Organization?

  • Should the Node support an option to simultaneously vectorize the content as part of the crawl (to reduce storage needs)?

  • Can/should we use the <lastmod> element in the sitemap.xml to trigger this Node (see the sketch after these questions)?

  • Can/should the Node capture Image, Audio, and/or Video content?

  • How should notifications & error handling work (regarding the status of the crawl)?
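
On the <lastmod> question, a rough sketch of how a trigger could compare each URL's <lastmod> in the sitemap against the time of the last crawl and re-crawl only what changed. The function name is hypothetical, and the regex-based parsing is a brevity shortcut; a real implementation would use an XML parser.

```typescript
// Sketch: re-crawl only URLs whose sitemap <lastmod> is newer than the
// last crawl. Uses a naive regex over the sitemap XML for brevity.

async function urlsModifiedSince(
  sitemapUrl: string,
  lastCrawl: Date
): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  const changed: string[] = [];

  // Match each <url> entry and pull out its <loc> and optional <lastmod>.
  for (const entry of xml.matchAll(/<url>([\s\S]*?)<\/url>/g)) {
    const loc = entry[1].match(/<loc>(.*?)<\/loc>/)?.[1];
    const lastmod = entry[1].match(/<lastmod>(.*?)<\/lastmod>/)?.[1];
    if (!loc) continue;
    // No <lastmod> means we can't tell; err on the side of re-crawling.
    if (!lastmod || new Date(lastmod) > lastCrawl) changed.push(loc);
  }
  return changed;
}
```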

Status: Planned
Board: 💡 Feature Requests
Date: About 2 months ago
Author: cwhiteman
