Overview
Data sources in CrawlDesk define the content (e.g., developer docs, PDFs, or cloud documents) that the AI crawls and indexes for search functionality. Each data source is tied to a specific project, and managing them is a critical step in setting up AI-powered search capabilities.
This guide provides an overview of data sources and their role in CrawlDesk. To manage data sources, open the CrawlDesk dashboard, select a project, and navigate to the "Data Sources" section. Ensure you have admin access to the project and at least one active data source for search to function.
How Data Sources Work with CrawlDesk
Data sources integrate seamlessly with CrawlDesk's AI ecosystem to transform your raw content into searchable, AI-ready knowledge. Here's a high-level explanation:
- Addition and Configuration: When you add a data source (e.g., a website URL or Google Drive folder) via the project dashboard, CrawlDesk validates the input and queues it for processing. This includes specifying details such as name, URL, and limits (e.g., max pages for websites).
- Crawling and Indexing: CrawlDesk's AI crawler fetches the content, extracts relevant text, structures it into sections and chunks, and indexes it for fast retrieval. The process is automated and monitored in real time, with progress visible in the dashboard (e.g., pages processed, sections created).
- Integration with AI Features: Once indexed, the data powers AI-driven tools such as search queries, chat widgets, and analytics. For example, the indexed chunks enable natural language responses in AI widgets, drawing directly from your documentation. Multiple data sources can be added and linked to features like "Ask AI" or "AI Copilot" for comprehensive knowledge coverage.
- Maintenance and Updates: CrawlDesk handles ongoing management, including error detection (e.g., failed deployments) and notifications. You can view logs, crawled URLs, and settings for fine-tuning.
This workflow ensures your data remains up-to-date and accessible, reducing manual effort while enhancing AI accuracy.
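CrawlDesk's public API is not documented in this guide, so the sketch below is purely illustrative: the base URL, endpoint path, auth header, and field names are assumptions chosen to mirror the dashboard fields described above (name, URL, max pages), not a confirmed interface. If CrawlDesk only supports dashboard-based management, the same values are simply entered in the "Add Data Source" form instead.

```typescript
// Illustrative sketch only: CrawlDesk's API surface is not documented here.
// The base URL, endpoint path, auth header, and field names are assumptions.
const API_BASE = "https://api.crawldesk.example/v1"; // hypothetical base URL
const API_KEY = "<your-api-key>";                    // hypothetical auth token

interface DataSourceRequest {
  name: string;     // display name shown in the dashboard
  url: string;      // site root or docs path to crawl
  maxPages: number; // crawl limit, mirroring the "Max Pages" setting
}

// Submit a data source; CrawlDesk validates it and queues it for processing.
async function addDataSource(projectId: string, source: DataSourceRequest) {
  const res = await fetch(`${API_BASE}/projects/${projectId}/data-sources`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(source),
  });
  if (!res.ok) throw new Error(`Failed to add data source: ${res.status}`);
  return res.json(); // assumed to echo back the queued source with an id
}

addDataSource("proj_123", {
  name: "Developer Docs",
  url: "https://example.com/docs/",
  maxPages: 50, // start small and scale up (see Best Practices below)
}).then((ds) => console.log("Queued data source:", ds));
```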
Key Features and Benefits of Data Sources
- Auto Sync: Automatically detects and syncs changes in connected sources (e.g., updated Google Docs or Notion pages) to keep your index current without manual intervention.
- Recrawl: Schedule or trigger recrawls on demand to refresh outdated content, ensuring the AI always uses the latest data (see the sketch after this list).
- Support for All Docs Platforms: Compatible with a wide range of platforms, including websites, PDFs, Google Docs, Confluence, Notion, Google Drive, and more, allowing seamless integration of diverse documentation types.
- Scalability: Handles large volumes of data with configurable limits (e.g., max pages) to prevent overload.
- Real-Time Monitoring: Provides detailed dashboards for tracking deployment progress, including metrics like processed pages, sections, and chunks.
- Error Handling and Logs: Built-in logging and failure notifications to troubleshoot issues quickly.
- Project-Level Isolation: Data sources are isolated per project, ensuring secure separation of data across different teams or use cases (e.g., marketing vs. engineering docs).
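As a companion to the Recrawl and Real-Time Monitoring features above, here is a hedged sketch of what triggering a recrawl and polling deployment progress might look like programmatically; the endpoints, response fields, and state names are assumptions, not CrawlDesk's documented API.

```typescript
// Illustrative sketch only: endpoints, response fields, and state names below
// are assumptions, not CrawlDesk's documented API.
const API_BASE = "https://api.crawldesk.example/v1"; // hypothetical base URL
const API_KEY = "<your-api-key>";                    // hypothetical auth token
const headers = { Authorization: `Bearer ${API_KEY}` };

// Trigger an on-demand recrawl so the index reflects the latest content.
async function recrawl(projectId: string, sourceId: string): Promise<void> {
  const res = await fetch(
    `${API_BASE}/projects/${projectId}/data-sources/${sourceId}/recrawl`,
    { method: "POST", headers },
  );
  if (!res.ok) throw new Error(`Recrawl failed to start: ${res.status}`);
}

// Poll deployment progress, mirroring the dashboard metrics
// (processed pages, sections, chunks) until the crawl finishes or fails.
async function waitForIndex(projectId: string, sourceId: string) {
  for (;;) {
    const res = await fetch(
      `${API_BASE}/projects/${projectId}/data-sources/${sourceId}/status`,
      { headers },
    );
    const s = await res.json(); // assumed: { state, pagesProcessed, sections, chunks }
    console.log(`pages: ${s.pagesProcessed}  sections: ${s.sections}  chunks: ${s.chunks}`);
    if (s.state === "completed" || s.state === "failed") return s;
    await new Promise((r) => setTimeout(r, 5_000)); // wait 5s between polls
  }
}
```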
Business Use Cases
Data sources in CrawlDesk enable businesses to make their documentation instantly searchable, improving efficiency and user experience. Below are key applications:
- Customer Support Knowledge Base: Index help center articles or FAQs from a website or PDFs to power AI-driven customer support, reducing response times and support ticket volume by up to 30%.
- Internal Knowledge Management: Crawl internal Confluence or Notion pages to create a searchable knowledge hub for employees, streamlining access to company policies or project documentation.
- Developer API Documentation: Index developer docs from websites or Google Docs to provide instant answers for technical queries, enhancing developer experience and adoption rates.
- Training and Onboarding Materials: Use Google Drive or PDF sources to index training manuals, enabling new hires to quickly find onboarding resources via AI search.
Example: A tech startup indexes its Docusaurus developer docs as a data source, enabling its support team to resolve API-related queries 50% faster by providing instant answers through a CrawlDesk AI widget.
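To make this example concrete, the sketch below shows what querying the indexed docs programmatically might look like; the /query endpoint, request body, and response shape are hypothetical stand-ins for whatever interface actually backs the "Ask AI" widget.

```typescript
// Illustrative sketch only: the /query endpoint and response shape are
// hypothetical stand-ins for the interface behind the "Ask AI" widget.
const API_BASE = "https://api.crawldesk.example/v1"; // hypothetical base URL
const API_KEY = "<your-api-key>";                    // hypothetical auth token

// Ask a natural-language question against a project's indexed data sources.
async function askAI(projectId: string, question: string): Promise<string> {
  const res = await fetch(`${API_BASE}/projects/${projectId}/query`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ question }),
  });
  if (!res.ok) throw new Error(`Query failed: ${res.status}`);
  const data = await res.json(); // assumed: { answer: string, sources: [...] }
  return data.answer;
}

askAI("proj_123", "How do I authenticate against the REST API?")
  .then((answer) => console.log(answer));
```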
Best Practices for Crawling Data Sources
To optimize crawling efficiency, accuracy, and performance in CrawlDesk, follow these best practices:
- Start Small and Scale: Begin with a limited scope (e.g., set a low "Max Pages" value such as 10-50 for websites) to test crawling and indexing, then gradually increase limits once results are validated to avoid overwhelming the system or incurring unnecessary processing (a configuration sketch follows this section).
- Use Specific Domains and Paths: Specify precise URLs or folders (e.g., https://example.com/docs/ instead of the root domain) to focus crawling on relevant content, reducing noise and improving index quality.
- Verify Access and Permissions: Ensure the crawler has proper access (e.g., public URLs or authenticated integrations for Confluence/Google Drive). Check the logs for errors such as 403/404 before full deployment.
- Monitor and Optimize: Regularly check the dashboard for progress metrics (e.g., chunks indexed) and errors. Use logs to identify issues such as duplicate content or crawl blocks, and adjust settings (e.g., exclude irrelevant sections) for better results.
- Prioritize Content Quality: Focus on high-quality, text-rich sources; pages that are mostly navigation, media, or boilerplate add little to the index.
Following these practices ensures efficient resource use, high AI relevance, and minimal downtime.
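The practices above can be captured in a single crawl configuration. The sketch below is illustrative only; the field names are assumptions that mirror the dashboard settings mentioned in this guide (name, URL, max pages), plus a hypothetical exclude list, not CrawlDesk's documented schema.

```typescript
// Illustrative sketch only: field names are assumptions mirroring the
// dashboard settings mentioned in this guide, not a documented schema.
interface CrawlConfig {
  name: string;            // display name in the dashboard
  url: string;             // a specific docs path, not the root domain
  maxPages: number;        // start low (10-50), raise once results look good
  excludePaths?: string[]; // hypothetical: skip noisy or irrelevant sections
}

const pilotConfig: CrawlConfig = {
  name: "Product Docs (pilot)",
  url: "https://example.com/docs/",  // scoped path keeps the crawl focused
  maxPages: 25,                      // small pilot before scaling up
  excludePaths: ["/docs/changelog/", "/docs/archive/"],
};

console.log(JSON.stringify(pilotConfig, null, 2));
```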
Related Resources
- Next Steps: See the Data Source Management Instructions for detailed steps on adding and monitoring data sources.