How do I crawl a website?

  1. Go to the Knowledge Base of the Library you want to add knowledge to.

  2. Click the Add Knowledge button.

  3. Select Crawl website from the Add Knowledge window.

The page changes to show the options for crawling a website.

  4. Paste the URL of the site you want to crawl in the URL field.
  5. Click the Submit button. Crawling the website takes some time. While it runs, you can check how many pages have been found so far in the bottom-left corner of the Crawl Website section.
  6. Once crawling is done, a window shows the crawled pages and the number of credits that will be consumed to add them to the knowledge base.

You can selectively include crawled pages in your knowledge base by using the checkboxes located in the third column of the table.

For instance, in this case, I excluded the Log in to DoNoHarm page from the knowledge base. Excluding a page also reduces the credits consumed by the credit value of that page.

  7. Click the Submit button in the window to confirm and add the selected pages to your library.
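
The credit adjustment described above is a simple sum over the pages that are still checked. The sketch below illustrates the arithmetic only. The page titles other than Log in to DoNoHarm, the credit values, and the function name are made up for the example, and the actual per-page credit cost is whatever the confirmation window shows.

```python
def credits_to_consume(pages):
    """Add up the credits of every page whose checkbox is still ticked."""
    return sum(page["credits"] for page in pages if page["included"])


# Hypothetical crawl result: titles and credit values are illustrative only.
crawled_pages = [
    {"title": "Home", "credits": 3, "included": True},
    {"title": "Pricing", "credits": 2, "included": True},
    {"title": "Log in to DoNoHarm", "credits": 1, "included": True},
]

before = credits_to_consume(crawled_pages)

# Untick the log-in page, as in the example above.
crawled_pages[2]["included"] = False
after = credits_to_consume(crawled_pages)

print(before, after)  # 6 5
```

The total drops by exactly the credit value of the excluded page.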

It’s scraping unnecessary content. How do I refine what I scrape?

To narrow the scrape, open the Advanced accordion and use CSS selectors to pick the content you want from the page at the URL.

  1. Open the Advanced accordion by clicking on it.
  2. Enter the selectors for the content you want to scrape in the text field. These selectors determine which parts of the page at the URL are scraped.

For example, suppose a page contains the following section:

Sample HTML Code
<section role="doc-introduction">
  <h2 id="introduction" tabindex="-1">Introduction</h2>
  
  <p>I realize not everybody’s going to ditch the Web and switch to Gemini or Gopher today (<span data-literal="that would be a difficult and unrealistic transition">that’ll take, like, at least a month /s</span>). Until that happens, here’s a non-exhaustive, highly-opinionated list of best practices for websites that focus primarily on text. I don’t expect anybody to fully agree with the list; nonetheless, the article should have at least some useful information for any web content author or front-end web developer.</p>
  
  <h3 id="inclusive-design" tabindex="-1">Inclusive design</h3>
  <p>My primary focus is <a href="https://100daysofa11y.com/2019/12/03/accommodation-versus-inclusive-design/">inclusive design</a>. Specifically, I focus on supporting <em>underrepresented ways to read a page</em>. Not all users load a page in a common web-browser and navigate effortlessly with their eyes and hands. Authors often neglect people who read through accessibility tools, tiny viewports, machine translators, “reading mode” implementations, the Tor network, printouts, hostile networks, and uncommon browsers, to name a few. I list more niches in <a href="#conclusion">the conclusion</a>. Compatibility with so many niches sounds far more daunting than it really is: if you only selectively override browser defaults and use plain-old, semantic HTML (<abbr title="plain-old, semantic HTML">POSH</abbr>), you’ve done half of the work already.</p>
  <p>One of the core ideas behind the flavor of inclusive design I present is <dfn id="inc-by-default" tabindex="-1">inclusivity by default</dfn>. Web pages shouldn’t use accessible overlays, reduced-data modes, or other personalizations if these features can be available all the time. Personalization isn’t always possible: Tor users, students using school computers, and people with restrictive corporate policies can’t “make websites work for them”; that’s a webmaster’s responsibility.</p>
  <p>At the same time, many users do apply personalizations; sites should respect those personalizations whenever possible. Balancing these two needs is difficult. Some features conflict; you can’t display a light and dark color scheme simultaneously. Personalization is a fallback strategy to resolve conflicting needs. Disproportionately underrepresented needs deserve disproportionately greater attention, so they come before personal preferences instead of being relegated to a separate lane.</p>
  
  <h3 id="prior-art" tabindex="-1">Prior art</h3>
  <p>You can regard this article as an elaboration on existing work by the Web Accessibility Initiative (<abbr title="Web Accessibility Initiative’s">WAI</abbr>).</p>
  <p>I’ll cite the <abbr>WAI’s</abbr> <span class="h-cite" itemprop="citation" itemscope="" itemtype="https://schema.org/TechArticle"><cite itemprop="name" class="p-name"><a class="u-url" itemprop="url" href="https://www.w3.org/WAI/WCAG22/Techniques/">Techniques for WCAG 2.2</a></cite></span> a number of times. Each “Success Criterion” (requirement) of the WCAG has possible techniques. Unlike the <cite>Web Content Accessibility Guidelines</cite> (<abbr title="Web Content Accessibility Guidelines">WCAG</abbr>), the Techniques document does not list requirements; rather, it serves to non-exhaustively educate authors about <em>how</em> to use specific technologies to comply with the WCAG. I don’t find much utility in the technology-agnostic goals enumerated by the WCAG without the accompanying technology-specific techniques to meet those goals.</p>
  <p>I’ll also cite <span class="h-cite" itemid="https://www.w3.org/TR/coga-usable/" itemprop="citation" itemscope="" itemtype="https://schema.org/TechArticle"><cite itemprop="name" class="p-name"><a class="u-url" itemprop="url" href="https://www.w3.org/TR/coga-usable/">Making Content Usable for People with Cognitive and Learning Disabilities</a></cite>, by <span itemscope="" itemtype="https://schema.org/Organization" itemprop="publisher">the WAI</span></span>. The document lists eight objectives. Each objective has associated personas, and can be met by several design patterns.</p>
  
  <h3 id="why-this-article-exists" tabindex="-1">Why this article exists</h3>
  <p>Performance and accessibility guidelines are scattered across multiple WAI documents and blog posts. Moreover, guidelines tend to be overly general and avoid giving specific advice. Guidelines from different places tend to contradict each other, especially when they have different goals (e.g., security and accessibility). They also tend to be focused on large corporate sites rather than the simple text-oriented content the Web was made for.</p>
  <p>I wanted to create a single reference with non-contradictory guidelines, containing advice more specific and opinionated than existing material. I also wanted to approach the very different aspects of site design from the same perspective and in the same place, allowing readers to draw connections between them.</p>
</section> 

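Suppose you only want to scrape the text under Inclusive design and Prior art. The following CSS selector selects the paragraphs that follow those two headings. Note that ~ matches every later sibling, so paragraphs under any heading after Prior art in the same section (here, Why this article exists) are matched too.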

CSS Selector
#inclusive-design ~ p, #prior-art ~ p
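
As a rough illustration of what a selector of the form `#id ~ p` picks up, here is a minimal Python sketch that uses only the standard library. It is not how the crawler is implemented. The class and function names, the `excluded_after_ids` parameter, and the shortened sample markup are all assumptions made for this example, and the sketch expects well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not open a new frame.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SiblingParagraphs(HTMLParser):
    """Collect the text of <p> elements that follow a sibling with a given id."""

    def __init__(self, anchor_ids, excluded_after_ids=()):
        super().__init__()
        self.anchor_ids = set(anchor_ids)
        self.excluded_after_ids = set(excluded_after_ids)
        # One frame per open element: ids seen so far among its children.
        self.seen_stack = [set()]
        self.capture_depth = None  # stack depth at which a matched <p> opened
        self.matches = []

    def handle_starttag(self, tag, attrs):
        seen = self.seen_stack[-1]
        # Match before recording this element's own id: an element is not
        # its own sibling.
        if (tag == "p" and self.capture_depth is None
                and (seen & self.anchor_ids)
                and not (seen & self.excluded_after_ids)):
            self.capture_depth = len(self.seen_stack)
            self.matches.append("")
        element_id = dict(attrs).get("id")
        if element_id is not None:
            seen.add(element_id)
        if tag not in VOID_TAGS:
            self.seen_stack.append(set())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self.seen_stack) == 1:
            return
        self.seen_stack.pop()
        if self.capture_depth == len(self.seen_stack):
            self.capture_depth = None  # the matched <p> has closed

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.matches[-1] += data


def select_following_paragraphs(html, anchor_ids, excluded_after_ids=()):
    parser = SiblingParagraphs(anchor_ids, excluded_after_ids)
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.matches]


# Shortened stand-in for the sample section above.
SAMPLE = """
<section>
  <h2 id="introduction">Introduction</h2>
  <p>Intro text.</p>
  <h3 id="inclusive-design">Inclusive design</h3>
  <p>Design text with a <a href="#x">link</a>.</p>
  <h3 id="prior-art">Prior art</h3>
  <p>Prior art text.</p>
  <h3 id="why-this-article-exists">Why this article exists</h3>
  <p>Why text.</p>
</section>
"""

# Emulates: #inclusive-design ~ p, #prior-art ~ p
broad = select_following_paragraphs(SAMPLE, ["inclusive-design", "prior-art"])
# Emulates stopping at the next heading.
narrow = select_following_paragraphs(
    SAMPLE, ["inclusive-design", "prior-art"], ["why-this-article-exists"])
print(broad)
print(narrow)
```

On this shortened sample, the first call also returns the paragraph under Why this article exists, while the second returns only the Inclusive design and Prior art paragraphs. If the selector field accepts the Selectors Level 4 form of `:not()`, which is an assumption to verify, `#inclusive-design ~ p:not(#why-this-article-exists ~ p)` would express the narrower selection that the second call emulates.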

  3. Click the Submit button.

Once submitted, a window appears showing how many credits the scrape will use.

If the selector you entered is invalid, an error appears in the window. Make sure the selector is valid before you proceed.

  4. Click the Submit button in the window to confirm the scrape and add the content to your library.

Advanced Scraping Output