We use our own spidering software to automatically crawl each website on the platform. If you're not seeing any data in the Content 360 module, there are a few possible causes:
- IP blocking - your server could simply be blocking our crawler server by blocking our IP. If this is the case, there is no way around it unless you remove the block. Our bot will show up in your server logs as follows:
126.96.36.199 Mozilla/5.0 (compatible; linkdexbot/2.1; +http://www.linkdex.com/bots/)
- Robots.txt blocking - as you can see, our crawler is called linkdexbot. If your robots.txt includes a line blocking this user agent, then we will follow good practice, obey that instruction and not crawl your site. You would need to get this restriction removed if you want us to be able to crawl the site. Once you have fixed the issue with robots.txt, please let us know by logging a ticket with our support team (firstname.lastname@example.org) and we will restart both jobs.
NB: if you're comparing our crawl count with Google's number of indexed pages (using the site: Google 'hack'), there are a number of reasons why these numbers can vary considerably. Apart from the points already outlined above, you also need to consider the following:
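To check whether a robots.txt file blocks the linkdexbot user agent, a minimal sketch using Python's standard `urllib.robotparser` module might look like the following. The helper name `is_crawl_allowed`, the sample robots.txt contents and the example.com URLs are illustrative assumptions, not part of our platform; in practice you would paste in the contents of your own robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "linkdexbot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample robots.txt that blocks linkdexbot from the whole site (assumed content).
BLOCKING = """\
User-agent: linkdexbot
Disallow: /
"""

# Sample robots.txt that only restricts one directory for all bots (assumed content).
OPEN = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_crawl_allowed(BLOCKING, "http://www.example.com/page"))
    print(is_crawl_allowed(OPEN, "http://www.example.com/page"))
```

If the function returns False for your homepage URL, the robots.txt rule is the likely reason no crawl data is appearing.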
- 404s - we won't count 404s as pages crawled, but will report those in the Deadlinks module. However, Google may include these in its total number of indexed pages.
- Orphaned pages - Google may index these if it has another way of finding them (for example, through backlinks from other domains). Remember, we won't see these as we're spidering internal links only.
- Subdomains - the site: hack may include subdomains, which we will ignore: our spidering software treats all subdomains as off-domain links and will not follow them. Google, however, may be giving you a number of pages indexed for the whole domain. You can easily verify this by running a site: hack and flicking through the sample results Google returns.
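The subdomain point can be illustrated with a short sketch. This is not our crawler's actual code; it assumes that "off domain" means the link's hostname differs from the hostname of the site being crawled, and the function name `is_internal` and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_internal(link: str, start_url: str) -> bool:
    """Treat a link as internal only if its hostname matches the start URL's.

    Assumption for illustration: any other hostname, including a subdomain
    of the same site, counts as off-domain and would not be followed.
    """
    return urlsplit(link).hostname == urlsplit(start_url).hostname


if __name__ == "__main__":
    start = "http://www.example.com/"
    print(is_internal("http://www.example.com/about", start))
    print(is_internal("http://blog.example.com/post", start))
```

Under this assumption, pages on a subdomain such as blog.example.com would appear in Google's site: total for example.com but not in the crawl count.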