The reading level for this article is Novice
First of all, you should understand the basic differences between a “spider” and an “indexer”.
The spider finds pages and follows links; the page content does not matter to it. It won’t crawl password-protected sites, usually avoids dynamic content because of “spider traps”, and respects robots.txt and meta tag exclusions.
The indexer’s primary job is to evaluate and score content (text, meta tag info, anchor text of incoming links, etc.). It also removes duplicates and aliases and filters spam. When the indexer detects spam, it adds the domain to its “do not crawl” list.
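To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved spider might decide whether it may fetch a page. It uses Python's standard `urllib.robotparser` module; the robots.txt content, the user agent name `ExampleSpider`, and the URLs are all made up for illustration and say nothing about how any particular search engine implements its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleSpider" is an assumed user agent name, not a real crawler.
allowed = parser.can_fetch("ExampleSpider", "http://www.example.com/products.html")
blocked = parser.can_fetch("ExampleSpider", "http://www.example.com/private/report.html")
print("products.html allowed:", allowed)
print("private/report.html allowed:", blocked)
```

Running a check like this against your own robots.txt is a quick way to confirm that you are not excluding pages you actually want crawled.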
Don’t require cookies or session IDs; doing so can harm your site’s ability to be indexed. Why? Because the spider accepts neither, your content management system may feed it the same page over and over again under different session IDs. This creates a “spider trap”: an endless loop that the spider eventually times out of and backs away from. A strong warning sign is one engine showing a page saturation that is out of proportion to the others; that is a red flag that a spider trap has occurred. You risk having your site dropped by that search engine within the next 90 days due to “indexing bloat”.
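One rough way to see whether session IDs are inflating your URL count is to strip them and count what is left. The sketch below assumes the session ID travels as a query parameter with one of the names in `SESSION_PARAMS`; those names and the sample URLs are assumptions, so substitute whatever your own content management system uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to match your own CMS.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def count_unique_pages(urls) -> int:
    """Count distinct pages once session IDs are ignored."""
    return len({strip_session_id(url) for url in urls})


# Hypothetical URLs a spider might collect from a session-based site.
crawled = [
    "http://www.example.com/catalog.php?item=7&sid=abc123",
    "http://www.example.com/catalog.php?item=7&sid=def456",
    "http://www.example.com/catalog.php?item=7&sid=ghi789",
]
print(len(crawled), "URLs crawled,", count_unique_pages(crawled), "unique page(s)")
```

If the number of crawled URLs is far larger than the number of unique pages, the spider is probably seeing the same content repeatedly.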
If you use a 404 error page, use only one generic page designed for that purpose, and do not, for any reason, redirect the 404 to your home page. Ensure that the title of this page is “Error 404”; this alerts the spider, and it will move on without indexing the page.
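As an illustration of that advice, the following sketch is a tiny WSGI application that answers unknown paths with a real 404 status and a single generic page titled “Error 404”, rather than redirecting to the home page. The page contents, the `KNOWN_PAGES` table, and the `request` helper are hypothetical; a production site would configure this in its web server or framework instead.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical set of pages that exist on the site.
KNOWN_PAGES = {
    "/": b"<html><head><title>Home</title></head><body>Welcome</body></html>",
}

# One generic error page, titled "Error 404" as the text recommends.
ERROR_PAGE = (
    b"<html><head><title>Error 404</title></head>"
    b"<body>The page you requested could not be found.</body></html>"
)


def app(environ, start_response):
    """Serve known pages; answer everything else with a 404, never a redirect."""
    body = KNOWN_PAGES.get(environ.get("PATH_INFO", "/"))
    if body is not None:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/html")])
    return [ERROR_PAGE]


def request(path):
    """Call the app in-process and return (status, body) for a path."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = request("/no-such-page.html")
print(status)
```

The important detail is the status line: the spider learns the page is missing from the 404 code itself, and the title simply reinforces it.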
If you have moved your site or moved content, use a permanent 301 redirect to point to the new content or domain. The spider will note the change and crawl the site properly on its next visit. A 301 also transfers the link credit from the old site to the new one, and user bookmarks and old links continue to work. 301s can be left in place for long periods of time.
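A permanent redirect is just a response with a 301 status and a `Location` header naming the new address. The sketch below shows the idea with a lookup table; the `MOVED` mapping, its paths, and the `redirect_for` helper are hypothetical, and most sites would express the same rule in their web server configuration.

```python
# Hypothetical mapping from old paths to their new permanent locations.
MOVED = {
    "/old-products.html": "http://www.example.com/products/",
    "/about-us.htm": "http://www.example.com/about/",
}


def redirect_for(path):
    """Return (status, headers) for a moved path, or None if it has not moved."""
    target = MOVED.get(path)
    if target is None:
        return None
    return "301 Moved Permanently", [("Location", target)]


print(redirect_for("/old-products.html"))
print(redirect_for("/contact.html"))
```

Keeping the mapping one-to-one (old page to its closest new equivalent) is a design choice that keeps bookmarks and old links landing on relevant content.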
Avoid Excessive URL Depth
Having a deep site decreases the chance that the spider will find all of your pages. Very deep URLs also tend not to rank as well, and they are difficult for visitors to email to others.
This would be classified as Depth One. It is NOT suggested here to have all pages on the root of the domain. Having depth of two or three levels would be acceptable.
This is six levels and probably would NOT be crawled.
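A quick way to audit depth is to count levels in each URL's path. The sketch below rests on an assumption about how depth is counted: every non-empty path segment, including the file name, counts as one level, so a page sitting in the root of the domain is depth one. The URLs are hypothetical, and `MAX_DEPTH` reflects the two-to-three-level guidance above.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 3  # upper end of the "two or three levels" guidance


def url_depth(url: str) -> int:
    """Count non-empty path segments (assumed definition of URL depth)."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def too_deep(urls, max_depth=MAX_DEPTH):
    """Return the URLs whose depth exceeds max_depth."""
    return [url for url in urls if url_depth(url) > max_depth]


# Hypothetical URLs, for illustration only.
urls = [
    "http://www.example.com/page.html",
    "http://www.example.com/products/widgets/blue.html",
    "http://www.example.com/a/b/c/d/e/page.html",
]
for url in urls:
    print(url_depth(url), url)
print("Too deep:", too_deep(urls))
```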
Database-Driven Sites
Static URLs get crawled, and dynamic pages that have incoming links from static pages will get crawled. However, links between dynamic pages are often problematic and sometimes do not get crawled. Limit the “URL Depth” when using a dynamic-to-static internal linking strategy. It is suggested to use the “Trusted Feed” program for Yahoo! Search. I highly recommend Evelyn Hepner at Position Technologies. Your site will need a minimum of 250 indexable pages. While you do have to pay for every click, using the XML feed could be more cost-effective than changing the entire architecture of your site.
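To find the links most likely to be skipped, you can flag every internal link that goes from one dynamic page to another. The sketch below assumes that “dynamic” simply means the URL carries a query string, which is a simplification; the link list is hypothetical.

```python
from urllib.parse import urlsplit


def is_dynamic(url: str) -> bool:
    """Treat a URL as dynamic if it has a query string (an assumed rule)."""
    return bool(urlsplit(url).query)


def risky_links(links):
    """Return the (source, target) pairs where both ends are dynamic."""
    return [(src, dst) for src, dst in links if is_dynamic(src) and is_dynamic(dst)]


# Hypothetical internal links as (source page, target page) pairs.
links = [
    ("http://www.example.com/index.html", "http://www.example.com/item.php?id=1"),
    ("http://www.example.com/item.php?id=1", "http://www.example.com/item.php?id=2"),
    ("http://www.example.com/item.php?id=2", "http://www.example.com/about.html"),
]
for src, dst in risky_links(links):
    print("dynamic-to-dynamic:", src, "->", dst)
```

Any target that only appears in the flagged list is a candidate for an extra link from a static page.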
Index Friendly Pages
It is vital that your site has unique content. The titles that you use should be page specific, meaning that the title of each page should be unique to the content on that page. This is also true with your meta tags, specifically with the description and keywords. If you update the page’s content in the future, review your title, description and keyword tags to ensure they are still relevant to the content. Only separate pages when there is separate content. Yahoo! would rather see one long page than five small pages.
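Checking for page-specific titles is easy to automate. The following sketch extracts the `<title>` from each page with Python's standard `html.parser` module and reports any title shared by more than one page. The `pages` dictionary stands in for HTML you have already fetched; its URLs and contents are invented for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def duplicate_titles(pages):
    """Map each title used by more than one page to the sorted list of its URLs."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        by_title[extract_title(html)].append(url)
    return {title: sorted(urls) for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical pages, for illustration only.
pages = {
    "/widgets.html": "<html><head><title>Our Products</title></head><body></body></html>",
    "/gadgets.html": "<html><head><title>Our Products</title></head><body></body></html>",
    "/about.html": "<html><head><title>About Example Co.</title></head><body></body></html>",
}
print(duplicate_titles(pages))
```

The same approach could be extended to description and keywords meta tags by also inspecting `meta` start tags.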
You should avoid “spam” at all costs. This includes using “doorway pages” and “doorway domains”. Keyword stuffing should also be avoided. Hidden text (text that is the same color as the background), hidden links, and even deceptive CSS can be detected by the indexer and viewed as spam. Link farms, massive domain interlinking (including off-topic links, which tend to dilute valuable links), and cloaking are also practices to avoid.
Report “Spam” to Yahoo!
To report spam, send an email with as much information as possible (the keywords used in the search, the offending URL, and why it is considered spam) to: email@example.com.
Review Yahoo! Content Guidelines
It is highly recommended that you review the content guidelines from Yahoo! at least once per quarter to stay up to date.
Designing in Flash
Yahoo! does not currently crack open Flash files to either follow links or extract textual elements. Even with the SDK from Macromedia, extracting text from SWF files provided little, if any, value. It was found that content providers weren’t optimizing the content for the search engines.
Test Your Site
You can use programs like Anawave’s WebSnake to mimic a “crawl” through your website to determine if a spider could deep crawl your site.
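If you want a feel for what such a tool checks, the sketch below simulates a crawl over an in-memory map of your site's internal links. It is not how WebSnake works; it is a simple breadth-first walk under stated assumptions: the `SITE` map is hypothetical, the crawl starts at the home page, and the spider gives up after `max_clicks` links from the start.

```python
from collections import deque

# Hypothetical site map: each page lists the pages it links to.
SITE = {
    "/": ["/products/", "/about.html"],
    "/products/": ["/products/widgets.html"],
    "/products/widgets.html": [],
    "/about.html": [],
    "/orphan.html": [],  # no other page links here
}


def crawl(site, start="/", max_clicks=3):
    """Return the set of pages reachable within max_clicks links of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, clicks = queue.popleft()
        if clicks == max_clicks:
            continue  # do not follow links beyond the click limit
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, clicks + 1))
    return seen


def unreachable(site, start="/", max_clicks=3):
    """Return the pages a simulated spider would never find."""
    return sorted(set(site) - crawl(site, start, max_clicks))


print("Unreachable pages:", unreachable(SITE))
```

Pages reported as unreachable either need an incoming link from a crawlable page or sit too many clicks away from the home page.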