Search engine submission - a waste of time!
June 13, 2008 | SEO
At this point I should say that the process of search engine submission is a complete waste of time. You should never submit a page that is already in an engine’s index. Search engine submission exists to alert an engine to new pages, and each engine is quite particular about how you submit to it. This is because people make pointless submissions on a regular basis, submitting pages that are not only already known about but haven’t even been updated. If you do this you could find yourself penalised, depending on the engine and whether you’ve broken its rules. Even if you have updated a page or, more importantly, have a new page, it is still a waste of time submitting it.
When you submit to an engine’s search engine submission process, your request goes into a long list of submissions. Each engine, as we have pointed out, has its own method of finding new content, and spending any amount of time processing pointless user-generated submissions is going to make any search engine less efficient. Imagine you are Google! Now imagine keeping tabs on 160,000,000,000 web pages. Any distraction from that and Google would be presenting more and more out-of-date results. That’s not great for the Big G’s image. By the time an engine does deal with a worthwhile submission it is several weeks later at the very earliest. By that time it is no longer worthwhile, and I’d hope that if you have new pages you’d prefer them to be spidered in days, not weeks or months.
An engine will find a new page or site by finding a link to it on another page, and this is how you get pages spidered quickly. It’s as simple as that! If you have a new page it’s best to link to it from your index page, as it will get picked up earlier. Better still if you can get a link from an external website.
The more important the website, and particularly the more frequently content is updated on it, the more regularly it will get spidered. Here at WMA we know of a site or two where we can get links spidered in a matter of minutes! But you would expect that, wouldn’t you!
Fresh Content
June 12, 2008 | SEO
Fresh Content is the best way to get a spider to visit your site. This is a really, really important subject, so we will cover it properly in another post. You can’t dictate to a spider how frequently it visits your site or what pages it visits. Search Engines know how frequently your pages get changed because they store some of the information they retrieve every time they visit. A sitemap will help, and having good navigation architecture in your site is a must. It’s also good to link to new pages, especially critical ones, from your index page, as that’s the page spiders will visit most frequently. The best thing to do is create fresh content. The more often engines visit and find new content, the more frequently they will come back.
Robots Meta Tag
June 11, 2008 | SEO, Uncategorized
Yesterday we looked at how to use the Robots Exclusion Standard (that is, your “robots.txt” file to you and me) to prevent spiders from visiting certain pages on your web site. Today we’re going to look at how to do the same with the Robots Meta Tag.
This is similar to your robots.txt file, but the tag as shown here addresses all spiders at once rather than ones you name. It does have the added option of stipulating that a spider can index a page but not follow the links on it, or vice versa. This is useful for pages you want indexed but don’t want to pass PageRank on from.
I see plenty of sites that include a ‘yes please index my page and yes please follow my links’ tag. The spiders are going to do this anyway, so why should you bother? The answer is simply that you shouldn’t:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
The three useful variations are: please follow but don’t index, please index but don’t follow, or do neither.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
We’d advise that if you’re going to use both “no follow” and “no index” on a page, it’s preferable to stipulate that in your robots.txt file instead. Otherwise, for the spiders it’s a bit like knocking on someone’s door, the door opens, and you’re told not to come in. The power of the Robots Meta Tag is in cases where you want the one but don’t want the other.
It’s important to know that the index-but-don’t-follow tag is one of the two ways you can have content spidered without passing Link Juice or PageRank to the pages you’re linking to. It is often used by rogue companies who advertise that they trade in either but actually cheat you out of a valid link. The other method is the “no follow” attribute you can put on individual links, as mentioned in our post on Benchmarking Incoming Links.
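If you want to check which of these variations a page is actually using, a few lines of Python will do it. This is only a rough sketch using the standard library’s html.parser module. The names RobotsMetaParser and robots_policy are our own inventions for illustration, and the sketch assumes a page with no tag is treated as index and follow, as described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated; compare them case-insensitively.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Assumption: with no explicit "no" directive, spiders index and follow.
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


page = '<html><head><META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"></head></html>'
print(robots_policy(page))  # (True, False): index the page, don't follow links
```

Running this over a page that claims to link to you is a quick way to spot the index-but-don’t-follow trick before you agree to a link exchange.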
Robots.txt File
June 10, 2008 | SEO
In the previous post on Search Engine Friendly Websites we learnt what search engine spiders are. Now we can look at what we can do to influence spidering behaviour. Spidering is an often overlooked part of SEO. If we write a page of content and put it on the web, getting it spidered is the first thing we should be doing. Remember, if Google doesn’t have it in its index it can’t rank it. Right?
We can actually dictate to each engine how we want its spiders to behave in our site. In fact, they like us to do this. So how do we do it?
Technically we can do this with the robots text file and the robots meta tag, and we can also use fresh content like a blog to attract a spider’s attention. Today we’re going to stick to the “robots.txt” file and tomorrow we will deal with the Robots Meta Tag.
Robots.txt
The Robots Exclusion Standard (also known as the Robots Exclusion Protocol) is a method of stipulating to spiders what they can or can’t see. You do this in the robots.txt file, which is a list of the folders and files you don’t want the spiders to index.
Firstly you stipulate which spiders your rules apply to, and secondly what you don’t want them to spider, so:
User-agent: *
This stipulates that all bots follow your list. You may instead want to name particular bots, for example if one engine is penalising you, say for duplicate content reasons, but others aren’t. In that case you can name each one individually:
User-agent: Slurp
User-agent: Googlebot
User-agent: MSNBot
If you didn’t know, “Slurp” is Yahoo’s bot. You then tell them what you don’t want them to see, so:
Disallow: /admin/
This tells spiders that you don’t want them looking at anything in the admin directory. Just as Disallow is supported so is Allow, though only by some engines; these do include the major player, Google. This could be useful if you wish to include a particular file in a folder you have disallowed:
Allow: /don’t-look-here/but-visit-this-page.html
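You can test rules like these before you upload them. Here is a minimal sketch using Python’s standard urllib.robotparser module. The rules, the file name public-page.html and the helper allowed are made up for illustration. Note that Python’s parser applies the first rule that matches, so in this sketch the Allow line is listed before the Disallow line; real search engines may resolve the two differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration. Python's parser uses the first
# matching rule, so the Allow line comes before the Disallow line.
rules = """\
User-agent: *
Allow: /admin/public-page.html
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the named spider may fetch the given path."""
    return parser.can_fetch(agent, "http://www.mysite.com" + path)


print(allowed("/admin/settings.html"))     # False: inside the disallowed folder
print(allowed("/admin/public-page.html"))  # True: explicitly allowed
print(allowed("/index.html"))              # True: no rule matches
```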
You can also tell a spider where your sitemap is. If you don’t have a sitemap, create one. Even if you don’t think it’s necessary, check by typing site:mysite.com into any engine. It will tell you how many pages from the site you stipulated are in its index. If there are big variations between engines then you know some spiders are having problems, and a sitemap is the best way to sort that out once you have checked your site’s navigational architecture.
MSNbot isn’t as good at spidering as the other two big engines’ bots. It tends to find problems and respond to them by either spidering a bunch of unimportant content or ignoring it altogether. MSN also doesn’t have a webmaster tool, unlike Yahoo or Google, where you can tell the engine directly where your sitemap is, so listing it in robots.txt is the next best thing. It looks like this:
Sitemap: http://www.mysite.com/sitemap.xml
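To confirm the Sitemap line can be read back by a program, you can run your file through Python’s urllib.robotparser. This is a small sketch, assuming Python 3.8 or later (where the site_maps() method is available) and a made-up robots.txt built from the examples in this post.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file contents assembled from the examples in this post.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: http://www.mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of sitemap URLs, or None if none are listed.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.mysite.com/sitemap.xml']
```

Anything stray on the line, such as a full stop after the address, ends up as part of the URL, so keep the line clean.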
You can also, if you want, stipulate how often a spider may visit. Bear in mind that this is an extension to the standard, so not every spider will honour it. For example, this would be 1 visit every 60 seconds:
Request-rate: 1/60
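If you write your own well-behaved bot, this is roughly how it could read that line. It is a sketch only, using request_rate() from Python’s urllib.robotparser and assuming Python 3.8 or later. The function name seconds_between_visits is ours, not part of any standard.

```python
from urllib.robotparser import RobotFileParser


def seconds_between_visits(robots_txt, agent):
    """Return the pause a polite bot should leave between visits, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    rate = parser.request_rate(agent)  # named tuple (requests, seconds) or None
    if rate is None or rate.requests == 0:
        return None
    return rate.seconds / rate.requests


robots_txt = """\
User-agent: *
Request-rate: 1/60
"""

print(seconds_between_visits(robots_txt, "Googlebot"))  # 60.0
```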
You can also stipulate a visit time. This may be useful if you have a particular promotion or conference call with your users at a set time each day, or if your response times slow down at certain times of the day, as you can free your web site’s server from spiders during those periods. The example below asks spiders to visit only between 06:00 and 08:45:
Visit-time: 0600-0845
Note: This is highly important. Make sure you spell everything in your robots.txt file correctly. I have seen examples of major engines not spidering whole sites because the robots.txt file had typos in it.
Good spiders follow this to the letter; bad spiders don’t. You can trust spiders from the big search engines like Google, Yahoo and MSN. Rogue spiders, such as those harvesting email addresses to sell to spammers, can’t be trusted in the same way. So listing confidential information in your robots.txt file will not, on its own, make those parts of your site secure, and you shouldn’t rely on it solely for this purpose. But you can be assured that good spiders will follow your robots.txt file to the letter.
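A quick way to catch typos is to check every line against the directives you meant to use. The sketch below is deliberately simple. The list of known directives covers only the ones mentioned in this post, and the function name find_unknown_directives is made up, so treat it as a starting point rather than a full validator.

```python
# Assumption: only the directives discussed in this post count as "known".
KNOWN_DIRECTIVES = {
    "user-agent", "disallow", "allow", "sitemap", "request-rate", "visit-time",
}


def find_unknown_directives(robots_txt):
    """Return (line_number, text) pairs for lines with unrecognised directives."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        text = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not text:
            continue
        name, separator, _value = text.partition(":")
        if not separator or name.strip().lower() not in KNOWN_DIRECTIVES:
            problems.append((number, line.strip()))
    return problems


sample = """\
User-agent: *
Dissallow: /admin/
Sitemap: http://www.mysite.com/sitemap.xml
"""

print(find_unknown_directives(sample))  # [(2, 'Dissallow: /admin/')]
```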
Spiders - How They Index The Web
June 9, 2008 | SEO
Spiders are the elements of Search Engines that find or index web pages. Though that’s pretty near the mark, if we’re being picky it’s not strictly true, in that they have nothing to do with the indexing process other than the initial data capture of individual web pages. In fact they’re not even really called spiders; that’s just the term we SEOs use for them.
The term the engines use for them is bots, short for robots, which is a good description. Why spiders? Well, the best description of what they do is spidering or crawling! If you imagine the websites a spider would visit and dot them on a piece of paper, then join up the dots, hey presto, you’ve got a spider’s web.
Bots are automated programs. That is, they’re told by the search engines to behave in a certain way given certain data. So a search engine gives its bots a list of websites and the spiders trundle off at lightning speed, capturing the source code of each site so its engine can start indexing it. Each bot will report any links it finds back to its engine, and when the engine finds a link it hasn’t come across yet it will add that link to the list of another of its spiders.
If this all sounds a bit complicated, there are a few key learning points.
1. The search engines have a system of scouring the web for sites automatically.
2. As it’s done automatically, by machine, your website has to comply with certain rules, otherwise it can’t be read.
3. Bots/spiders love reporting links! Hint: get some links to your site!
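To make the process above a little more concrete, here is a toy version of it in Python. It is only a sketch of the general idea, not how any real engine works. The link graph and site names are invented, and a real spider would fetch pages over the web rather than look them up in a dictionary.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.com/": ["a.com/about", "b.com/"],
    "a.com/about": ["a.com/"],
    "b.com/": ["c.com/new-page"],
    "c.com/new-page": [],
    "orphan.com/": [],  # nothing links here, so the spider never finds it
}


def crawl(seeds, links):
    """Visit pages breadth-first, queueing each newly reported link once."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:  # a link the engine hasn't come across yet
                seen.add(target)
                queue.append(target)
    return visited


print(crawl(["a.com/"], LINKS))
# ['a.com/', 'a.com/about', 'b.com/', 'c.com/new-page']
```

Notice that orphan.com/ is never visited. With no links pointing at it, the spider has no way of knowing it exists, which is exactly why point 3 matters.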
