ControllingWebRobots
Summary: How to control web robots or bots trying to scan files
Version: 1.0
Prerequisites:
Status: stable
Maintainer:
Categories: Security
Question

How can I control web robots that try to scan (or index) my wiki? The robots.txt file is too complicated to maintain. In particular, I don't want robots following Edit or History links.

Answer

To some extent, PmWiki already controls robots, but you can add custom markup to refine your control.

As distributed, PmWiki adds <meta name='robots' ... /> tags automatically to every page. For normal browsing of pages not in the PmWiki group, the value is "index,follow"; for all other actions (edit, upload, diff, etc.) the value is "noindex,nofollow". The pages in the PmWiki group are not indexed, except for the PmWiki.PmWiki page itself.

An admin can explicitly control the value of the robots meta-tag by setting

An admin can add this custom markup to a config.php file:
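# Define a (:robots ...:) directive that stores its argument in $MetaRobots.
# Note: the /e regex modifier was removed in PHP 7, so this form needs an older PHP.
Markup('robots', 'directives',
  '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/e',
  "PZZ(\$GLOBALS['MetaRobots'] = '$1')");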
Then one can do any of
(:robots index,follow:)
(:robots index,nofollow:)
(:robots noindex,follow:)
(:robots noindex,nofollow:)
to change the <meta name='robots' ... /> tag.

If you want to make sure that robots go ahead and index a page and follow links on all pages (including the PmWiki docs), then you can set (in local/config.php):
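A minimal sketch of such a setting, assuming the variable involved is $MetaRobots (the global that the custom markup above assigns; the name is inferred from that markup, so treat it as an assumption):

```php
# Fragment for local/config.php.
# Assumption: $MetaRobots holds the content of the robots meta tag.
# Ask robots to index every page and follow its links, PmWiki docs included.
$MetaRobots = 'index,follow';
```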
Newer versions of PmWiki (since 2.1.beta8) automatically return "403 Forbidden" errors to robots for any action other than ?action=browse, ?action=rss, or ?action=dc. You can extend this functionality to cookie setting configuration actions like ?setskin=... etc or other queries in links by adding a dummy action to the link:

In addition, if $EnableRobotCloakActions is set, then any ?action= parameters are removed from page links when viewed by a robot, so that those robots won't blindly follow links to unimportant pages. At the moment $EnableRobotCloakActions is disabled by default, because some admins may feel that presenting robots with such modified views of a page might cause their sites to be negatively rated by search engines. (I've seen opinions on both sides of the issue here.)
- Pm on Pmwiki-users list

Another Answer

In your skin's template file, add a

Well-behaved search robots that would follow this link
<a href='$PageUrl?action=diff' title='$[History of this page]'>$[Page History]</a>
would not follow this one
<a href='$PageUrl?action=diff' title='$[History of this page]' rel='nofollow'>$[Page History]</a>
--HaganFox

Actually, if one reads the Google link carefully it doesn't say that

Updated Answer

For PmWiki 2.2, here's something you can use if you want to allow robots to follow links to external sites and avoid wasting bandwidth by having robots blindly follow links to unimportant wiki pages.

# Remove the default "rel='nofollow'" attribute for external links.
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Eliminate forbidden ?action= values from page links returned to robots.
$EnableRobotCloakActions = 1;
--HaganFox

Discussion

I cannot see any reason to add rel='nofollow' to the Edit Page and Page History links since pmwiki adds 'noindex, nofollow' automatically to the meta tag on the edit and history page, so they are not indexed by default.
So it seems to me that pmwiki is already controlling search bots and preventing well-behaved bots (ones which look at the meta tag) from indexing the history and edit pages. Badly behaved search bots may be better kept out of the wiki by means of an exclusion in a robots.txt file. It would be good to have advice about this here too.
With pmwiki 2.0 beta 20 the attribute rel='nofollow' is added automatically to all links pointing to external sites, i.e. all url links. This extends pmwiki's attempts to control search bots even further and will help to reduce link-spamming. HansB

From a post by Pm about "comment spamming":

All of these options are presently available in PmWiki v2:
1. rel="nofollow" for all external links (new default for beta20)
2. rel="nofollow" for unapproved external links only
3. no rel="nofollow" at all (default for beta19 and earlier)
4. not linking unapproved external links at all

Personally, on pmwiki.org I'm going to do #2 -- i.e., rel="nofollow" for unapproved external links only, because I want approved links to gain the page rank benefit of having been listed on my site. I'll probably also add an icon or marker after the unapproved links that lets them be quickly approved (via the appropriate password).

There's also a fifth category of links -- those that are generated via the InterMap. It's my feeling that InterMap links should not receive the rel="nofollow", as those sites have already been approved by the site maintainer. But that will come in another release.

In PmWiki v1, one can add rel="nofollow" to external links via:

The custom markup doesn't have any effect, since $HTMLHeaderFmt['robots'] is defined earlier in stdconfig.php, with a fixed value based on $MetaRobots at this stage (global default). See: PITS.00393
How do you prevent robots from indexing or following links like ...Group/PageName/?setprefs=.... or any other cookie setting action link? (HansB)

Note that you cannot prevent robots from following links; all you can do is advise them not to do so (and then refuse to serve content when they do). In July 2005, Google said that it would honor

So, there are two approaches -- you can hide such links from the robot ("cloaking"), or you can forbid content to a robot that follows the link. Many people feel that cloaking is an unwise practice, so that pretty much leaves forbidding content. So, the basic approach would have to be to send a "403 Forbidden" response if a robot sends a url that contains any query parameters other than
--Pm
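The forbidding approach Pm describes can be sketched roughly as follows. This is a hypothetical helper rather than PmWiki's own code: the function name robot_request_forbidden, the user-agent pattern, and the list of permitted parameters are all assumptions made for illustration, since the recipe does not say which parameters stay allowed.

```php
<?php
// Hypothetical helper, not part of PmWiki.
// Returns true when a robot requests a URL carrying a query parameter
// outside the allowed list, i.e. when a 403 response would be sent.
function robot_request_forbidden($userAgent, array $query) {
  // Assumed user-agent test; real robot detection needs a fuller pattern.
  $robotPattern = '/bot|crawler|spider|slurp/i';
  // Assumed list of parameters a robot may send; adjust to taste.
  $allowedParams = array('n', 'action');

  // Ordinary browsers are never blocked by this check.
  if (!preg_match($robotPattern, (string)$userAgent)) return false;

  // Forbid as soon as one parameter name is outside the allowed list.
  foreach (array_keys($query) as $name) {
    if (!in_array((string)$name, $allowedParams, true)) return true;
  }
  return false;
}

// Usage sketch, e.g. near the top of local/config.php.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (robot_request_forbidden($agent, $_GET)) {
  header('HTTP/1.1 403 Forbidden');
  exit('Forbidden');
}
```

Keeping the decision in a function that takes the user agent and the query array as arguments makes it easy to try out against sample requests before wiring it into a live wiki.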