Frustrations excluding URLs without ‘www’ from a Nutch 1.7 crawl

I’m currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.

Specifically, after running the crawl, indexing to Solr 4.5, and validating the results on the front end with AJAX Solr, the search results page lists results/pages under both ‘www’ and non-‘www’ URLs.

My understanding is that the URL filter configuration, regex-urlfilter.txt, needs to be modified. Are there any regex or Nutch experts who could suggest a solution?

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# accept anything else
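
The first-match rule described in the comments above is what makes rule order matter here. Below is a minimal Python sketch of that behaviour, not Nutch's actual code: the rule list is illustrative rather than the contents of my file, `example.com` is a placeholder for the real domain, and the `accepts` helper is hypothetical. It assumes patterns are matched with an unanchored search, and it shows a reject rule for the bare host placed ahead of the final accept-anything rule.

```python
import re

# Illustrative rules in file order: (sign, regex).
# "example.com" is a placeholder domain, not taken from the post.
RULES = [
    ("-", r"^(file|ftp|mailto):"),            # skip file:, ftp: and mailto: URLs
    ("-", r"^https?://example\.com(/|$)"),    # assumed extra rule: reject the non-www host
    ("+", r"."),                              # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose regex matches is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched, so the URL is ignored.
    return False


if __name__ == "__main__":
    for url in ("http://www.example.com/about", "http://example.com/about"):
        print(url, accepts(url))
```

If that assumption holds, the equivalent line in regex-urlfilter.txt would be `-^https?://example\.com(/|$)`, placed above the final accept-anything rule. Note that this only drops the non-www URLs from the crawl; it does not rewrite them to their www form.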

Also posted on Stack Overflow and Pastebin.
