Exposing the Solr endpoint through a reverse proxy also exposes the Solr admin panel to the end user, which is not desirable.
Flowchart of a RewriteRule directive that rests on website.com’s httpd.conf file.
- Solr’s admin panel becomes exposed from the reverse proxy.
RewriteRule ^/solr/$ / [R=301,L,DPI]
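To see which requests the rule above catches, here is a minimal JavaScript sketch that applies the same pattern to a few request paths. Apache evaluates the rule itself; this only mimics the pattern match. The helper name `isRedirectedByRule` and the sample paths are assumptions for illustration.

```javascript
// Same pattern as the RewriteRule above: matches the path "/solr/" exactly.
const ADMIN_PATTERN = /^\/solr\/$/;

// Hypothetical helper: true when a request path would be redirected to "/".
function isRedirectedByRule(path) {
  return ADMIN_PATTERN.test(path);
}

// Sample paths (assumptions for illustration).
const samples = ["/solr/", "/solr/select", "/solr/rockies/select"];
for (const path of samples) {
  const outcome = isRedirectedByRule(path) ? "redirected to /" : "passed through";
  console.log(path, "->", outcome);
}
```

The browser never sends the `#/` fragment to the server, so a request for the admin panel at `/solr/#/` arrives as `/solr/`, which is the one path this pattern matches. Query endpoints under `/solr/` are left alone.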
You are encouraged to secure your Solr instance by placing the application on a separate server, behind a firewall. That becomes a problem if you are trying to consume data from the Solr instance with AJAX.
Flowchart of a reverse proxy directive that rests on website.com’s httpd.conf file.
- www.website.com and Apache Solr live on separate boxes.
- The firewall protecting Apache Solr, together with the browser’s cross-domain restriction, means the endpoint we need to consume via AJAX is not exposed.
- Depending on your sysadmin’s setup, Solr may not live on a fully qualified domain (e.g. http://184.108.40.2069:8983/solr/#/)
- An AJAX call to consume the Solr instance’s JSON/XML won’t work cross-domain.
- Reverse Proxy directive, mod_proxy – Apache HTTP Server
- This provides an endpoint that is visible to the browser, so we can consume the JSON/XML served by the Solr instance.
ProxyPass /solr http://220.127.116.119:8983/solr/#/
ProxyPassReverse /solr http://18.104.22.1689:8983/solr/#/
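With the proxy in place, the browser can request Solr’s JSON from the same origin. The following is a minimal sketch, assuming the `select` request handler is reachable directly under the proxied `/solr` path and that the runtime supports `fetch`. The helper names `buildSelectUrl` and `searchSolr` are hypothetical.

```javascript
// Build a same-origin URL for Solr's select handler behind the /solr proxy.
// Assumption: the handler lives at "/solr/select"; a multi-core setup would
// include the core name in the path.
function buildSelectUrl(query, rows = 10) {
  const params = new URLSearchParams({ q: query, wt: "json", rows: String(rows) });
  return "/solr/select?" + params.toString();
}

// Fetch results through the proxy and return the matching documents.
async function searchSolr(query) {
  const response = await fetch(buildSelectUrl(query));
  if (!response.ok) {
    throw new Error("Solr request failed with status " + response.status);
  }
  const data = await response.json();
  // Solr's JSON response keeps the matching documents under response.docs.
  return data.response.docs;
}
```

Because the request goes to `/solr` on the site’s own domain, the cross-domain restriction no longer applies.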
Don’t forget to apply a RewriteRule Directive to protect the Solr admin panel, once you’ve exposed it to the browser!
I’ve implemented Ben Alman’s simple-proxy.php to communicate with an Apache Solr instance (in this case, my local one) outside of my domain.
I’ve followed the instructions in full, the core of which is to place simple-proxy.php on my domain’s file server.
Are there any modifications that must be made to the proxy for the response to come back in the correct format?
View on Stackoverflow.
In the fall of 2013, my team was tasked with R&D on integrating a search solution for the University of the Rockies. Starting from the ground up, we pursued an open-source search server, Apache Solr. After hours of vetting a workflow and experimenting, we created a search product that not only serves Rockies but can also be extended to other web properties owned by the Marketing Group.
Some key points we considered were the following:
- Search results…what type of results should we expose?
- Crawling and indexing…how do we crawl our domain and index our results?
- Web security…what standards do we need to put in place, given that our search server is open source?
- Third party dependencies…can we bring application ownership in-house?
- Future maintenance…what is our SOP and response time as the domain’s content changes?
- Technology Services protocols…what moving pieces are pertinent to change management guidelines, etc.?
The official release of UoR search went live in December 2013, and continuous improvements are slated throughout the year, so stay tuned. For now, feel free to explore this feature at www.rockies.edu.
While reviewing the key/value structure of JSON, I came across this discussion on parsing JSON with hyphenated key names and thought the same would hold true for mine. That said, I’ve adapted the Stackoverflow suggestion slightly to use underscores instead of dots in the field names, and came up with the following:
<!-- For schema.xml on Nutch and Solr -->
<field name="metatag_description" type="text_general" stored="true" indexed="true"/>
<field name="metatag_keywords" type="text_general" stored="true" indexed="true"/>
<!-- For solrindex-mapping.xml on Nutch -->
<field dest="metatag_description" source="metatag.serptitle"/>
<field dest="metatag_keywords" source="metatag.serpdescription"/>
This was implemented on Nutch 1.7 on a Solr 4.5.0 instance.
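The reason underscores help on the JavaScript side is that a key such as `metatag_description` is a valid identifier and works with dot syntax, while a dotted or hyphenated key needs bracket notation. Here is a small sketch, using a made-up document shaped like one entry in Solr’s JSON response. The field values and the dotted key are assumptions for illustration.

```javascript
// Hypothetical document, shaped like one entry of response.docs in Solr's JSON output.
const doc = {
  "metatag_description": "Sample description",
  "metatag.keywords": "sample, keywords",
};

// Underscored key: plain dot syntax works.
const description = doc.metatag_description;

// Dotted (or hyphenated) key: bracket notation is required.
const keywords = doc["metatag.keywords"];

// Dot syntax on the dotted key looks up doc.metatag first, which does not exist.
const missing = doc.metatag ? doc.metatag.keywords : undefined;

console.log(description, "|", keywords, "|", missing);
```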
Please refer to the following for context:
- Extracting HTML meta tags in Nutch 2.x and having Solr 4 index it
- Parsing JSON with hyphenated key names
- Nutch – Parse Metatags
I’m currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after running the crawl, indexing to Solr 4.5, and validating the results on the front end with AJAX Solr, the search results page lists pages under both ‘www’ and non-‘www’ URLs, such as:
My understanding is that the URL filter, regex-urlfilter.txt, needs modification. Are there any regex/Nutch experts who could suggest a solution?
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# accept anything else
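The first-match behaviour described in the comments above can be sketched in JavaScript. This is not Nutch’s implementation, only an illustration of the matching logic. The rule list, the `example.com` host and the helper name `isAccepted` are assumptions. Replacing the final accept-anything rule with one that accepts only the `www` host is one possible way to keep non-www URLs out of the crawl.

```javascript
// Hypothetical rules, in file order. accept: true stands for "+", false for "-".
const rules = [
  { accept: false, pattern: /^(file|ftp|mailto):/ },
  { accept: true, pattern: /^https?:\/\/www\.example\.com\// },
];

// The first matching rule decides; a URL that matches no rule is ignored.
function isAccepted(url) {
  for (const rule of rules) {
    if (rule.pattern.test(url)) {
      return rule.accept;
    }
  }
  return false;
}

console.log(isAccepted("http://www.example.com/about"));
console.log(isAccepted("http://example.com/about"));
console.log(isAccepted("mailto:someone@example.com"));
```

Only the first URL is accepted. The non-www URL matches no rule and is ignored, and the `mailto:` URL is rejected by the first rule.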
Also on Stackoverflow and pastebin.
I’ll be publishing documentation here as well as on GitHub, showing how to set up an Apache Solr instance, crawl and index a website with Apache Nutch, and finally integrate those results into the front end with AJAX Solr.
For now, here’s a list of resources which have proven to be helpful thus far:
Regarding my post on Stackoverflow, my resolution to this problem was to update search.js to read the query from the window.location object:
//Old code - from reuters.js example
//Custom query by end-user for my search.js file
var userQuery = window.location.search.replace( "?query=", "" );
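The string replacement above leaves the value URL-encoded and assumes `query` is the only parameter. A slightly more defensive sketch follows, assuming a browser that supports `URLSearchParams`. The helper name `getUserQuery` and the fallback to Solr’s match-all query `*:*` are illustrative additions, not part of the original search.js.

```javascript
// Hypothetical helper: read the "query" parameter from a location.search string.
// Falls back to Solr's match-all query "*:*" when the parameter is missing or empty.
function getUserQuery(search) {
  const value = new URLSearchParams(search).get("query");
  return value ? value : "*:*";
}

// In the browser this would be called as getUserQuery(window.location.search).
console.log(getUserQuery("?query=online+degrees"));
console.log(getUserQuery(""));
```

`URLSearchParams` decodes `+` and percent-escapes, so the value can be passed to the query store without further cleanup.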
Regarding my post on Stackoverflow, I pointed my crawl and index commands at the location of my collection. In this case:
$ bin/nutch crawl urls -solr http://localhost:8983/solr/rockies -depth 1 -topN 5
$ bin/nutch solrindex http://localhost:8983/solr/rockies crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Additionally, I set -depth to 1 (how many link levels to follow from the starting URLs; in this case, one link from the main page) and -topN to 5 (the maximum number of documents retrieved at each level).