Category Archives: Web Experiments

The Friendship Algorithm in JavaScript

I’m a big fan of The Big Bang Theory and I couldn’t resist playing around with Sheldon’s friendship algorithm. It’s essentially a single object representing Sheldon, organized with the module pattern (thanks to Addy Osmani).

I encourage you to fork this code and get some zen coding in. Since the logic is “safely” protected by Sheldon, you should be able to integrate any boilerplate of your choice for some fun interfaces. Also, feel free to refactor the logic when needed. It could always be better. That said, kudos to Wolowitz for plotting out the loop counter and escape.
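As a rough sketch of the idea (the names below are illustrative, not the actual code), the module pattern keeps the friendship state private in a closure while Sheldon exposes only the public API:

```javascript
// A minimal sketch of the module pattern: the loop counter stays private
// in the closure, and only the public method leaves it. Names and strings
// here are my own stand-ins, not the real implementation.
var Sheldon = (function () {
  var attempts = 0;            // private loop counter
  var MAX_ATTEMPTS = 3;        // Wolowitz's escape threshold

  function askAboutInterest(interest) {
    attempts += 1;
    if (attempts > MAX_ATTEMPTS) {
      // escape the infinite loop
      return 'Perform a least objectionable activity.';
    }
    return 'Tell me more about ' + interest + '.';
  }

  // Only the public API is returned; "attempts" is unreachable from outside.
  return { askAboutInterest: askAboutInterest };
})();
```

This is what makes the logic “safely” protected: consumers can call `Sheldon.askAboutInterest(...)` but cannot tamper with the counter directly.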

The Friendship Algorithm

You can also view the demo here.

Simple PHP Proxy returns incorrect JSON from Apache Solr instance

I’ve implemented Ben Alman’s simple-proxy.php to communicate with an Apache Solr instance (in this case, my local one) outside of my domain.

I’ve followed the instructions in full, the core of which is to place simple-proxy.php on my domain’s web server.

I’m curious whether any modifications must be made to the proxy in order for the response to arrive in the correct format.
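If I recall the proxy’s default behavior correctly, it wraps the remote response rather than passing it through raw, so the Solr JSON arrives as a string inside a `contents` property and must be unwrapped on the client. A sketch of that assumption:

```javascript
// Assumed shape of the proxy's default (wrapped) response: the remote
// body is a string in data.contents, with status metadata alongside it.
// The literal below is a stand-in for what the AJAX call would receive.
var data = {
  status: { http_code: 200 },
  contents: '{"response":{"numFound":1,"docs":[{"id":"doc1"}]}}'
};

// Unwrap: parse the inner string to get the actual Solr JSON object.
var solrResponse = JSON.parse(data.contents);
console.log(solrResponse.response.numFound); // 1
```

If that wrapping is the mismatch, the proxy’s raw/native mode (or unwrapping like the above) may be the modification needed.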

View on Stackoverflow.

PageSpeed: here.ashford.edu

Page speed seems to be one of the bigger priorities as of late, and I was assigned the R&D task of making one of our subdomains, here.ashford.edu, a bit…faster. In summary, this is a single-page design intended to support our Marketing campaigns during the 2014 Winter Olympics.

In one sprint (two weeks in my case), we were able to set some benchmarks, reduce our network requests, optimize our images, minify our CSS and JavaScript, and compress and cache our assets by tweaking our server-side configurations.
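The compression and caching tweaks on the server side can be sketched roughly as the following Apache configuration (a hedged example; our actual directives and cache lifetimes varied by asset type and host):

```
# Gzip-compress text assets on the way out (requires mod_deflate)
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>

# Far-future cache headers for static assets (requires mod_expires)
<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType text/css "access plus 1 month"
  ExpiresByType application/javascript "access plus 1 month"
</IfModule>
```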

The before benchmarks are listed below for 03/19/2014 unless otherwise stated:

  1. Google PageSpeed – Mobile: 55/100
  2. Google PageSpeed – Desktop: 71/100
  3. Y-SLOW: 67/100
  4. Network Requests: 112 (03/21/2014)

The after benchmarks are listed below for 03/27/2014:

  1. Google PageSpeed – Mobile: 64/100
  2. Google PageSpeed – Desktop: 83/100
  3. Y-SLOW: 79/100
  4. Network Requests: 87

Tools and methods I used to complement our page speed on the server side:

Tools and methods I used to complement our page speed on the client side:

I still have some items to solve, YouTube iframes ultimately being the culprit. Getting under the 1-second threshold on mobile is another challenge. I suspect a JavaScript solution is on the horizon, but we shall see.
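One likely shape of that JavaScript solution (a sketch of the common lazy-load pattern, not what we shipped): render only a thumbnail at first and build the heavy iframe on click. The URL helpers are pure so they stay testable; the DOM wiring is illustrative and browser-only.

```javascript
// Pure helpers: given a YouTube video id, build the thumbnail and embed
// URLs. Keeping these pure makes the swap logic easy to unit test.
function youtubeThumbUrl(videoId) {
  return 'http://img.youtube.com/vi/' + videoId + '/hqdefault.jpg';
}

function youtubeEmbedUrl(videoId) {
  return 'http://www.youtube.com/embed/' + videoId + '?autoplay=1';
}

// Illustrative browser wiring: replace a lightweight placeholder element
// with the real iframe only when the user clicks it.
function activate(placeholder, videoId) {
  var iframe = document.createElement('iframe');
  iframe.src = youtubeEmbedUrl(videoId);
  placeholder.parentNode.replaceChild(iframe, placeholder);
}
```

Until activation, the page pays only for one image request per video instead of the full iframe payload.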

Crawl Metatags with Nutch 1.7

In regards to the Stackoverflow recommendation on enabling the metatag plugin, I came across a roadblock when I had to merge this solution into my integration of AJAX Solr. Unfortunately, taking the recommendation at face value caused a JavaScript undefined error when accessing the meta tag key/value pair from the JSON object. Because the recommendation chained metatag.description with dot syntax, JavaScript interpreted metatag as an object that did not exist.

While reviewing the key/value structure of the JSON, I came across this discussion on parsing JSON with hyphenated key names and figured the same would hold true for mine. That said, I augmented the Stackoverflow suggestion slightly to use underscores instead of dot syntax and came up with the following:


<!-- For schema.xml on Nutch and Solr -->
<field name="metatag_description" type="text_general" stored="true" indexed="true"/>
<field name="metatag_keywords" type="text_general" stored="true" indexed="true"/>

<!-- For solrindex-mapping.xml on Nutch -->
<field dest="metatag_description" source="metatag.serptitle"/>
<field dest="metatag_keywords" source="metatag.serpdescription"/>

This was implemented on Nutch 1.7 on a Solr 4.5.0 instance.
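To see why the underscore rename fixes the undefined error, compare the two key styles on the client (field values here are made up for illustration):

```javascript
// With an underscore field name, dot syntax works: the key is a plain
// top-level property.
var doc = { 'metatag_description': 'A page about trains.' };
console.log(doc.metatag_description); // 'A page about trains.'

// With a dotted field name, dot syntax breaks: doc2.metatag is read as a
// nested object that doesn't exist, so doc2.metatag.description throws.
// Only bracket syntax can reach the key.
var doc2 = { 'metatag.description': 'A page about trains.' };
console.log(doc2['metatag.description']); // 'A page about trains.'
console.log(doc2.metatag);                // undefined
```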

Please refer to the following for context:

  1. Extracting HTML meta tags in Nutch 2.x and having Solr 4 index it
  2. Parsing JSON with hyphenated key names
  3. Nutch – Parse Metatags

Frustrations excluding urls without ‘www’ from Nutch 1.7 crawl

I’m currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.

Specifically, after firing the crawl, indexing to Solr 4.5, and validating the results on the front end with AJAX Solr, the search results page lists results/pages under both ‘www’ and non-www URLs, such as:


www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html

My understanding is that the URL filtering (regex-urlfilter.txt) needs modification. Are there any regex/Nutch experts who could suggest a solution?


# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 
 
# The default url filter.
# Better for whole-internet crawling.
 
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
 
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
 
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
 
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
 
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
 
# accept anything else
+^http://([a-z0-9]*\.)*mywebsite\.com/

Also on Stackoverflow and pastebin.
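One candidate fix (an assumption on my part, not a confirmed answer): tighten the accept rule so only the www host matches, since Nutch takes the first matching pattern. In regex-urlfilter.txt that would read `+^http://www\.mywebsite\.com/`. The pattern can be sanity-checked quickly as a JavaScript RegExp:

```javascript
// Same pattern as the proposed regex-urlfilter.txt accept rule,
// expressed as a JS RegExp for quick testing against sample URLs.
var acceptRule = /^http:\/\/www\.mywebsite\.com\//;

console.log(acceptRule.test('http://www.mywebsite.com/page1.html')); // true
console.log(acceptRule.test('http://mywebsite.com/page1.html'));     // false
```

With non-www URLs rejected at crawl time, the duplicate pairs should disappear from the index (a redirect from non-www to www at the server would be the complementary fix).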

Resources for Solr 4.5, Nutch 1.7 and AJAX Solr

I’ll be publishing documentation here as well as on Github showing how to set up an Apache Solr instance, crawl and index a website with Apache Nutch, and finally integrate those results into the front end with AJAX Solr.

For now, here’s a list of resources that have proven helpful thus far:

Success integrating AJAX Solr with Solr 4.5

In regards to my post on Stackoverflow, my resolution to this problem was to update search.js and check the window.location object:


// Old code - from the reuters.js example
Manager.store.addByValue('q', '*:*');

// Custom query by the end user, for my search.js file
var userQuery = window.location.search.replace("?query=", "");
Manager.store.addByValue('q', userQuery);
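One caveat I’d add to this fix (an assumption about how the query arrives, not part of the original resolution): the query string is URL-encoded, so decoding it before handing it to the Manager is safer. A sketch, using a literal stand-in for window.location.search:

```javascript
// Stand-in for window.location.search; spaces arrive as '+' from form
// submission and other characters arrive percent-encoded.
var search = '?query=ajax+solr%204.5';

// Strip the parameter name, restore '+' to spaces, then percent-decode.
var userQuery = decodeURIComponent(
  search.replace('?query=', '').replace(/\+/g, ' ')
);
console.log(userQuery); // 'ajax solr 4.5'
```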