Announcing Couch Crawler, a CouchDB search engine/crawler

Hi! So, for fun, I made couch-crawler, a search engine and crawler on top of the very excellent couchdb-lucene. I wanted to create a hackable search engine for my work intranet using modern tools. Lucene is great, but the Nutch search engine/crawler was kind of annoying to work with. I couldn’t figure out how to get it to update the search indexes without a restart of the server, which sucks. Also, I just really, really like CouchDB.

There’s no real web tier: CouchDB hosts the static JavaScript/HTML/CSS files, and the UI gets built up dynamically with AJAX calls to CouchDB. It’s kind of nice to be able to cut out a whole layer of glue code.

Templating is done with mustache.js, a JavaScript templating language that does a good job of being a dumb template language, making you define your presentation logic in JavaScript, where it should be.

On the indexing side of things, there’s a crawler written in Python that pulls down HTML, parses it with BeautifulSoup, extracts useful text content to be indexed, then follows links within the page to a specified max depth. It probably could be smarter and parallel-er, but I wanted to start with a simple design and iterate on it.
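To give a rough idea of that loop, here is a minimal sketch of a depth-limited crawler. It is not the actual couch-crawler code: the names `PageParser`, `fetch` and `crawl` are made up for illustration, it uses the standard library's `html.parser` instead of BeautifulSoup so it runs with no dependencies, and it returns plain dicts rather than saving documents to CouchDB.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, visible text, and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.text_parts, self.links = [], [], []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch):
    """Breadth-first crawl from start_url, following links up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    docs = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        docs.append({
            "url": url,
            "title": "".join(parser.title_parts).strip(),
            "contents": " ".join(parser.text_parts),
        })
        if depth < max_depth:
            for href in parser.links:
                # Resolve relative links and drop #fragments before deduping.
                link = urldefrag(urljoin(url, href)).url
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return docs
```

Something like `crawl("http://intranet.example/", 2)` (a placeholder URL) would return one dict per page reached within two link hops. The `fetch` argument can be swapped out, which makes it easy to try the loop against canned pages.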

The couchdb-lucene indexer indexes the title, URL and contents, and stores the first 140 characters of the contents in the index to display with search results.
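Real couchdb-lucene index functions are JavaScript living in a design document, so the following is only a Python illustration of which fields get picked out of a crawled page. The function name `to_index_fields` and the `snippet` key are hypothetical, and the document shape (`title`, `url`, `contents`) is an assumption.

```python
def to_index_fields(doc, snippet_length=140):
    """Pick the searchable fields from a crawled page, plus a short stored snippet."""
    contents = doc.get("contents", "")
    return {
        "title": doc.get("title", ""),
        "url": doc.get("url", ""),
        "contents": contents,
        # Only this short prefix needs to be kept for display in the results list.
        "snippet": contents[:snippet_length],
    }
```

The point of the split is that the full contents only need to be searchable, while the short snippet is what actually comes back with each hit.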

Ch-ch-check it out and let me know what you think.

P.S. If you use Homebrew for your OS X packaging needs, I have a fork of Homebrew with a couchdb-lucene formula for easy installation.

http://syntacticbayleaves.com/2010/01/17/announcing-couch-crawler-a-couchdb-search-enginecrawler/