I spent today tweaking and fixing issues (such as hash collisions causing data loss in the indexes, and word offsets not being added correctly). Perhaps more importantly, I added monitoring to the backend using my monitoring framework! This is the first time I've really been able to test the SearchEngine properly, and here is how it looks:
I've blurred out the contents because it's currently indexing the RuneScape knowledge base (hey, since I wrote Jagex's SearchTool libraries, I figured I'd use the same dataset for SearchEngine!). The data was all legally obtained by scraping their website with this little Bash script:
i=0
while [ $i -lt 9999 ]; do
    wget "http://www.runescape.com/kbase/viewarticle.ws?article_id=$i" -O $i.html
    i=`expr $i + 1`
done
So, how does it perform? Pretty well! It's not as fast as the implementation I wrote for Jagex (we used custom libraries, which were a lot faster), but it's pretty close:
# Hash Fields: 2
*  - Word count: 983
*  - Word count: 10605
# Trie Fields: 2
*  - Word count: 1249
*  - Word count: 25428
# Average search time: 4ms
# Longest search time: 17ms
# Shortest search time: 0ms
# Search Count: 9
After only 9 searches, the average is down to 4ms (the longest time is always the first search, since the JVM does initialisation when processing it). I believe the average search times at Jagex were on the order of 1.5ms, but I can't guarantee that figure. Not bad, though!
Today I realised that I have forgotten a pretty crucial part of the implementation - search logging! So that's what I'll be working on next.
I finished my search engine yesterday. It didn't go as well as planned. The String Trie that I had written functioned fine, but as soon as it was loaded (with 40,000+ nodes) it became intolerably slow. I'm blaming this on Java's collection implementations, because I had no such issues when I created the search engine for RuneScape. This means that I have omitted the trie from the implementation, meaning that no wildcarded searches can be performed (not a _huge_ loss, but still frustrating). It still supports phrase searches, stray/stemmed searches and "NOT" terms, along with result highlighting. My original implementation for Jagex took about 4 weeks to complete, so I'm quite pleased that it only took me 2 days this time around. Maybe if I feel the need for wildcarded searches I will write my own collection implementations, but I'd rather not deviate from the Java libraries too much!
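For illustration, here's a minimal sketch of how "NOT" terms can be handled over an inverted index. All the names here are mine (this is not the actual SearchEngine code), and it ignores phrases, stemming and highlighting entirely:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index supporting "NOT" terms (written as "-term" in the query).
public class NotTermSketch {
    // word -> set of document ids containing it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
        }
    }

    // Plain terms are intersected; terms prefixed with '-' exclude any
    // document that contains them.
    public Set<Integer> search(String query) {
        Set<Integer> results = null;
        Set<Integer> excluded = new HashSet<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            if (term.startsWith("-")) {
                excluded.addAll(index.getOrDefault(term.substring(1), Set.of()));
            } else {
                Set<Integer> docs = index.getOrDefault(term, Set.of());
                if (results == null) results = new HashSet<>(docs);
                else results.retainAll(docs);
            }
        }
        if (results == null) results = new HashSet<>();
        results.removeAll(excluded);
        return results;
    }
}
```

With documents 1 ("rune scape quest") and 2 ("rune scape skill") indexed, the query "rune -skill" matches only document 1.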
So, what next? I'm not too sure. I think I'll add web-based monitoring to the SearchEngine so I can view which words are indexed, average search times, number of searches etc... After that I might toy around with a system to send data in UDP packets between programs (kind of like RMI but simpler).
As a shot in the dark, I tried upping the heap space available to the JVM, and this fixed the slowness when indexing with the StringTrie! I've just spent the past hour or so putting the StringTrie back in place and making persistence more efficient. I can do wildcarded searches on the entire works of Shakespeare in 19 milliseconds now! Initial indexing takes a little while (under a minute for all of Shakespeare's work); reloading from a persisted state takes about 9 seconds.
My next project is to write a search engine (Why? Who knows - I just thought it might be something useful to have in a library/portfolio). I've already begun on the development; so far I have a "StringTree" which lets me load in objects that encapsulate a result and a string. The object is indexed against the string. For example, if we had an object that contained the string "pinecones" and added it to the tree, nodes for 'p', 'i', 'n', 'e', 'c', 'o', 'n', 'e', and 's' would be made. The final node 's' is where the object would reside. With this structure one can search for whole strings easily, as it is just a matter of traversing nodes and their children until we reach a dead end or stop on a node. If there is a dead end, there is no result. If we stop on a node and it holds an object, then we have a result!

What's cool is that you can also do trailing wildcard searches with this simple structure. For example, "pin*" would traverse the tree to 'p', 'i', 'n' then ask the 'n' node to "get me all of your results (if any) and all of your children's results (if any)". So it might get "pine", "pinecone", "pinecones".

Finally, what if we want to do leading wildcard searches? Well, another index needs to be added to the root of the tree. The index required is one that points to ALL nodes in the tree, indexed by the character itself. For example, we have a collection of all 'a' nodes, all 'b' nodes, all 'c' nodes etc... Then, if we search for "*ine", what we're actually saying is "start from all 'i' nodes". So for each indexed 'i' node we traverse to 'n' then 'e' and aggregate the results; thus we might find "wine", "dine" and "pine". Simple, no?
This may sound like quite a slow system having to traverse so many links, however I wrote a simple tester and got the following results:
Load time: 210ms
Loaded objects: 108729
Search time: 30ms
So it takes 210ms to create 108729 objects and index them and a further 30ms to search for "*" thus traversing ALL nodes and returning ALL results. Pretty fast!
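As a rough sketch of the tree described above (hypothetical names, and simplified: the real StringTree stores arbitrary result objects, while this one just stores the strings themselves and covers only exact and trailing-wildcard lookups):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal character-trie sketch: one node per character, with the complete
// string parked at the final node.
public class StringTrieSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        String result; // non-null only where a complete string ends
    }

    private final Node root = new Node();

    // Index a string: walk/create one node per character.
    public void add(String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.result = s;
    }

    // Exact lookup: a dead end, or stopping on a node that holds nothing,
    // means no match.
    public boolean contains(String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;
        }
        return n.result != null;
    }

    // Trailing-wildcard search ("pin*"): walk to the prefix node, then
    // collect its result and every descendant's result.
    public List<String> prefixSearch(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }

    private void collect(Node n, List<String> out) {
        if (n.result != null) out.add(n.result);
        for (Node child : n.children.values()) collect(child, out);
    }
}
```

Leading wildcards would need the extra per-character node index described above, which is omitted here for brevity.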
I've also taken and cleaned up the Snowball stemming system (taking out the needless use of reflection - yuck) and put it into my libraries. This will make searching better for single "stray" terms, since "horse", "horses" and "horsing" will all resolve to the same stem of "hors" and so will essentially be treated as the same word. It will also reduce the memory overhead, as fewer objects in total will be required when indexing.
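To show the idea (this is emphatically not the Snowball algorithm, just a toy suffix-stripper I made up to illustrate why stemming shrinks the index):

```java
// Toy stemmer: collapses several surface forms of a word onto one stem,
// so the index needs only one entry for all of them.
public class StemSketch {
    public static String stem(String word) {
        // Order matters: try the longest suffix first. The length check
        // keeps at least four characters of stem, so short words survive.
        String[] suffixes = {"ing", "es", "e", "s"};
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("horse"));   // → hors
        System.out.println(stem("horses"));  // → hors
        System.out.println(stem("horsing")); // → hors
    }
}
```

Real Snowball handles far more morphology than this, but the payoff is the same: three index entries become one.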
(No idea what my fascination with pinecones is, sorry!)
So I've spent the past week developing a modular system monitoring interface for my software that plugs into SlXHTTPd. Since I already had my bot partially written, I decided it would be a good place to try it out. Here's how it looks. The interface also supports paginating data with a rotating page selection carousel.
I've refactored the script compiler out of the HTTPd and made it into a generic in-memory compiler, so that SlXBot can take advantage of it and recompile/reload its modules on-the-fly without rebooting and losing connection.
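For the curious, the standard javax.tools API supports this kind of in-memory compilation. The sketch below is my own illustration of the general technique, not SlXHTTPd's actual compiler (all names are hypothetical): compile a source string and capture the class bytes in memory instead of writing .class files to disk.

```java
import javax.tools.*;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Compiles a Java source string entirely in memory using the JDK's
// built-in compiler API (requires a JDK, not a bare JRE).
public class MemCompiler {
    // Receives compiled class bytes instead of a file on disk.
    static class MemClassFile extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        MemClassFile(String name) {
            super(URI.create("mem:///" + name + ".class"), Kind.CLASS);
        }
        @Override public OutputStream openOutputStream() { return bytes; }
    }

    // Returns a map of binary class name -> class bytes (empty on failure).
    public static Map<String, byte[]> compile(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        Map<String, MemClassFile> output = new HashMap<>();

        // Redirect compiler output into MemClassFile objects.
        JavaFileManager fm = new ForwardingJavaFileManager<StandardJavaFileManager>(
                compiler.getStandardFileManager(null, null, null)) {
            @Override
            public JavaFileObject getJavaFileForOutput(Location loc, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                MemClassFile f = new MemClassFile(name);
                output.put(name, f);
                return f;
            }
        };

        // Wrap the source string as a virtual .java file.
        JavaFileObject src = new SimpleJavaFileObject(
                URI.create("string:///" + className + ".java"),
                JavaFileObject.Kind.SOURCE) {
            @Override public CharSequence getCharContent(boolean ignore) {
                return source;
            }
        };

        compiler.getTask(null, fm, null, null, null, List.of(src)).call();

        Map<String, byte[]> classes = new HashMap<>();
        output.forEach((n, f) -> classes.put(n, f.bytes.toByteArray()));
        return classes;
    }
}
```

Loading the resulting bytes is then a matter of a small ClassLoader whose defineClass call is fed from the returned map, which is what makes reload-without-restart possible.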
Aside from my IRC bot (SlXBot), I've been working on a new version of my HTTPd. Originally, my HTTPd was called "ShAPAChE", which seemed a little corny. It was coded in C++ as a university project (it was barely object-oriented because, sadly, my lecturer on the subject was not a great OO programmer - more of a hacker). Anyway, since then I have finished university and had some real-world experience with OO programming in Java... so I decided to re-write it in, yes, Java! It's going very well. I've hit several milestones:
It can run PHP and CGI,
It has its own language, which is a mixture of HTML and Java and is compiled dynamically in memory when a page is requested, meaning that pages do not have to be precompiled,
It has speed limiting and connection limits,
POST works fully,
Directory listings work,
Inbound header sizes can be limited to avoid people posting large data to the server,
It has full logging,
It has configurable mimetypes,
It has a configurable ban list,
Options are configurable via a config file,
It supports keep-alive,
It supports chunked transfer encoding...
I plan to add a few more features to it such as vhosts, .htaccess support and .htpasswd support (and maybe even gzip support if I find myself with nothing to do). After this, I may consider dumping Apache and using SlXHTTPd full-time!
The architecture of SlXHTTPd is pretty interesting. By itself, it can be run as a stand-alone server. However, it can also be "plugged" into any Java project which can communicate "Arguments" to the HTTPd to be rendered or used in the processing of a page. A plan I have for the future is to integrate SlXHTTPd with SlXBot so that statistics may be viewed in a web browser along with channel logs etc...
This leads me onto my next project which will be a way for programs to communicate without needing to be linked (i.e. they will not need to be part of the same program). Instead, data could be communicated between programs using UDP packets. Anyway, more on that later!
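The core of that idea is only a few lines with java.net's datagram classes. Here's a minimal sketch (names and the round-trip helper are mine; a real RMI-like layer would add serialisation and request/response matching on top):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Smallest possible demo of two programs exchanging a UDP datagram,
// here simulated with two sockets on loopback in one process.
public class UdpSketch {
    public static String roundTrip(String msg) {
        try (DatagramSocket receiver = new DatagramSocket(0); // any free port
             DatagramSocket sender = new DatagramSocket()) {
            byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getLoopbackAddress(), receiver.getLocalPort()));

            byte[] buf = new byte[512];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            receiver.setSoTimeout(2000); // fail fast rather than hang forever
            receiver.receive(in);        // blocks until the datagram arrives
            return new String(in.getData(), 0, in.getLength(),
                    StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

Unlike RMI there's no delivery guarantee with UDP, which is part of what keeps it simpler, and part of what the eventual design will have to account for.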