June 20 2019 04:29:51
· Home
· CV
· Articles
· Links
· News Categories
· Media Gallery
· Search


Forgotten your password?
Request a new one here.
My Next Project...
SoftwareMy next project is to write a search engine (Why? Who knows - I just thought it might be something useful to have in a library/ portfolio). I've already begun on the development, so far I have a "StringTree" which lets me load in objects that encapsulate a result and a string. The object is indexed against the string. For example, if we had an object that contained the string "pinecones" and added it to the tree, nodes for 'p', 'i', 'n', 'e', 'c', 'o', 'n', 'e', and 's' would be made. The final node 's' is where the object would reside. With this structure one can search for whole strings easily as it is just a matter of traversing nodes and their childs until we reach a dead end or stop on a node. If there is a dead end, there is no result. If we stop on a node and it has a child, then we have a result! What's cool is that you can also do trailing wild card searches with this simple structure. For example "pin*" would traverse the tree to 'p', 'i', 'n' then ask the 'n' node "get me all of your results (if any) and all of your children's results (if any). So it might get "pine", "pinecone", "pinecones". Finally, what if we want to do leading wild card searches? Well, another index needs to be added to the root of the tree. The index required is one that points to ALL nodes in the tree but indexed upon character itself. For example, we have a collection of all 'a' nodes, all 'b' nodes, all 'c' nodes etc... Then, if we search for "*ine" what we're actually saying is "start from all 'i' nodes". So for each indexed 'i' node we traverse to 'n' then 'e' and aggregate the results thus we might find "wine", "dine" and "pine". Simple, no?

This may sound like quite a slow system having to traverse so many links, however I wrote a simple tester and got the following results:

Load time: 210ms
Loaded objects: 108729
Search time: 30ms
Results: 108729

So it takes 210ms to create 108729 objects and index them and a further 30ms to search for "*" thus traversing ALL nodes and returning ALL results. Pretty fast!

I've also taken and cleaned up the Snowball stemming system (taken out the needless use of reflection - yuck) and put it into my libraries. This will make searching better for single "stray" terms as "horse", "horses" and "horsing" will all resolve to the same stem of "hors" so will essentially be treated as the same word. It will also reduce the memory overhead as less objects in total will be required when indexing.

(No idea what my fascination with pinecones is, sorry!)
760,339 unique visits