Have you ever wondered how search engines like Google find stuff?
At least I did, and what does a person with little time on hand and exams on the way do? Begin a new project, exactly!
Since I like writing Rust 1 I told myself: build your own search engine in Rust. At least the crawler and API. So I got to work, actually finished a basic prototype a couple of days ago, and today I added showing a description in the search results. That is why I am writing this: most of the search is finished.
My plan
Before beginning such a project one should always make some kind of plan that is better than: 1. build search engine, 2. idk, 3. profit!
So I actually did plan quite a bit. Firstly I wanted to define what kind of search I would build, since that would basically decide whether I worked on it for a couple of weeks or for some years (I am a professional idiot, so that time frame is realistic). I decided on one of the most basic kinds of search: TF-IDF. What that is I will describe later, but rest assured it is neither fancy nor particularly good (partly also my implementation). Then I thought about how I would crawl the interwebs, since that is actually harder than you think. Further, I didn't want to send too many requests to webpages, and I didn't want to ignore the robots.txt file, since I want to build a nice bot 2. This made it a little more complex.
Then the crawled data needs to be stored somehow, in a database. But as I said, I didn't have a lot of free time, so I decided to just use postgres 3. For the static parts of the frontend, caddy looked perfectly fine, especially since it made configuration quite easy and offered easy reverse proxying too.
Everything is its own container, so search-api, database, webserver and crawler are separated. This also helps with orchestrating what starts when, since the api and crawler expect the db to be up and will otherwise just quit; what else should they do?
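For what it's worth, that start ordering can be sketched compose-style. The service names and images here are just illustrative, not my actual setup:

```yaml
services:
  db:
    image: postgres:16
  search-api:
    build: ./search-api
    depends_on: [db]   # start after the db, since the api quits without it
  crawler:
    build: ./crawler
    depends_on: [db]
  webserver:
    image: caddy:2
    ports: ["80:80"]
    depends_on: [search-api]
```

(depends_on only orders container startup; the api and crawler still quit on their own if the database is not actually reachable yet.)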
That was my basic plan, at least the short version hehe
TF-IDF
Term frequency (TF) and inverse document frequency (IDF) is one of the simplest ways of implementing a keyword search: you literally just look at how often a keyword comes up in a document versus the document's full length, and at how many documents contain that keyword compared to the total number of documents. This is then added up and voilà, you have a search ordering. It is not good: you don't weigh the importance of the links pointing to a page, you don't consider that words can have different meanings in different contexts, and so on. But for just having a small search it is totally fine. Since I was interested in search algorithms I looked into others as well, but decided to stick with this for now for simplicity.
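Written out with the textbook definitions (my implementation may normalize slightly differently):

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{tf}(t,d)\,\mathrm{idf}(t)
```

where f_{t,d} is how often term t occurs in document d, |d| is the length of d in words, N is the total number of documents, and n_t is the number of documents containing t.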
Implementation
Every page will be crawled for unique words and how often they occur in the document, which will be described in more detail below. Now that data lies in a database and can be queried. So the search API will find all results for each keyword and calculate the idf for each keyword. Then for each result it will calculate the tf. The tf and idf for each term and page are multiplied, and the results for all keywords are added up. So if I search “cat food”, it will multiply the tf and idf of cat for each result and add the product of the tf and idf of food on top. This is then sorted and returned.
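The scoring step can be sketched like this; `Hit` and `score` are made-up names for illustration, and ln(total/matching) is the textbook idf, which my actual code may normalize a bit differently:

```rust
use std::collections::HashMap;

// One index entry for a keyword: which page it is on, how often it occurs
// there, and how many words that page has in total.
struct Hit {
    page: u32,
    count: f64,
    doc_len: f64,
}

// Score pages for a multi-keyword query as described above: idf once per
// keyword, tf per matching page, multiply, and sum the products per page.
fn score(query_hits: &[Vec<Hit>], total_docs: f64) -> HashMap<u32, f64> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for hits in query_hits {
        let idf = (total_docs / hits.len() as f64).ln();
        for h in hits {
            let tf = h.count / h.doc_len;
            *scores.entry(h.page).or_insert(0.0) += tf * idf;
        }
    }
    scores
}

fn main() {
    // "cat" is on pages 1 and 2, "food" only on page 1; 10 documents total.
    let cat = vec![
        Hit { page: 1, count: 3.0, doc_len: 100.0 },
        Hit { page: 2, count: 1.0, doc_len: 50.0 },
    ];
    let food = vec![Hit { page: 1, count: 2.0, doc_len: 100.0 }];
    let scores = score(&[cat, food], 10.0);
    // Page 1 matches both keywords, so it outranks page 2.
    assert!(scores[&1] > scores[&2]);
    println!("page 1: {:.4}, page 2: {:.4}", scores[&1], scores[&2]);
}
```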
The results are returned as JSON, since that makes parsing in the web interface much easier, and I implemented a basic paging mechanism so I don't send hundreds or thousands of results to the frontend each time. This is basically just a page number that indicates which slice of 25 results from the total result list should be returned.
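The paging is as simple as it sounds; a minimal sketch (function name made up):

```rust
// Basic paging: page p maps to the slice [p*25, p*25 + 25) of the sorted
// result list, clamped so out-of-range pages just return an empty slice.
const PAGE_SIZE: usize = 25;

fn page_slice<T>(results: &[T], page: usize) -> &[T] {
    let start = (page * PAGE_SIZE).min(results.len());
    let end = (start + PAGE_SIZE).min(results.len());
    &results[start..end]
}

fn main() {
    let results: Vec<u32> = (0..60).collect();
    println!("{}", page_slice(&results, 0).len()); // 25
    println!("{}", page_slice(&results, 2).len()); // 10
    println!("{}", page_slice(&results, 5).len()); // 0
}
```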
Web API stuff
For the webserver, lazy as I am, I used Rocket.rs 4 and added just a single endpoint, /api/search, which expects ?q= to be set and accepts an optional &p=. These two query parameters I took from most other search engines, whereby q is short for query and p stands for page.
This data will then just be forwarded to the search logic, and the JSON returned will be sent back to the client, since Rocket supports returning JSON.
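The route itself is tiny; a sketch of what it looks like, assuming Rocket 0.5 with its `json` feature (the `SearchResult` fields and `run_search` stub are placeholders for the real search logic):

```rust
#[macro_use] extern crate rocket;
use rocket::serde::{Serialize, json::Json};

#[derive(Serialize)]
#[serde(crate = "rocket::serde")]
struct SearchResult {
    url: String,
    description: String,
    author: String,
}

// Stub standing in for the actual search logic.
fn run_search(_q: &str, _page: usize) -> Vec<SearchResult> {
    Vec::new()
}

// GET /api/search?q=cat+food&p=2 — q is required, p is optional.
#[get("/search?<q>&<p>")]
fn search(q: &str, p: Option<usize>) -> Json<Vec<SearchResult>> {
    Json(run_search(q, p.unwrap_or(0)))
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/api", routes![search])
}
```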
But there is more. I teased earlier that I would have a static part containing the web UI, which is served by caddy, and it is the reason why I put the search on /api/. Since Caddy can proxy specific paths, I told it to forward /api/* to rocket and serve static pages for all other requests. This makes the whole thing much simpler, and the frontend can easily be replaced or discarded altogether.
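A Caddyfile doing exactly this split could look like the following; the hostname, document root and Rocket port are placeholders:

```
example.com {
    # Everything under /api/ goes to the Rocket app...
    handle /api/* {
        reverse_proxy localhost:8000
    }
    # ...everything else is served from the static frontend.
    handle {
        root * /srv/frontend
        file_server
    }
}
```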
Crawler
The crawler is a bit fancier than the search, since it is actually multithreaded, and that was built by me, although Rust's Mutexes and Arcs really helped. This was the easiest multithreading work I've done so far.
Okay, so how does it work? Basically the crawler is told how many threads it is allowed to use and spawns that many threads, which then request pages. The links found in those pages are put into a list, if they have not already been crawled (to know what has been crawled, there is a second list). When a crawling thread finishes its work on a page, it requests a new link from this list. If there is none, it first checks whether another thread is still working, since that could add more links in the future; if not, it terminates itself.
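That loop can be sketched with just std; `fake_fetch_links` stands in for the real HTTP request and link extraction, and all names are illustrative rather than my actual code:

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Shared crawl state: the frontier of links still to visit, the set of
// everything ever seen, and how many threads are mid-page right now.
struct CrawlState {
    frontier: VecDeque<String>,
    seen: HashSet<String>,
    busy: usize,
}

fn crawl(seed: &str, n_threads: usize) -> usize {
    let state = Arc::new(Mutex::new(CrawlState {
        frontier: VecDeque::from([seed.to_string()]),
        seen: HashSet::from([seed.to_string()]),
        busy: 0,
    }));

    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let state = Arc::clone(&state);
            thread::spawn(move || loop {
                // Grab the next link; pop + busy-count happen under one lock.
                let next = {
                    let mut s = state.lock().unwrap();
                    match s.frontier.pop_front() {
                        Some(url) => {
                            s.busy += 1;
                            Some(url)
                        }
                        // Empty frontier: only quit if no other thread is
                        // still working, since it might add more links.
                        None if s.busy == 0 => return,
                        None => None,
                    }
                };
                let Some(url) = next else {
                    thread::sleep(Duration::from_millis(1));
                    continue;
                };
                // Stand-in for the HTTP request + link extraction.
                let found = fake_fetch_links(&url);
                let mut s = state.lock().unwrap();
                for link in found {
                    if s.seen.insert(link.clone()) {
                        s.frontier.push_back(link);
                    }
                }
                s.busy -= 1;
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let n = state.lock().unwrap().seen.len();
    n
}

// Hypothetical "web": page/N links to page/N+1, up to page/20.
fn fake_fetch_links(url: &str) -> Vec<String> {
    let n: usize = url.rsplit('/').next().unwrap().parse().unwrap_or(0);
    if n < 20 {
        vec![format!("page/{}", n + 1)]
    } else {
        Vec::new()
    }
}

fn main() {
    // 21 unique pages: page/0 through page/20.
    println!("{}", crawl("page/0", 4)); // 21
}
```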
The web page is requested only after checking the robots.txt file. Since the threads request pages in random order at the moment, the robots.txt is loaded each time, to make sure that even in longer runs the file has not changed, and to spare my RAM from filling up with all those stored files. In the future I may improve this a little by assigning a list of domains to each thread, so each can keep a robots.txt for a while before having to re-request it. But that is a problem for future me.
Now when it has the HTML of the page, it does a couple of things. Firstly it extracts all links it can find, then it also gets the author and description. If it cannot find the appropriate tags, it just puts in a predefined value that indicates a missing result, something like “No author found.”, which can then be returned by the API and displayed (or not displayed) by the web UI.
The most important step though is the extraction of keywords. For this it first converts the whole body of the HTML to markdown, since that makes all the text easier to parse. After that it strips all special characters, leaving only spaces between words. 5
With one long string, the crawler can now split keywords at spaces and count them. It keeps a list of already found keywords and their number of occurrences; each time it finds a keyword again, it adds one, and if it finds a keyword that is not in the list, it adds it with a count of one. This list is then added to the database. Each keyword gets its own entry in the table “index”, which makes searching through it much easier.
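The cleaning and counting step, sketched with just std (`char::is_alphanumeric` is a close stand-in for the \W regex, which additionally keeps underscores):

```rust
use std::collections::HashMap;

// Replace everything that is not a word character with a space,
// then count how often each keyword occurs.
fn count_keywords(text: &str) -> HashMap<String, u32> {
    let cleaned: String = text
        .chars()
        // Lowercasing is an addition here so "Cat" and "cat" count together.
        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { ' ' })
        .collect();
    let mut counts: HashMap<String, u32> = HashMap::new();
    for word in cleaned.split_whitespace() {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_keywords("Cat food: the best cat food!");
    println!("{}", counts["cat"]);  // 2
    println!("{}", counts["food"]); // 2
}
```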
Web UI
The Web UI is mostly some HTML and CSS styling everything, and then some JS that sends a request to the server when enter is pressed in the search input.
The returned JSON is then parsed, and for each result a new div is added to the search result div, each containing a link to the page, the description and the author. All styling is done by CSS, as it should be.
- For those 2 people on the internet interested in programming who were not already forced by some nerd to learn what rust is: https://rust-lang.org/ ↩︎
- If you want to block my crawler: its user agent will always be minisearch/version, where the current version is just ‘0.1.0’ :) ↩︎
- Which I regretted later, since I am used to the SQL dialect of MySQL, which differs from that of Postgres. But all in all Postgres is nice. ↩︎
- To be precise, I use \W to find all non-word characters with a regex and replace them with spaces. ↩︎