Building a search engine from scratch for fun and... expense?
By Thomas Pain
This post has the following tag: diysearch
I've decided that one of my summer projects will be to build a search engine and accompanying crawler/spider (I'll end up calling it both from time to time) from scratch, because I feel like it, while also turning it into a series of posts on this blog.
This specific post is going to be dedicated to planning the project out - defining the components, the architecture, the database schemas, and some other stuff.
But first - we need a name, because the name is the most important part of the project! I'm calling my search engine Surchable, which I lifted from an Amazon Original series called Undone. In S1E6, one of the characters, San, is using a search engine called... Surchable! I stole the name.
As part of this project, I want to:
Accordingly, there's also some anti-aims. I don't want to:
Cache-Control headers, rate limiting and the like)
There's going to be three main components to Surchable.
Let's start with the crawlers, of which we'll run multiple instances at the same time.
/ route (or another starting point specified by the coordinator), scanning that for hyperlinks and collecting data about that page, then scanning all other pages it can find based on those hyperlinks by repeating the same process.
Before it requests any page, it'll check in with the coordinator to make sure that this page hasn't been scanned recently to avoid hammering a given site.
After every page load, the crawler will submit information about a page to the coordinator. When the entire job is finished, the crawler will tell that to the coordinator too, and then move on to processing another job.
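The crawl loop described above can be sketched roughly like this. It's a minimal sketch, not the final crawler: it uses only the standard library rather than BeautifulSoup, it assumes a job stays on a single host, and `fetch`, `already_scanned` and `submit` are hypothetical callables standing in for the HTTP request and the two coordinator calls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute, fragment-free links on the same host, de-duplicated in page order."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        # Assumption: skip mailto:, javascript: and so on, and stay on one host.
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def crawl(start_url, fetch, already_scanned, submit, max_pages=100):
    """Breadth-first crawl from start_url.

    fetch(url) -> HTML string, already_scanned(url) -> bool and
    submit(url, links) are hypothetical stand-ins for the real HTTP
    request and coordinator calls.
    """
    queue = deque([start_url])
    queued = {start_url}
    while queue and max_pages > 0:
        url = queue.popleft()
        if already_scanned(url):  # ask the coordinator before requesting
            continue
        html = fetch(url)
        max_pages -= 1
        links = extract_links(url, html)
        submit(url, links)  # report this page to the coordinator
        for link in links:
            if link not in queued:
                queued.add(link)
                queue.append(link)
```

Passing the network and coordinator calls in as arguments keeps the loop itself easy to test against a fake site held in a dictionary.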
But what happens if the crawler crashes midway through a job? Suppose we add a rule to the coordinator saying that the crawler working on any in-progress job must check in at least once every x minutes. If a crawler doesn't fulfil this, we assume it's gone offline for some reason and release the job to be assigned to another crawler.
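To make the check-in rule concrete, here's a small in-memory sketch of that kind of job lease. It's written in Python purely for illustration (the coordinator itself is planned in Go), and `JobBoard`, its method names and the `lease_seconds` parameter are all made up for this example. The real thing would presumably keep this state in the database.

```python
import time


class JobBoard:
    """In-memory stand-in for the coordinator's job table (hypothetical)."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # the "x minutes" from the rule
        self.clock = clock                  # injectable so tests can fake time
        self.pending = []                   # jobs waiting for a crawler
        self.in_progress = {}               # job -> (crawler_id, last_check_in)

    def add_job(self, job):
        self.pending.append(job)

    def release_stale(self):
        """Move jobs whose crawler has gone quiet back to the pending list."""
        now = self.clock()
        for job, (_crawler, last_seen) in list(self.in_progress.items()):
            if now - last_seen > self.lease_seconds:
                del self.in_progress[job]
                self.pending.append(job)

    def claim(self, crawler_id):
        """Hand the oldest pending job to a crawler, or None if there is none."""
        self.release_stale()
        if not self.pending:
            return None
        job = self.pending.pop(0)
        self.in_progress[job] = (crawler_id, self.clock())
        return job

    def check_in(self, crawler_id, job):
        """Refresh the lease; returns False if the crawler no longer owns the job."""
        self.release_stale()
        owner = self.in_progress.get(job)
        if owner is None or owner[0] != crawler_id:
            return False
        self.in_progress[job] = (crawler_id, self.clock())
        return True

    def finish(self, crawler_id, job):
        """Mark a job as done; returns False if the lease was already lost."""
        if not self.check_in(crawler_id, job):
            return False
        del self.in_progress[job]
        return True
```

One consequence of this design is that a crawler that comes back after its lease expired gets `False` from `check_in`, so it knows to abandon the job rather than submit results for something another crawler now owns.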
The coordinator itself will be the thing building and updating the master search index and page metadata tables, as well as managing job coordination, as mentioned earlier. Without going into huge detail about its inner workings, there's not a lot to say about it.
You can kinda see the whole process that we talk about above in this Swimlanes diagram.
Finally, the web UI. This is the only thing a user can see and interact with - it will allow someone to input a query and get a list of results out, or submit a URL for scanning. It'll have to perform result ranking, which (at least, for now) I plan to do using a combination of a relevancy filter and a bastardised version of the PageRank algorithm that was (is?) used by Google.3
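For reference, the textbook form of PageRank can be sketched in a few lines of power iteration. This is the standard algorithm, not the modified version Surchable will end up with, and the damping factor of 0.85 and the fixed iteration count are conventional defaults I'm assuming here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for page, targets in links.items():
            if targets:
                # Each page splits its rank evenly between the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The intuition is that a link counts as a vote, and votes from highly ranked pages are worth more, so a page linked to by many well-linked pages ends up near the top.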
The relevancy stuff is going to be done with some fancy maths involving frequencies and logarithms that I'll go into in more detail in the future.
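"Frequencies and logarithms" sounds a lot like TF-IDF, so here's a minimal sketch of that as one possible shape for the relevancy score. To be clear, treating it as TF-IDF is an assumption on my part for the sake of an example, and the whitespace tokenisation and the function name `tf_idf_scores` are just illustrative.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query.

    documents maps a document id to its text. A term scores highly in a
    document when it is frequent there (tf) but rare overall (idf).
    """
    tokenised = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores[doc_id] = score
    return scores
```

Note that with this plain form of idf, a term that appears in every document contributes nothing, since log(1) is 0. That's usually what you want for words like "the".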
I've heard Golang called "the programming language for the internet" - it excels when used for writing networking or web applications, and for that reason it's going to be my primary language for this project. It'll be used to build the server for the web UI and the coordinator. This also allows core libraries, like the database access libraries, to be easily shared across the different applications that make up the search engine.
The actual crawlers will be implemented in Python, due to the maturity of libraries like BeautifulSoup and Selenium or Playwright for Python.4 I haven't yet decided if I want to use instances of headless Chrome and similar for the scrapers, or if I want to just request plain HTML and process that. I'm leaning towards the latter since it's considerably simpler and since it shouldn't matter if we don't have JS support for what we're doing, but we'll have to wait and see.
The database will be built around a database server, likely PostgreSQL. Typically, I'd use a single-file database, such as SQLite3, but this isn't feasible when both the web UI and the coordinator will be accessing the same database at the same time.
I plan to make a series of ~4 blog posts about Surchable, this one included. Development won't be starting for a couple of weeks yet due to real-life goings on, but this is going to be my summer project. I have both an Atom feed and a JSON feed if you want to subscribe for future posts. If you've got any questions at the moment, shoot me an email or send me a friend request on Discord (check my homepage for details of both) and I'll be more than happy to chat with you.