Hardened Metasearch

Jan. 1, 2024 [hardening] [privacy-security] [guides] [libre] [technology]

All too many resources suggest using smaller privacy respecting search engines, such as DuckDuckGo, to avoid Google’s search monopoly. While this mitigates the issue of tailored results and feeding big tech, it is only a first step. With those alternative search engines, one still has to trust that they are not selling data, logging, or tracking in some other capacity. Consider that many of these “privacy respecting” alternatives have also been found snooping, while others have been bought out by advertisers.

But what if you didn’t need to trust the engine responding to your queries on the other end? With the right tools, it is possible to build a trustless, distributed search portal!

We can probably assume anyone reading this will already have a trustworthy computer and web browser as well as some familiarity with the terminal and with configuring software. If not, there will be plenty more in hardening posts to come.

If you do not already have Tor installed, now is a good time to install it through apt. We will also need some other prerequisites. Searx is a metasearch engine which can hook into external engines to conduct queries. While there are public instances of Searx hosted around the web, we will want to run our own locally:

apt install searx python3-socks

Copy the configuration file into place and rewrite the placeholder key:

cp -p /usr/share/doc/searx/examples/settings.yml /usr/lib/python3/dist-packages/searx/
sed -i -e "s/ultrasecretkey/$(openssl rand -hex 16)/g" /usr/lib/python3/dist-packages/searx/settings.yml
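
As a quick sanity check, here is a small sketch that reports whether the stock placeholder is still sitting in the file. The function name placeholder_key_present is made up for illustration, and the path is the one used in the steps above; adjust it if your package installs elsewhere.

```shell
# Hypothetical helper: succeeds (exit 0) if the stock placeholder
# string is still present in the given settings file.
placeholder_key_present() {
    grep -q 'ultrasecretkey' "$1"
}

# Path taken from the steps above; this is an assumption about your install.
SETTINGS=/usr/lib/python3/dist-packages/searx/settings.yml

# Only report when the file is actually readable on this machine.
if [ -r "$SETTINGS" ]; then
    if placeholder_key_present "$SETTINGS"; then
        echo "placeholder key still present, re-run the sed command"
    else
        echo "secret key has been replaced"
    fi
fi
```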

Open the config file at /usr/lib/python3/dist-packages/searx/settings.yml and locate the section for proxy information. Set Tor's SOCKS5 port as the only proxy.

    proxies:
        https:
            - socks5://localhost:9050
    using_tor_proxy : True
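
Before relying on this setting, it may be worth confirming that Tor is actually answering on port 9050. The sketch below assumes curl is installed and that the Tor Project's check service at https://check.torproject.org/api/ip returns a small JSON object containing an "IsTor" field; both function names are hypothetical.

```shell
# Hypothetical helper: reads the JSON reply from the check service on stdin
# and succeeds only if it reports "IsTor":true.
reply_is_tor() {
    grep -q '"IsTor"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical helper: query the check service through the local Tor SOCKS port.
# --socks5-hostname makes curl resolve DNS through the proxy as well,
# so the lookup itself does not leak outside of Tor.
check_tor_proxy() {
    curl -s --max-time 30 --socks5-hostname localhost:9050 \
        https://check.torproject.org/api/ip | reply_is_tor
}

# Usage:
#   check_tor_proxy && echo "traffic exits via Tor" || echo "not using Tor"
```

If the check fails, make sure the tor service is running before blaming Searx.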

I recommend proxying images as well to avoid leaking data.

image_proxy : True

Try to select only a handful of search engines to keep active. Using too many could create opportunities for adversaries that partner or share data to correlate search requests. Comment out or delete the rest. And since this is being routed through Tor, don’t feel obligated to avoid large engines like Bing. They will only see a request originating from some exit node.
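
To see what is still active after pruning, something like the following sketch can list every uncommented engine entry. It assumes entries are written as "- name : something", which is how the stock settings.yml lays them out; the function name list_engines is hypothetical. Engines that carry a "disabled : True" line are still printed, and so is any other "- name" entry the file happens to contain, so treat the output as a rough inventory.

```shell
# Hypothetical helper: print the name of every uncommented "- name : ..." entry.
# Lines starting with '#' are skipped because the pattern is anchored to
# leading whitespace followed directly by the list dash.
list_engines() {
    sed -n 's/^[[:space:]]*-[[:space:]]*name[[:space:]]*:[[:space:]]*//p' "$1"
}

# Usage:
#   list_engines /usr/lib/python3/dist-packages/searx/settings.yml | sort
```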

Increase the timeout value on any engine you select by a few seconds, otherwise Searx may time out those queries before they finish traversing slower Tor circuits. As of Searx 1.0.0, there is a global extra timeout that can be set for use when proxying through Tor.

    extra_proxy_timeout : 10.0 # Extra seconds to add in order to account for the time taken by the proxy

Create a systemd unit file to control the Searx service:

touch /etc/systemd/system/searx.service
chmod 664 /etc/systemd/system/searx.service

Edit the new file at /etc/systemd/system/searx.service to include:

[Unit]
Description=Searx metasearch engine
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/searx-run
ExecReload=/usr/bin/kill

[Install]
WantedBy=multi-user.target

Finally enable and start the Searx daemon:

systemctl daemon-reload
systemctl enable searx.service
systemctl start searx.service

Now when you launch your browser, you should be able to navigate to the local address at http://localhost:8888.
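
If the page does not load, a terminal check can help narrow down whether the daemon is listening at all. This is a minimal sketch assuming curl is installed; the function names are made up for illustration, and the default URL is the local address given above.

```shell
# Hypothetical helper: turn an HTTP status code into a short verdict.
# curl prints 000 for %{http_code} when it received no HTTP response at all.
classify_status() {
    case "$1" in
        200) echo "up" ;;
        000) echo "unreachable" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Hypothetical helper: request the front page and classify the result.
check_searx() {
    url="${1:-http://localhost:8888/}"
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)
    classify_status "$code"
}

# Usage:
#   check_searx
#   systemctl status searx.service    # if the verdict is "unreachable"
```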

Search results will list the source engine underneath each result so you can get a sense of the types of results produced by each engine. There’s just no hiding these super secret pancake recipes!

Ideally, Searx is only available directly to you, on your own machine, unless you make it available over the network through Nginx or Apache. Let’s take a broader look at what has been assembled:

Figure: Searx over Tor overview

Configured this way, Searx will send search queries as POST requests, which limits the identifying data received by the recipient engines. Parties resolving the queries will not even see the originating IP, just some random request arriving from an IP associated with the Tor network. Your ISP will also no longer be able to infer when or to whom you have conducted a search. Results censored by one engine are unlikely to be censored by all of your other engine choices, making for a censorship-resistant solution. Lastly, enjoy your new freedom from the chilling effect, that ominous, ever-present uncertainty of being watched. Well, at least for your web searches.

If you’d like to go a step further, consider bringing even the search index onto your own turf by running a local YaCy instance. Searx even ships a YaCy engine template for pushing queries to a locally running YaCy instance.

  - name : yacy
    engine : yacy
    shortcut : ya
    base_url : 'http://localhost:8090'
    enable_http: True # required if you aren't using HTTPS for your local yacy instance
    number_of_results : 5
    timeout : 3.0

Update Q4 2024: It appears that search vendors have grown wise to this kind of querying and have unanimously blocked Tor since early to mid 2023. The configuration described here may still yield the occasional result, but the majority of searches will time out with unreachable engines. It has become clear that metasearch engines are not the way forward, suffering from the perpetual frontend dilemma.