28 Nov 2011

Dissecting the Duck – DuckDuckGo and web search in general

Submitted by blizzz

Have you heard of a search engine called DuckDuckGo? I have been aware of it for, I think, well over a year. Recently I have (subjectively) noticed a rising number of posts related to DuckDuckGo, especially in Free and Open Source Software circles (while I was writing this post, Linux Mint announced that it will adopt DuckDuckGo as its default search engine). It looks like more and more people are searching for an alternative to Google that does not track its users' data. I have had a critical view of DuckDuckGo before and still do, though it undoubtedly has some positive facets.

DuckDuckGo

To take away some illusions beforehand: DuckDuckGo is neither free nor open source software, but proprietary, and only some parts are open-sourced. Although I have not had a deeper look into the source, it very much looks like the ranking algorithm, for example, is not open. They only stress the importance of inbound links, which is hopefully not too strong a ranking factor...

Nevertheless, the best way to get good rankings (in pretty much all search engines) is to get links from high quality sites like Wikipedia. Source

And unfortunately they have (or communicate) neither an identi.ca nor a Diaspora account.

Can DuckDuckGo be an alternative for Google?

We should have a look at what a search engine actually is. Let's keep it very simple: there are two main tasks. First: crawling websites. A search engine has to scan the web continuously. As we know, the web is huge and growing, which makes crawling a large, expensive and tedious task. Second: delivering results for a query. These results should be good. Really good, meaning they should answer the query and be free of spam. This is what ranking algorithms are for.
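To make the two tasks a bit more concrete, here is a toy sketch in Python. Everything in it – the mini corpus, the URLs, the word-count scoring – is invented for illustration; a real engine fetches pages over HTTP, follows their links and ranks with far more elaborate signals.

    # A toy illustration of the two tasks - not a real search engine.
    # The mini corpus and the scoring are made up for demonstration.
    from collections import defaultdict

    # Task 1: "crawling", here simulated with a tiny hand-made corpus.
    # A real crawler fetches pages over HTTP and follows their links.
    pages = {
        "https://example.org/kubuntu": "kubuntu is an ubuntu flavour with kde",
        "https://example.org/gnome":   "gnome is a desktop environment",
        "https://example.org/kde":     "kde ships plasma and many applications",
    }

    # Build an inverted index: word -> set of URLs containing it.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)

    # Task 2: answering a query. "Ranking" here is just the number of
    # query words a page matches; real engines use far more signals.
    def search(query):
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, ()):
                scores[url] += 1
        return sorted(scores, key=scores.get, reverse=True)

    print(search("kubuntu kde"))  # pages matching most query words first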

[Screenshot: first results for "kubuntu" on DuckDuckGo]

There are not many such (general purpose) web search engines. Actually, I am aware of only three (leaving DuckDuckGo aside). There are of course Google and Bing, which do both jobs. There is the small, US-focused search engine Blekko. Once there was Yahoo, but its results are now served by Bing. Maybe Ask.com does some crawling; I am not too sure about that. Nevertheless: this is the full list of search engines with a significant number of users.

Then there are many, many search services out there that use the index of other search engines. For example, in Germany the search portals of T-Online and Web.de are relatively popular; both of them use Google. There are also search services that incorporate more than one search engine index. They are called meta search engines.

DuckDuckGo takes a hybrid approach. They have their own crawler and also add crowd-sourced websites to their index. Overall, they say that they make use of over 50 sources, including Bing, WolframAlpha and Blekko. So basically they put everything they get into one pot and make something out of it.
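How that pot-stirring works is not public, but one plausible sketch is rank fusion: collect ranked lists from several sources and reward URLs that rank high in, or appear across, many of them. The source functions below are invented stand-ins, not DuckDuckGo's actual backends.

    # A sketch of result merging via reciprocal-rank fusion. The two
    # source functions are made-up stand-ins for real backends.

    def own_crawler_results(query):
        return ["https://a.example", "https://b.example", "https://c.example"]

    def bing_results(query):
        return ["https://b.example", "https://d.example", "https://a.example"]

    def merge(query, sources):
        # A URL scores higher the nearer it is to the top of each list;
        # appearing in several sources adds its scores up.
        scores = {}
        for source in sources:
            for rank, url in enumerate(source(query), start=1):
                scores[url] = scores.get(url, 0.0) + 1.0 / rank
        return sorted(scores, key=scores.get, reverse=True)

    print(merge("kubuntu", [own_crawler_results, bing_results]))
    # b.example comes first: it ranks high in both lists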

Crawling is simple and still the main problem

Using such a load of sources has the advantage of finding the largest number of websites, at least in theory. My assumption is that sites which a) want to be indexed and b) are good and thus get shared should find their way into any index sooner rather than later. You may collect more sites, but not of remarkable quality. Still, using a number of sources is cheap and you get everything you want. And still it is the Achilles' heel of DuckDuckGo: they are dependent on Microsoft's Bing. Doing the crawling yourself is in a way redundant, but it is the foundation of a reliable search engine.

The company behind DuckDuckGo describes the problem very well itself:

It costs so much that even big companies like Yahoo and Ask are giving up general crawling and indexing. Therefore, it seems silly to compete on crawling and, besides, we do not have the money to do so. Instead, we've focused on building a better search engine by concentrating on what we think are long-term value-adds – having way more instant answers, way less spam, real privacy and a better overall search experience. Source

This fact highlights a big issue in today's search engine landscape: the search foundations of our society are in the hands of two US companies, Microsoft and Google. The starting point for censorship and manipulation is right there. (Side note: there are additionally Yandex in Russia and Baidu in China for other character sets, but we still have a centralized infrastructure.)

So, is it an alternative?

You are not directly tracked by DuckDuckGo or the search engines it uses, which is an advantage. But you can also use e.g. Google without being tracked, for example via your browser's Private Browsing mode, or simply by deactivating cookies or allowing only session cookies for the service in question. DuckDuckGo also has some nifty features (called bangs), but most of them are included in one way or another in the other search engines. If you are satisfied with the results, then yes, DuckDuckGo is an alternative. Keep in mind that by using DuckDuckGo you also use Microsoft's Bing... ;)
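For those who have not seen bangs: a query like "!w kubuntu" jumps straight to the Wikipedia search for "kubuntu". The sketch below shows the basic mechanic; the bang table is a made-up excerpt, not DuckDuckGo's actual list or implementation.

    # A minimal sketch of bang-style shortcuts. The table of bangs and
    # target URLs is illustrative only.
    from urllib.parse import quote_plus

    BANGS = {
        "!w":  "https://en.wikipedia.org/wiki/Special:Search?search={}",
        "!yt": "https://www.youtube.com/results?search_query={}",
    }

    def resolve(query):
        # Return a redirect URL for a known bang, or None to fall back
        # to a normal search.
        bang, _, rest = query.partition(" ")
        target = BANGS.get(bang)
        if target and rest:
            return target.format(quote_plus(rest))
        return None

    print(resolve("!w kubuntu"))
    # https://en.wikipedia.org/wiki/Special:Search?search=kubuntu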

In my opinion, DuckDuckGo is neither an ideal nor a long-term solution.

What does the free world of tomorrow need? YaCy to the rescue?

A (general purpose) web search engine is an important part of the Internet. If you look for something, it is usually the first starting point of your (re)search. Since it is such an important pillar, we should look for a solution that is independent of companies and states and follows a distributed architecture. In social media there are already good alternatives, like status.net for microblogging or Diaspora as a general-purpose social network. We can run them ourselves on our own servers and connect to the whole community.

A technology like this would be very desirable for search, too. In fact, there is already YaCy, which is used primarily by larger organizations, though for web search it has its weaknesses*. Another interesting idea might be to design a search feature for distributed services like status.net and Diaspora. People already post a lot of information there, including URLs. Could we detect the important stuff with a crowdsourced approach, by indexing only the webpages that users post? And add ranking factors based on social signals, so that spammers do not stand a chance? A rough sketch of this idea follows below.
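To make the idea tangible: harvest URLs from users' posts and rank a link by how many distinct users shared it, so a single spammer reposting the same URL gains nothing. The posts and the scoring heuristic are, of course, invented for illustration.

    # A sketch of crowdsourced indexing: only URLs posted by users enter
    # the index, ranked by the number of *distinct* users sharing them.
    import re
    from collections import defaultdict

    posts = [
        ("alice",   "great intro to kde: https://kde.example/start"),
        ("bob",     "reading https://kde.example/start right now"),
        ("spammer", "buy now https://spam.example https://spam.example"),
        ("spammer", "really, buy now https://spam.example"),
    ]

    URL_RE = re.compile(r"https?://\S+")

    shared_by = defaultdict(set)
    for user, text in posts:
        for url in URL_RE.findall(text):
            shared_by[url].add(user)  # a set: repeats by one user count once

    for url in sorted(shared_by, key=lambda u: len(shared_by[u]), reverse=True):
        print(len(shared_by[url]), url)
    # 2 https://kde.example/start   (shared by two people)
    # 1 https://spam.example        (one spammer, however loud)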

* Update: YaCy has now released version 1.0 of their P2P search software (German). In contrast to their former strategy, they now actively promote the possibility of web search using the YaCy network and offer a public search page, which had not been done before, iirc. This move is highly exciting – to the extent that even the Free Software Foundation Europe is spreading the news, and its president Karsten Gerloff highlights it in a separate post. I look forward to seeing whether the software and service get adopted and can convince with rising quality.

Comments

It's a shame you forgot to mention ixquick, the best search engine out there.

ixquick is yet another meta search engine; I do not see the point in making a list full of those. Plus, it looks neither free nor open.

Another potentially interesting project is Seeks: http://www.seeks-project.info/

Yes, Seeks is using Bing and Google...

Sure, but the idea is not to re-crawl the web but to share results and recommendations collaboratively. Seeks allows you to retrieve only Seeks results, without a call to other engines.

For sure, starting from zero they need an existing source database to provide the first results. It seems they don't want to rely on commercial search engines forever, however; take a look at the last paragraph of their roadmap: http://seeks-project.info/wiki/index.php/Roadmap
