
The 20-Point Checklist for Search Engines

There are many search engines available on the market. Although most people understand what a search engine does and have experience in using search engines, there is a big difference between using a search engine and selecting and running your own search engine for your business.

A search engine user only needs to know how to specify queries, which is, after all, rather simple. However, when it comes to selecting a search engine for your application, you need a basic understanding of the technology and must know much more than how to write queries: you need to know how good a search engine is in terms of information collection (crawling or spidering), speed, scalability, reliability, and flexibility, to name just a few.

This document discusses twenty points which we believe you, as a decision maker, should know about search engines.

DON'T BUY A SEARCH ENGINE WITHOUT FIRST STUDYING THE CHECKLIST!

The Fundamentals

1. What platforms do the search engine and spider run on? Is it portable?

Search engines require a lot of code for network and file access, which is often written using native operating system support to improve speed. As such, search engines are difficult to port across platforms. With the rapid evolution of computer hardware, you don't want to tie yourself to a particular hardware/software platform.

2. What programming language is it written in? Is it internet/web enabled? Don't give me Fortran!

This is not a technical question just for the engineers. A good programming language is an important factor in the quality, portability, and customizability of a software system. Programming languages evolve very fast as well. In the last three decades, we have seen mainstream programming languages move from Fortran/Cobol to C/C++ and now Java. The newer programming languages have much better support for engineering large-scale, reliable and extensible software: object-oriented programming methodology, design patterns, APIs for all essential network, data and enterprise programming, and integrated support for many required features such as multi-threading and Unicode. These languages are available on all of the advanced operating systems, and software developed in them can better utilize the advanced resource sharing and scheduling mechanisms of the operating system.

Although search engines are new products, many search engines (or re-dressed ones) may still use code developed twenty years ago for the most critical data access components. Likewise, systems developed more than ten years ago are likely based on non-standard C and are even more likely to use third-party or customized code to handle multithreading, Unicode and process scheduling.

In the fast-growing Internet world, unfortunately, older is not necessarily better.

Index and Search Speeds

3. What is the scalability of the system? Can it exploit multiple servers and CPUs?

If you run Internet and Intranet portals, you would welcome the pleasant surprise of usage being higher than you expected. Can you improve the search performance by simply adding more main memory, more CPUs or more servers? If the search engine is written as a single-threaded process, it is unlikely to take advantage of multiple CPUs. The search engine needs to be multi-threaded. For even higher workloads, you want to be able to partition the index database across multiple servers and have the results from several servers combined into a single result set. Note that this is different from replicating the index database on several servers and distributing search requests among the servers.
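The difference between partitioning and replication can be made concrete with a minimal sketch. It assumes three hypothetical in-memory index partitions (SHARDS), each holding scored hits for different documents; a real engine would query remote servers instead. Every partition is searched in parallel and the partial results are merged into one ranked result set.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical partitions: each maps a term to (score, doc_id) pairs
# for the documents that this partition alone is responsible for.
SHARDS = [
    {"tourism": [(0.9, "doc1"), (0.4, "doc7")]},
    {"tourism": [(0.7, "doc3")]},
    {"tourism": [(0.8, "doc5"), (0.2, "doc9")]},
]

def search_shard(shard, term):
    """Return one partition's hits for the term, best score first."""
    return sorted(shard.get(term, []), reverse=True)

def search_all(term, top_k=3):
    """Query every partition in parallel and merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), SHARDS))
    # Each partial list is already sorted descending, so a k-way merge suffices.
    merged = heapq.merge(*partials, reverse=True)
    return [doc_id for _, doc_id in islice(merged, top_k)]

print(search_all("tourism"))
```

With replication, by contrast, every server would hold all five documents and a single server would answer each request on its own.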

Scalability is equally important for the spider and indexer, especially if you are running a search portal. Unlike search engines, spiders are more difficult to scale up. Since spiders spend most of their time on network and data operations, adding CPUs doesn't improve performance much. Running more than one spider, whether on a single machine or on several machines, poses problems as well: while search engines don't have to coordinate between individual searches, spiders have to be coordinated so that they don't crawl the same page more than once. Furthermore, coordinating and synchronizing updates on different index databases is not an easy task at all.
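One simple way to keep several spiders from crawling the same page is to give each host to exactly one spider. The sketch below is only an illustration of that idea, not a description of any particular product; the function name assign_spider and the choice of MD5 as the hash are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def assign_spider(url, num_spiders):
    """Map a URL to a spider number so that one host always goes to one spider."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_spiders

# Both pages live on the same host, so the same spider receives both.
a = assign_spider("http://www.example.com/index.html", 4)
b = assign_spider("http://www.example.com/news/today.html", 4)
print(a == b)
```

Because the assignment depends only on the host name, the spiders need no shared "already seen" list for this purpose. The cost is that changing the number of spiders reassigns most hosts.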

4. What about the speed of the spider? Don't forget to ask for insertion speed!

Search speed is of course essential. However, insertion speed is equally important for Internet portals because each day thousands and thousands of web pages are newly created or updated. The spider has to index each and every one and insert the index data into the index database. Do you know that inserting a single document containing 1000 words could be 2000 times more expensive than executing a single-word query? Yes, it could! If the index is not designed to support a large number of updates, you will see insertion time grow exponentially as the number of pages in the index database increases.
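Why is insertion so much more expensive than a query? A toy inverted index shows the basic reason. This is a minimal sketch with a hypothetical InvertedIndex class that simply counts how many posting lists are touched; it does not reproduce the 2000-times figure, which also depends on disk and update costs that are not modeled here.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy index: word -> set of document ids, with an access counter."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.list_accesses = 0  # number of posting lists read or written

    def insert(self, doc_id, text):
        # Every distinct word in the document updates its own posting list.
        for word in set(text.lower().split()):
            self.postings[word].add(doc_id)
            self.list_accesses += 1

    def query(self, word):
        # A single-word query reads exactly one posting list.
        self.list_accesses += 1
        return self.postings.get(word.lower(), set())

index = InvertedIndex()
index.insert("d1", " ".join("word%d" % i for i in range(1000)))
print(index.list_accesses)   # posting lists touched by one insertion
index.query("word5")
print(index.list_accesses)   # one more for the query
```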

The Data Space

5. Is it designed to search the internet, intranet, and your local disks?

Although the internet and intranets are based on the same technology, they place different requirements on search engines. For instance, enterprise portals are more concerned with security, the variety of document formats and document properties, the freshness of information on the search engine, and the integration of corporate data with personally collected data on local disks. On the other hand, internet portals are more concerned with speed, scalability and flexibility in dealing with diversified web sites.

6. Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc.?

The web doesn't just contain HTML files. Many corporate documents are WORD and EXCEL files whereas most presentations are in POWERPOINT. The search engine must be able to index and search a large variety of file formats. This is especially essential for corporate portals.

7. Does it support BIG5, GB, and UNICODE? And do so efficiently?

On the Internet, especially in Asia, the search engine must be able to index and search text in double-byte and multi-byte character encodings such as BIG5, GB, and UNICODE. Many well-known search engines, for obvious reasons, were developed for English and thus handle ASCII codes only. It is not easy to extend an ASCII-based search engine to deal with double-byte characters without a heavy penalty on reliability, search speed and storage overhead.
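To see what these encodings mean for storage, the following small sketch encodes the same two-character Chinese string with several codecs from the Python standard library and counts the bytes. The helper name encoded_sizes is made up for this illustration.

```python
def encoded_sizes(text):
    """Return the number of bytes the text occupies in several encodings."""
    codecs = ("big5", "gb2312", "utf-8", "utf-16-le")
    return {name: len(text.encode(name)) for name in codecs}

# Two Chinese characters: two bytes each in BIG5 and GB, three each in UTF-8.
print(encoded_sizes("中文"))
# Plain ASCII text stays at one byte per character in BIG5, GB and UTF-8.
print(encoded_sizes("search"))
```

An engine that assumes one byte per character will miscount lengths and may split a character in the middle when it tokenizes or truncates text.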

8. Is it designed and optimized for the Web? Don't give me a relational engine!

For convenience, lower development cost and a variety of other reasons, many search engines are built on top of a relational database. In these cases, search is basically done by the relational database system; the "search engine" only handles the queries and formats the results. This is fine until you realize that a relational system is extremely poor at searching text. Keywords are stored and "normalized" in tables, which must be further indexed by the relational database system in order to give acceptable speed to search requests. It means that you must create indexes on indexes. The end result is that you, as the end user of the search engine, spend a lot of storage and end up with marginal speed.

Chinese/Asian Language Support

9. Is it an English search engine retrofitted with Chinese search?

Many well-known search engines were developed with only English in mind. English is computationally a simpler language than Chinese because it has a small character set and clear word boundaries. Thus, its index and search algorithms are much simpler than those required for the Chinese language. Because of the constraints imposed by the original design, it is very difficult to build a GOOD Chinese search engine on top of an English search engine. Unfortunately, many "Chinese" search engines on the market were developed out of English search engines.
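The word boundary problem is easy to demonstrate. English text can be split on spaces, but Chinese text has no spaces between words. The sketch below uses overlapping two-character terms (bigrams), which is one common workaround and not necessarily what any given product does; both function names are assumptions.

```python
def english_terms(text):
    """English: words are separated by whitespace."""
    return text.lower().split()

def chinese_bigrams(text):
    """Chinese: no spaces, so index every pair of adjacent characters."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(english_terms("Hong Kong tourism"))
print(chinese_bigrams("香港旅游"))
```

Note that the bigram list contains a pair that straddles two words and is not itself a word. This kind of extra term is part of the storage and precision cost of indexing Chinese text without a proper word segmenter.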

Spider Functions

10. Can you control what the spider indexes and how frequently it indexes?

The internet is huge; chances are that your intranets are not small either. What is more, different kinds of data require different degrees of "freshness". For example, your company's retirement policy won't change more than once every several years, right? On the other hand, your work plans, sales and marketing plans, and management directives change on a daily basis. For the retirement policy, you want the search engine to reindex the content only upon request; for work plans and other frequently updated data, you want your search engine to reindex the content every half hour so that your employees can always find the latest information.
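Such a policy could be expressed as a small table of path prefixes and reindexing intervals. The following is a minimal sketch; the paths, the SCHEDULE table and the one-day default are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical policy: path prefix -> reindex interval (None = on request only).
SCHEDULE = {
    "/hr/retirement-policy/": None,
    "/plans/": timedelta(minutes=30),
}
DEFAULT_INTERVAL = timedelta(days=1)  # assumed default for everything else

def reindex_due(path, last_indexed, now):
    """Return True if the spider should revisit this path now."""
    interval = DEFAULT_INTERVAL
    for prefix, configured in SCHEDULE.items():
        if path.startswith(prefix):
            interval = configured
            break
    if interval is None:
        return False  # only reindexed when someone asks for it
    return now - last_indexed >= interval

now = datetime(2024, 1, 1, 12, 0)
print(reindex_due("/plans/sales-q3.html", now - timedelta(minutes=45), now))
print(reindex_due("/hr/retirement-policy/rules.html", now - timedelta(days=900), now))
```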

11. Is the spider/crawler fault tolerant? Can it endure link or host failures?

In most cases, search is rather fault tolerant since each search request is short. Higher reliability can be achieved by mirroring the index database, which is more a web server problem than a search engine problem. However, fault tolerance of the spider/crawler (the software which grabs and indexes your documents and web pages) is of utmost importance since it has to deal with a world of unknown and unreliable web servers and domain name servers, and unreliable network links. The spider/crawler must be able to continue to run correctly despite these uncertainties.
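At the very least, a fault tolerant spider retries a failed page a few times and then moves on instead of stopping. The sketch below assumes a caller-supplied fetch function (for example, a wrapper around urllib.request.urlopen with a timeout) that raises OSError on link, host or DNS failures; the function name and the retry parameters are illustrative only.

```python
import time

def fetch_with_retry(url, fetch, retries=3, base_delay=0.5, sleep=time.sleep):
    """Try to fetch a page; back off between attempts; give up quietly.

    Returns the page content, or None if every attempt failed, so the
    spider can record the failure and continue with the next URL.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...
    return None

# A simulated host that fails twice and then answers.
attempts = []
def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise OSError("link down")
    return b"<html>ok</html>"

print(fetch_with_retry("http://example.com/", flaky_fetch, sleep=lambda s: None))
```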

Search Functions

12. Does it support full Boolean queries and relevance ranking?

Many search engines support both Boolean operators (AND, OR, etc.) and ranking of results by scores. Pay attention to the support of nesting, such as ("Hong Kong" OR Macau) AND tourism, and the support of NOT. Also pay attention to how good the relevance ranking is.
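Nested Boolean queries map naturally onto set operations over posting lists. The following is a minimal sketch over a made-up five-document index; the INDEX contents and the helper names are assumptions for illustration.

```python
# Hypothetical index: term (or phrase) -> ids of the documents containing it.
INDEX = {
    "hong kong": {1, 2, 5},
    "macau": {3, 5},
    "tourism": {2, 3, 4},
    "casino": {3},
}
ALL_DOCS = {1, 2, 3, 4, 5}

def term(text):
    return INDEX.get(text.lower(), set())

def and_(a, b):
    return a & b

def or_(a, b):
    return a | b

def not_(a):
    # NOT needs the full document set, which is why it can be costly.
    return ALL_DOCS - a

# ("Hong Kong" OR Macau) AND tourism
nested = and_(or_(term("Hong Kong"), term("Macau")), term("tourism"))
print(sorted(nested))

# ("Hong Kong" OR Macau) AND tourism AND NOT casino
print(sorted(and_(nested, not_(term("casino")))))
```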

13. Can you search by dates or by categories?

Many search engines allow you to restrict results to web pages last modified within a certain period. Category-based search is more subtle, because it requires that web pages or documents be grouped according to some criteria and that search can be confined to a certain subset of categories. This often imposes a speed and/or storage penalty on the search engine.

14. Can you search files that are on a specific host or of a certain file type?

The prerequisite to this question is that the search engine is able to index different document formats.

15. Can you specify partial words (e.g., econom* and *port)?

Partial search is very important for English. E.g., "econom*" will match economic and economy. You may also want prefix truncation (as in "*port", which matches cyberport and airport) and infix truncation (as in "wom*n", which matches woman and women). Note that these features will inevitably require more storage and more computational time.
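The storage and time costs can be seen in a small sketch. It assumes a tiny sorted term list and handles "econom*" by binary search, "*port" by a second list of reversed terms (which doubles the storage for the term list), and "wom*n" by scanning every term. The names and the "\uffff" upper bound are assumptions that hold for ordinary terms.

```python
import bisect
import fnmatch

TERMS = sorted(["airport", "cyberport", "economic", "economy", "woman", "women"])
REVERSED = sorted(t[::-1] for t in TERMS)  # second copy of the term list

def prefix_range(sorted_terms, prefix):
    """All terms starting with prefix, found by binary search."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

def match(pattern):
    if pattern.endswith("*") and "*" not in pattern[:-1]:
        return prefix_range(TERMS, pattern[:-1])          # econom*
    if pattern.startswith("*") and "*" not in pattern[1:]:
        hits = prefix_range(REVERSED, pattern[1:][::-1])  # *port
        return sorted(t[::-1] for t in hits)
    # Infix patterns such as wom*n fall back to a scan of every term.
    return [t for t in TERMS if fnmatch.fnmatchcase(t, pattern)]

print(match("econom*"))
print(match("*port"))
print(match("wom*n"))
```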

16. Can you expand and translate a query?

Expanding a query (sometimes called query transformation) is very useful in handling synonyms and cross-language retrieval (i.e., you specify a Chinese query and the search engine can return English documents matching the query). Translating a query is of course much simpler than translating free text.
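In its simplest form, query expansion is a dictionary lookup: each query term is replaced by a group of alternatives that are then combined with OR. The sketch below uses a tiny made-up table holding both synonyms and translations; a real system would rely on a much larger thesaurus and bilingual dictionary.

```python
# Hypothetical expansion table: term -> synonyms and translations.
EXPANSIONS = {
    "tourism": ["travel", "旅游"],
    "旅游": ["tourism", "travel"],
    "airport": ["机场"],
}

def expand(query_terms):
    """Turn each term into an OR-group of the term and its alternatives."""
    return [[t] + EXPANSIONS.get(t.lower(), []) for t in query_terms]

# A Chinese query term now also matches English documents.
print(expand(["旅游", "macau"]))
```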

17. Can it be optimized according to user behaviors?

Personalization is the norm for Internet applications. On the system level, the search engine should be able to search faster for commonly searched words. On the user level, the user should be able to set default values for display format, ranking policy, etc.

18. Can it search across secured servers?

For corporate portals on an intranet, it is essential that information stored on secured servers (e.g., SSL-enabled servers) can be indexed and searched. On the other hand, the search engine must not allow unauthorized users to search sensitive data and, in fact, must not give unauthorized users any clue that sensitive data even exists on the search engine.
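One way to meet both requirements is to filter every result list against the user's access rights before anything, including the hit count, is shown. This is a minimal sketch with a hypothetical access table (DOC_ACL) and a deny-by-default rule; it is not a complete security design.

```python
# Hypothetical access table: document -> groups allowed to see it.
DOC_ACL = {
    "cafeteria-menu.html": {"everyone"},
    "sales-plan.doc": {"sales", "management"},
    "merger-memo.doc": {"management"},
}

def visible_results(hits, user_groups):
    """Keep only hits the user may see; unknown documents are hidden."""
    allowed = set(user_groups) | {"everyone"}
    return [doc for doc in hits if DOC_ACL.get(doc, set()) & allowed]

hits = ["merger-memo.doc", "sales-plan.doc", "cafeteria-menu.html"]
shown = visible_results(hits, ["sales"])
# Report the count of the filtered list, never the raw number of hits,
# so the user gets no hint that a hidden document matched.
print(len(shown), shown)
```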

Support

19. Can the vendor customize the system at a reasonable cost and turn-around time?

Surely you have heard about search engines developed by international multi-billion-dollar companies, but they charge multi-billion-dollar prices as well. They certainly can customize their code for you, but can you pay for it? And can you wait?

20. What about local technical support?

Of course, you want immediate attention to any problem with your search engine because your business depends on it; a hotline is not enough.

If you want to know more about search engines, please let us know.