Noise Words and Migration from Oracle Text to SQL Server 2005
Oracle Text functionality can be emulated in SQL Server with Full-Text Search feature. In this article I am going to tell about some peculiarities of Full-Text Search, namely, about the noise words issue, and to provide a solution for using these words in queries to achieve the desired effect.
When using Oracle Text, we can identify the words in the appropriate language that are not to be indexed. They are called stopwords and stored in stoplists. If a stopword is contained in one of the search conditions (for example, ‘the and processor’), the text containing the word ‘processor’ will be returned. But if the query condition consists only of stopwords (for example, ‘the and that’) the query will not return any result. A developer can manipulate contents of stoplists by means of CTX_DLL package.
In SQL Server 2005 you can find similar lists of the words to be ignored. These words are called ‘noise words’ and they are stored in noise word files.
Anyone who works with SQL Server’s Full-Text Search feature can come across some interesting gotcha related to the noise words. For example, consider the following query:
select ‘Yes’ as Result from MyTable
where contains(text_col,’”another” and “processor”‘)
As the searched table does contain both the words “another” and “processor” we wait for the result “Yes” after executing the above query. But we can either get the informational message (“Informational: The full-text search condition contained noise word(s).(0 row(s) affected)” ) if the Results to Text option is set in Management Studio, or just do not get any result. What happened?
The issue is that ‘another’, ’does’, ‘just’ and some others are noise words. If a search condition has the noise words only, then the query fails to return any result.
I should admit that Oracle stoplists and SQL Server noise words files differ in content. There are a lot of noise words in SQL Server (by default) which are not included in default stoplists in Oracle. They are ‘another’, ‘before’, ‘come’, ‘get’, ‘himself’ etc., and they also include digits from 0 to 9 and the dollar sign (‘$’). This can become an issue during Oracle to SQL Server migration.
The noise words files are located in
$SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTDATA\
folder. The file representing your language can be opened and modified with a text editor (e.g. the English US file is named noiseENU.txt). So, it is obvious that the issue can be solved by deleting the necessary words from the relevant noise words file. But notice that noise words files discard commonly occurring words and deleting all the words can lead to a significant growth of the full-text index.
After editing the file and saving it, it is necessary to repopulate full-text catalogs before the changes will take effect. This can be done by executing the following statement:
alter fulltext index on mytable start update population
Then, do not forget to restart SQL Server Full-Text Search service to apply the changes.
Now, if executing the above full-text search query we will get the expected result “Yes”.
But what to do if we do not want our full-text index to grow after deleting the noise words from the file? Is there a solution to get results from a query execution without editing the noise words file?
There is a solution based on adding the asterisk (*) after a noise word in a full-text search condition. But this solution seems to be a side effect of manipulations with a noise words file. According to the solution, the above mentioned query should be rewritten as follows:
select ‘Yes’ as Result from MyTable
where contains(text_col,’”another*” and “processor”‘)
To check the effect of this modification, I filled the edited noise word file to its primary state, repopulated the index and restarted SQL Server Full-Text Search service. Executing the query showed that it works. I tried different variations with different noise words and always got the result “Yes”. But after I deleted the full-text index and created it again, the solution stopped to work. So, it does not solve the problem.
Finally, I have found an option in SQL Server that allows suppressing messages when using noise words in search conditions. This option (“transform noise words”) can be set by means of sp_configure system stored procedure:
sp_configure ’show advanced options’, 1
RECONFIGURE
GO
sp_configure ‘transform noise words’, 1
RECONFIGURE
GO
When the “transform noise words” option is set to 1, SQL Server replaces noise words with the asterisk (*) in phrase queries. Note that the option is considered as advanced, so, to make it available in SQL Server, it is required to set “show advanced options” to 1. This makes all advanced options available.
Thus, if you need only to skip the noise words, the setting “transform noise words” option will give the required effect. Care should be taken when applying this option as when the search condition will consist only of noise words, the query will not return any result. But if you want the noise words to be searched as any other words then you should delete them from the noise words file appropriate to your language settings.
Written by Galina Snurnikova
Tagged as Full-Text Search, Noise Words, SQL Server + Categorized as Newsletters, Newsletter #4, January 6, 2009
Leave a Reply
You must be logged in to post a comment.