The Soundex algorithm encodes English words according to how they sound. Its main purpose is to keep spelling variations from causing errors when recording people's names, for example in a census. A Soundex code has four characters in the form LDDD, where L is the first letter of the name and each D is a decimal digit (for the English alphabet, D ranges from 0 to 6). The coding rules are as follows:
- The first character code is always the first letter of the name, regardless of other rules.
- The letters “A”, “E”, “I”, “O”, “U”, “H”, “W” and “Y” are dropped (they receive no code).
- The letters “B”, “F”, “P” and “V” are coded as 1.
- The letters “C”, “G”, “J”, “K”, “Q”, “S”, “X” and “Z” are coded as 2.
- The letters “D” and “T” are coded as 3.
- The letter “L” is coded as 4.
- The letters “M” and “N” are coded as 5.
- The letter “R” is coded as 6.
- If a letter is doubled, the second occurrence is skipped.
- If a letter has the same code as the letter before it, it is skipped.
- Names with prefixes are encoded twice: once with the prefix and once without it.
- Letters with the same code that are separated by “A”, “E”, “I”, “O” or “U” are both coded.
- Letters with the same code that are separated by “H” or “W” are coded only once.
- If the resulting code has fewer than four characters, it is padded with zeros to four.
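The rules above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `soundex` is hypothetical, “Y” is assumed to behave like a vowel (it separates equal codes), non-letter characters are ignored, and the prefix rule is left out.

```python
# Letter groups from the rules above; letters not listed here receive no digit.
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the four-character LDDD code for a name ("" if it has no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                   # rule 1: keep the first letter
    previous = CODES.get(letters[0], "")  # its digit still blocks an equal neighbour
    for ch in letters[1:]:
        if ch in "HW":
            continue                      # H and W do not separate equal codes
        digit = CODES.get(ch, "")         # vowels and Y (assumption) give ""
        if digit and digit != previous:
            result += digit
        previous = digit                  # a vowel resets this, so a repeat is coded
        if len(result) == 4:
            break                         # all four positions are filled
    return result.ljust(4, "0")           # pad short codes with zeros
```

With these assumptions, `soundex("Lee")` gives `"L000"` and `soundex("Ashcraft")` gives `"A261"`.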
For example, to code the name “Lee”: by rule 1, the first letter starts the code. The following letters are skipped because of rule 2. Since the code so far contains only one character, three zeros are added, and we get the Soundex code L000.
For example, to code the name “Ashcraft”: by rule 1, the first letter starts the code. The letter “S” is coded as 2. The next two letters, “H” and “C”, are skipped: “H” receives no code (rule 2), and “C” has the same code as “S” (rule 4) and is separated from it only by “H” (rule 13). The fifth letter is “R”, which is coded as 6. The vowel “A” is skipped and the letter “F” is coded as 1. The last letter, “T”, is omitted because all four positions of the code are already filled, and we get A261.
Name | Code |
Lee | L000 |
Washington | W252 |
Gutierrez | G362 |
Pfister | P236 |
Jackson | J250 |
Tymczak | T522 |
VanDeusen (with prefix) | V532 |
VanDeusen (without prefix) | D250 |
Ashcraft | A261 |
Smith | S530 |
Smythe | S530 |
The sample codes in the table show how two similar-sounding names, such as Smith and Smythe, are coded in the same way, so that spelling variations of one name do not produce separate records.
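To illustrate how equal codes bring spelling variants together, and how a prefixed name can be filed under two codes, here is a small sketch. The names `index_keys` and `group_by_code` and the `PREFIXES` list are assumptions made for this example, and a prefix is only recognised when the rest of the name starts with a capital letter (as in “VanDeusen”), so that a name like “Lee” is not split.

```python
from collections import defaultdict

GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {letter: digit for letters, digit in GROUPS.items() for letter in letters}
PREFIXES = ("Van", "De", "Di", "La", "Le")  # illustrative list, not exhaustive


def soundex(name):
    """Compact encoder for the rules above (Y treated like a vowel: assumption)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result, previous = letters[0], CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            result += digit
        previous = digit
    return result[:4].ljust(4, "0")


def index_keys(name):
    """Codes a name is filed under: the full name and, if prefixed, the remainder."""
    keys = [soundex(name)]
    for prefix in PREFIXES:
        rest = name[len(prefix):]
        if name.startswith(prefix) and rest[:1].isupper():
            keys.append(soundex(rest))
            break
    return keys


def group_by_code(names):
    """Map each code to the names filed under it, in input order."""
    groups = defaultdict(list)
    for name in names:
        for key in index_keys(name):
            groups[key].append(name)
    return dict(groups)
```

Under these assumptions, `group_by_code(["Smith", "Smythe", "VanDeusen"])` files Smith and Smythe together under `S530` and lists VanDeusen under both `V532` and `D250`.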
If the letter groups are modified to reflect different sounds, the Soundex algorithm can be used for other languages.
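One way to picture this is to pass the letter groups in as data instead of fixing them in the code. The sketch below is only an illustration: `make_soundex` is a hypothetical helper, and the `W_AS_V` variant (where “W” is grouped with “V” instead of being skipped) is invented for the example and is not a published scheme for any language.

```python
def make_soundex(groups, skipped="HW", length=4):
    """Build an encoder from {letters: digit} groups.

    Letters in `skipped` are dropped and do not separate equal codes;
    any other letter without a digit acts as a separator, like a vowel.
    """
    codes = {ch: digit for letters, digit in groups.items() for ch in letters}

    def encode(name):
        letters = [ch for ch in name.upper() if ch.isalpha()]
        if not letters:
            return ""
        result, previous = letters[0], codes.get(letters[0], "")
        for ch in letters[1:]:
            if ch in skipped:
                continue
            digit = codes.get(ch, "")
            if digit and digit != previous:
                result += digit
            previous = digit
            if len(result) == length:
                break
        return result.ljust(length, "0")

    return encode


ENGLISH = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
# Hypothetical variant: W sounds like V, so it gets digit 1 and is no longer skipped.
W_AS_V = {**ENGLISH, "W": "1"}

english = make_soundex(ENGLISH)
variant = make_soundex(W_AS_V, skipped="H")
```

With the English groups, `english("Nowak")` is `"N200"` while `english("Novak")` is `"N120"`; with the hypothetical variant, both `variant("Nowak")` and `variant("Novak")` come out as `"N120"`.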