经典算法之字符串匹配

1、BF算法

BF(Brute Force)算法就是朴素的暴力匹配。BF算法的基本思想是从主串的start位置开始与模式串进行匹配,如果相等,则继续比较后续字符,如果不相等则模式串回溯到开始位置,主串回溯到start+1位置,继续进行比较直至模式串的所有字符都已比较成功,则匹配成功,或者主串所有的字符已经比较完毕,没有找到完全匹配的子串,则匹配失败。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def BF(s, p):
'''
Brute-Force算法实现字符串匹配
'''
indies = []
n = len(s)
m = len(p)
for i in range(n - m + 1):
index = i
for j in range(m):
if s[index] == p[j]:
index += 1
else:
break
if index - i == m:
indies.append(i)

return indies

print(BF('abcxabcdabcdabcy','abcdabcy'))

2、KMP算法

KMP(Knuth-Morris-Pratt)算法是对BF算法的改进,差别在于KMP算法在出现字符串不相等的情况时,不需要返回到字串的开头重新比较。具体过程就是计算一张部分匹配表来改进移动距离。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def getNextList(s):
n = len(s)
nextList = [0, 0]
j = 0
for i in range(1, n):
while j > 0 and s[i] != s[j]:
j = nextList[j]
if s[i] == s[j]:
j += 1
nextList.append(j)
return nextList

def KMP(s, p):
'''
Knuth-Morris-Pratt算法实现字符串匹配
'''
n = len(s)
m = len(p)
nextList = getNextList(p)
indies = []
j = 0
for i in range(n):
while s[i] != p[j] and j > 0:
j = nextList[j]

if s[i] == p[j]:
j += 1
if j == m:
indies.append(i-m+1)
j = nextList[j]
return indies

print(KMP('abcxabcdabcdabcy','abcdabcy'))

3、BM算法

BM(Boyer-Moore)算法是对KMP进一步的改进。总体来说BM算法效率是高于KMP算法的,文本量越大BM算法的效果越明显。BM算法通过两张表来改进移动的距离。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
def getBMBC(pattern):
# 预生成坏字符表
BMBC = dict()
for i in range(len(pattern) - 1):
char = pattern[i]
# 记录坏字符最右位置(不包括模式串最右侧字符)
BMBC[char] = i + 1
return BMBC

def getBMGS(pattern):
# 预生成好后缀表
BMGS = dict()

# 无后缀仅根据坏字移位符规则
BMGS[''] = 0

for i in range(len(pattern)):

# 好后缀
GS = pattern[len(pattern) - i - 1:]

for j in range(len(pattern) - i - 1):

# 匹配部分
NGS = pattern[j:j + i + 1]

# 记录模式串中好后缀最靠右位置(除结尾处)
if GS == NGS:
BMGS[GS] = len(pattern) - j - i - 1
return BMGS

def BM(string, pattern):
'''
Boyer-Moore算法实现字符串查找
'''
m = len(pattern)
n = len(string)
i = 0
j = m
indies = []
BMBC = getBMBC(pattern=pattern) # 坏字符表
BMGS = getBMGS(pattern=pattern) # 好后缀表
while i < n:
while (j > 0):
if i + j -1 >= n: # 当无法继续向下搜索就返回值
return indies

# 主串判断匹配部分
a = string[i + j - 1:i + m]

# 模式串判断匹配部分
b = pattern[j - 1:]

# 当前位匹配成功则继续匹配
if a == b:
j = j - 1

# 当前位匹配失败根据规则移位
else:
i = i + max(BMGS.setdefault(b[1:], m), j - BMBC.setdefault(string[i + j - 1], 0))
j = m

# 匹配成功返回匹配位置
if j == 0:
indies.append(i)
i += 1
j = len(pattern)

print(BM('abcxabcdabcdabcy','abcdabcy'))

4、sunday算法

sunday算法比BM算法更快,主体思路与BM算法相同,主要区别是模式串与文本串按从前往后顺序匹配比较。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def sunday_search(text, pattern):
m, n = len(pattern), len(text)
if m > n:
return []

# 预处理
last = {char: i for i, char in enumerate(pattern)}
result = []

# 在text中查找pattern
i = 0
while i <= n - m:
j = 0
while j < m and pattern[j] == text[i + j]:
j += 1

if j == m:
result.append(i)

if i + m >= n:
break

next_char = text[i + m]
if next_char not in last:
i += m + 1
else:
i += m - last[next_char]

return result

print(sunday_search('ahbsjiasOas','as'))

5、Aho-corasick Automaton算法(AC自动机)

AC自动机(Aho-Corasick Automaton)是多模匹配算法的一种。所谓多模匹配,是指在字符串匹配中,模式串有多个。前面所介绍的KMP、BM为单模匹配,即模式串只有一个。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
class Node(object):

def __init__(self):
self.next = {}
self.fail = None
self.isWord = False

class Ahocorasick(object):
def __init__(self):
self.__root = Node()

def addWord(self, word):
'''
添加关键词到Tire树中
'''
tmp = self.__root
for i in range(0, len(word)):
if word[i] not in tmp.next:
tmp.next[word[i]] = Node()
tmp = tmp.next[word[i]]
tmp.isWord = True

def make(self):
'''
构建自动机,失效函数
'''
tmpQueue = []
tmpQueue.append(self.__root)
while(len(tmpQueue) > 0):
temp = tmpQueue.pop()
p = None
for k, v in temp.next.items():
if temp == self.__root:
temp.next[k].fail = self.__root
else:
p = temp.fail
while p is not None:
if k in p.next:
temp.next[k].fail = p.next[k]
break
p = p.fail
if p is None :
temp.next[k].fail = self.__root
tmpQueue.append(temp.next[k])

def search(self, content):
p = self.__root
result = []
startWordIndex = 0
endWordIndex = -1
currentPosition = 0

while currentPosition < len(content):
word = content[currentPosition]
# 检索状态机,直到匹配
while word not in p.next and p != self.__root:
p = p.fail

if word in p.next:
if p == self.__root:
# 若当前节点是根且存在转移状态,则说明是匹配词的开头,记录词的起始位置
startWordIndex = currentPosition
# 转移状态机的状态
p = p.next[word]
else:
p = self.__root

if p.isWord:
# 若状态为词的结尾,则把词放进结果集
result.append((startWordIndex, currentPosition))

currentPosition += 1
return result

def replace(self, content):
replacepos = self.search(content)
result = content
for i in replacepos:
result = result[0:i[0]] + (i[1] - i[0] + 1) * u'*' + content[i[1] + 1:]
return result


if __name__ == '__main__':
ah = Ahocorasick()
text = '中国科学技术大学位于安徽合肥,是一所理工科大学,又称为南七技校。'
patterns = '科学 大学 理工科 技校'
words = patterns.split(' ')
for w in words:
ah.addWord(w)
ah.make()
results = ah.search(text)
print(results)
if len(results) == 0:
print('No find.')
else:
print(len(results),' matching results are listed below.')
print('-------' + '-'*len(text) + '-------')
print(text)
count = 0
for site in results:
w = text[site[0]:site[1]+1]
count += 1
print(' '*site[0] + w + ' '*(len(text)-site[1]) + ' ' + str(site[0]) + ' ' + str(count))
print('-------' + '-'*len(text) + '-------')