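Apriori is the best-known algorithm for mining association rules. It finds frequent itemsets, which reveal regularities in the data. The famous "beer and diapers" case is an example: the frequent itemset {beer, diapers} was discovered after analyzing a large volume of supermarket transactions. This note mainly records a Python 3 implementation of Apriori and walks through the steps Apriori uses to mine frequent itemsets. The algorithm itself is explained in great detail in the book Data Mining: Concepts and Techniques (《数据挖掘-概念与技术》), so it is not repeated here.
The complete code is here.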
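The previous article, https://www.ph0en1x.space/2018/05/13/Apriori/ , showed how to mine frequent itemsets with the Apriori algorithm. This article continues from there and shows how to find association rules.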
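generateRules is the entry point for generating association rules. For each frequent itemset it starts with consequents that contain a single item (that is, one item is removed from the antecedent), then hands off to rulesFromConseq, which recurses on larger consequents.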
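
    def generateRules(L, supportData, minConf=0.7):
        '''
        L: frequent itemsets grouped by size (L[i] holds the itemsets with i + 1 items)
        supportData: dict mapping each itemset to its support
        '''
        bigRuleList = []
        # Start at index 1: a single-item set cannot be split into a rule.
        for i in range(1, len(L)):
            for freqSet in L[i]:
                # Candidate consequents that contain exactly one item.
                H1 = [frozenset([item]) for item in freqSet]
                # Score the 1-item consequents first, so that rules such as
                # {a, b} -> {c} are not skipped for itemsets of 3 or more items.
                H1 = calcConf(freqSet, H1, supportData, bigRuleList, minConf)
                if i > 1 and len(H1) > 1:
                    # Grow the surviving consequents recursively.
                    rulesFromConseq(freqSet, H1, supportData, bigRuleList,
                                    minConf)
        return bigRuleList
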
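To make the inputs of generateRules concrete, here is a minimal sketch of the shape it expects. The helper `toy_frequent_itemsets` and the four toy transactions are hypothetical and are not the `apriori` function from the previous article; the helper simply counts every combination by brute force, and it assumes support is stored as a raw transaction count.

```python
from itertools import combinations


def toy_frequent_itemsets(transactions, min_count):
    """Brute-force stand-in for apriori() (hypothetical helper).

    Returns (L, support_data), where L[k] lists the frequent itemsets of
    size k + 1 and support_data maps each frequent itemset to its count.
    """
    items = sorted({item for t in transactions for item in t})
    L, support_data = [], {}
    for size in range(1, len(items) + 1):
        level = []
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                level.append(candidate)
                support_data[candidate] = count
        if not level:
            break  # no frequent itemset of this size, so none of a larger size
        L.append(level)
    return L, support_data


# Made-up transactions, for illustration only.
transactions = [
    frozenset({"beer", "diapers", "milk"}),
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "milk"}),
    frozenset({"diapers", "milk"}),
]
L, support_data = toy_frequent_itemsets(transactions, min_count=2)
```

With this shape, L[0] holds the single items and L[1] the pairs, which is why the loop in generateRules starts at index 1: a single item cannot be split into an antecedent and a consequent.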
calcConf computes the confidence of each candidate rule and checks whether it reaches the threshold.
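
    def calcConf(freqSet, H, supportData, brl, minConf=0.7):
        prunedH = []
        for conseq in H:
            # conf(antecedent -> conseq) = support(freqSet) / support(antecedent)
            conf = supportData[freqSet] / supportData[freqSet - conseq]
            if conf >= minConf:
                print(freqSet - conseq, '--->', conseq, 'conf:', conf)
                brl.append((freqSet - conseq, conseq, conf))
                # Pruning: only consequents that reach the threshold are kept.
                # Once the confidence is below it, there is no need to keep
                # adding items to that consequent.
                prunedH.append(conseq)
        return prunedH
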
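As a worked example of the ratio computed in calcConf, the sketch below evaluates the rule {beer} -> {diapers} by hand. The support counts are made up for illustration, and `confidence` is a hypothetical helper, not part of the original code.

```python
# Hypothetical support counts (number of transactions containing each itemset).
support = {
    frozenset({"beer"}): 3,
    frozenset({"diapers"}): 3,
    frozenset({"beer", "diapers"}): 2,
}


def confidence(antecedent, consequent, support):
    """conf(A -> B) = support(A | B) / support(A)."""
    return support[antecedent | consequent] / support[antecedent]


conf = confidence(frozenset({"beer"}), frozenset({"diapers"}), support)
print(round(conf, 3))  # 0.667
```

With the default minConf=0.7 this rule would be discarded, while the minConf=0.3 used later in the main function would keep it.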
rulesFromConseq recursively increases the number of items in the consequent (the right-hand side of the rule).
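
    def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
        '''
        Recursively scan for association rules.
        The antecedent loses 1 item, then 2 items, and so on.
        '''
        m = len(H[0])  # size of the current consequents
        if len(freqSet) > (m + 1):
            # Merge the current consequents into candidates with m + 1 items.
            Hmp1 = aprioriGen(H, m + 1)
            # Keep only the candidates whose rule reaches minConf.
            Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
            # At least two survivors are needed to build larger consequents.
            if len(Hmp1) > 1:
                rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

Add the call that generates the association rules to the main function: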

    import time  # needed for the timing below

    if __name__ == "__main__":
        # dataSet, dataMap = loadDataSet()
        dataSet = loadDataSet()
        start = time.time()
        L, suppData = apriori(dataSet, minSupport=100)
        end = time.time()
        cnt = 1
        for i in L:
            print(cnt, i)
            cnt += 1
        print('Apriori total time:', end - start, 's')
        print("Generate Rule Begin:")
        # Generate the association rules
        generateRules(L, suppData, minConf=0.3)
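
rulesFromConseq relies on aprioriGen, which belongs to the frequent-itemset code from the previous article and is not reproduced here. The sketch below shows what it is assumed to do, namely the usual Apriori candidate-generation step: two itemsets of size k - 1 are merged into a candidate of size k when their first k - 2 items (in sorted order) match. The author's actual version may differ in detail.

```python
def aprioriGen(Lk, k):
    """Assumed behaviour: merge itemsets of size k - 1 into candidates of size k.

    Two itemsets are merged only when their first k - 2 sorted items match,
    which avoids producing the same candidate more than once.
    """
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            prefix_i = sorted(Lk[i])[:k - 2]
            prefix_j = sorted(Lk[j])[:k - 2]
            if prefix_i == prefix_j:
                candidates.append(Lk[i] | Lk[j])
    return candidates


# Growing consequents for a 3-item frequent set {a, b, c}.
H1 = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
H2 = aprioriGen(H1, 2)  # 2-item consequents
H3 = aprioriGen(H2, 3)  # 3-item consequents
```

In rulesFromConseq the same step is applied to consequents rather than to frequent itemsets: each recursion level takes the consequents that passed calcConf and merges them into consequents with one more item.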