Lesson-01 AI Introduction: Syntax Tree and Probability Model
Given a grammar, how do we generate sentences from it?
import random
sentence = """
句子 = 主 謂 賓
主 = 你 | 我 | 他
"""
two_number = """
num* => num num* | num
num => 0 | 1 | 2 | 3 | 4
"""
def two_num(): return num() + num()

def num():
    # strip() removes the surrounding spaces left over from split('|')
    return random.choice("0 | 1 | 2 | 3 | 4".split('|')).strip()
def numbers():
    if random.random() < 0.5:
        return num()
    else:
        return num() + numbers()
for i in range(10):
    print(numbers())
1. A grammar can be implemented by defining very simple functions
2. With recursion, we can generate more complex, "infinitely" long text
simple_grammar = """
sentence => noun_phrase verb_phrase
noun_phrase => Article Adj* noun
Adj* => Adj | Adj Adj*
verb_phrase => verb noun_phrase
Article => 一個 | 這個
noun => 女人 | 籃球 | 桌子 | 小貓
verb => 看着 | 坐在 | 聽着 | 看見
Adj => 藍色的 | 好看的 | 小小的
"""
another_grammar = """
#
"""
import random
def adj(): return random.choice('藍色的 | 好看的 | 小小的'.split('|')).split()[0]

def adj_star():
    # Why do we need lambdas here, instead of an if-else with random?
    return random.choice([lambda: '', lambda: adj() + adj_star()])()
for i in range(10):
    print(adj_star())
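A note on the lambda question above: Python builds the whole list before `random.choice` sees it, so writing `random.choice(['', adj() + adj_star()])` would evaluate the recursive branch eagerly on every call and never terminate. Wrapping each branch in a zero-argument lambda delays evaluation until after the choice is made. A minimal self-contained sketch (with hypothetical `adj_demo` that always returns one fixed word):

```python
import random

def adj_demo():
    # stand-in for adj(): always returns one fixed adjective
    return 'blue'

def adj_star_demo():
    # each branch is a zero-argument lambda, so only the chosen branch
    # runs; without the lambdas, the recursive branch would be evaluated
    # on every call and the recursion would never terminate
    return random.choice([lambda: '', lambda: adj_demo() + adj_star_demo()])()

random.seed(0)
results = [adj_star_demo() for _ in range(20)]
print(results[:5])
```

Every result is zero or more repetitions of the single adjective, and each call terminates because the empty branch is chosen with probability 1/2 at every level.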
But here's the problem: if we change the grammar, every program we have written has to be rewritten. 😦
number_ops = """
expression => expression num_op | num_op
num_op => num op num
op => + | - | * | /
num => 0 | 1 | 2 | 3 | 4
"""
def generate_grammar(grammar_str: str, target, split='=>'):
    # target is unused here; it is kept so the call sites stay uniform
    grammar = {}
    for line in grammar_str.split('\n'):
        if not line.strip(): continue
        # e.g. "two => num + num"
        expression, formula = line.split(split)
        formulas = formula.split('|')
        formulas = [f.split() for f in formulas]
        grammar[expression.strip()] = formulas
    return grammar
choice_a_expr = random.choice
def generate_by_grammar(grammar: dict, target: str):
    # if target is not a key in the grammar, it is a terminal symbol
    if target not in grammar: return target
    expr = choice_a_expr(grammar[target])
    return ''.join(generate_by_grammar(grammar, t) for t in expr)
def generate_by_str(grammar_str, split, target):
    grammar = generate_grammar(grammar_str, target, split)
    return generate_by_grammar(grammar, target)
generate_by_str(number_ops, split='=>', target='expression')
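To see the whole pipeline in one runnable cell, here is a self-contained sketch (the helper names `parse_grammar` and `expand` are mine, mirroring `generate_grammar` and `generate_by_grammar` above): the parser maps each left-hand side to a list of alternatives, each alternative being a list of symbols, and the generator recursively expands the target until only terminals remain.

```python
import random

def parse_grammar(grammar_str, split='=>'):
    # mirrors generate_grammar: dict from symbol to list of alternatives
    grammar = {}
    for line in grammar_str.split('\n'):
        if not line.strip():
            continue
        expression, formula = line.split(split)
        grammar[expression.strip()] = [f.split() for f in formula.split('|')]
    return grammar

def expand(grammar, target):
    # mirrors generate_by_grammar: recursively expand until terminals
    if target not in grammar:
        return target  # terminal symbol
    return ''.join(expand(grammar, t) for t in random.choice(grammar[target]))

number_ops = """
expression => expression num_op | num_op
num_op => num op num
op => + | - | * | /
num => 0 | 1 | 2 | 3 | 4
"""

random.seed(42)
grammar = parse_grammar(number_ops)
print(grammar['op'])  # [['+'], ['-'], ['*'], ['/']]
samples = [expand(grammar, 'expression') for _ in range(10)]
for s in samples:
    print(s)
```

Every generated string consists only of digits and operators, because expansion bottoms out at the terminal symbols of the grammar.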
two => num + num | num - num
num => 0 | 1 | 2 | 3 | 4
# In Westworld, a "human's" language could be defined as:
human = """
human = 自己 尋找 活動
自己 = 我 | 俺 | 我們
尋找 = 找找 | 想找點
活動 = 樂子 | 玩的
"""
假如既然 = """
句子 = if someone state , then do something
if = 既然 | 如果 | 假設
someone = one 和 someone | one
one = 小紅 | 小藍 | 小綠 | 白白
state = 餓了 | 醒了 | 醉了 | 癲狂了
then = 那麼 | 就 | 可以
do = 去
something = 喫飯 | 玩耍 | 去浪 | 睡覺
"""
# A receptionist ("host")'s language could be defined as:
host = """
host = 寒暄 報數 詢問 業務相關 結尾
報數 = 我是 數字 號 ,
數字 = 單個數字 | 數字 單個數字
單個數字 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
寒暄 = 稱謂 打招呼 | 打招呼
稱謂 = 人稱 ,
人稱 = 先生 | 女士 | 小朋友
打招呼 = 你好 | 您好
詢問 = 請問你要 | 您需要
業務相關 = 具體業務
具體業務 = 喝酒 | 打牌 | 打獵 | 賭博
結尾 = 嗎?
"""
for i in range(10):
    print(generate_by_str(假如既然, split='=', target='句子'))

for i in range(10):
    print(generate_by_str(host, split='=', target='host'))
How do we generate the most reasonable sentence?
Eliza
Data Driven
Our goal is a program that does not have to be rewritten when the input data changes. Generalization.
AI? Solving problems automatically: once we find a method, the method itself should not need to change when the input changes.
simpel_programming = '''
programming => if_stmt | assign | while_loop
while_loop => while ( cond ) { change_line stmt change_line }
if_stmt => if ( cond ) { change_line stmt change_line } | if ( cond ) { change_line stmt change_line } else { change_line stmt change_line }
change_line => /N
cond => var op var
op => == | < | >= | <=
stmt => assign | if_stmt
assign => var = var
var => var _ num | words
words => words _ word | word
word => name | info | student | lib | database
nums => nums num | num
num => 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
'''
print(generate_by_str(simpel_programming, target='programming', split='=>'))
def pretty_print(line):
    # utility function: indent the '/N'-separated lines so that indentation
    # grows through the first half and shrinks through the second half
    lines = line.split('/N')
    code_lines = []
    for i, sen in enumerate(lines):
        if i < len(lines) / 2:
            code_lines.append(i * " " + sen)
        else:
            code_lines.append((len(lines) - i) * " " + sen)
    return code_lines
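A quick check of the indentation logic, with the function re-declared so this cell runs on its own: indentation increases through the first half of the lines and decreases through the second half.

```python
def pretty_print(line):
    # same logic as above: split on '/N' and indent by position
    lines = line.split('/N')
    code_lines = []
    for i, sen in enumerate(lines):
        if i < len(lines) / 2:
            code_lines.append(i * " " + sen)
        else:
            code_lines.append((len(lines) - i) * " " + sen)
    return code_lines

demo = pretty_print('if(a<b){/Na=b/N}')
for line in demo:
    print(line)
```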
generated_programming = []
for i in range(20):
    generated_programming += pretty_print(generate_by_str(simpel_programming, target='programming', split='=>'))

for line in generated_programming:
    print(line)
Language Model
1. Conditional probability
2. Independence
Review: 1. Conditional probability?
Suppose that over 365 days you were late 30 times:
Pr(late) = 30/365
Now suppose that, in that same year, you had an upset stomach 60 times, and on 20 of those days you were late:
Pr(late | upset stomach) = 20 / 60
= Pr(late & upset stomach) / Pr(upset stomach)
= (20 / 365) / (60 / 365)
= 20 / 60
Independence: whether you are late has nothing to do with a car crash in Illinois, so
Pr(you are late | car crash in Illinois) = Pr(you are late)
= Pr(you are late & Illinois crash) / Pr(Illinois crash)
Pr(you are late & Illinois crash) = Pr(you are late) * Pr(Illinois crash)
Conditional independence: given a stomach ache, the Illinois crash adds no information:
Pr(you are late | stomach ache & Illinois crash) = Pr(you are late | stomach ache)
= Pr(you are late & stomach ache) / Pr(stomach ache)
≈ Count(late and stomach ache) / Count(stomach ache)
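The arithmetic above can be checked directly; the numbers (30 late days, 60 upset-stomach days, 20 overlapping) come from the example:

```python
total_days = 365
late = 30
upset = 60
late_and_upset = 20

p_late = late / total_days
# conditional probability: the 365 denominators cancel
p_late_given_upset = (late_and_upset / total_days) / (upset / total_days)

print(p_late)              # 30/365 ≈ 0.0822
print(p_late_given_upset)  # 20/60 ≈ 0.3333
```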
Pr(其實就和隨機森林原理一樣)    (the probability of a whole sentence)
-> Pr(其實 & 就和 & 隨機森林 & 原理 & 一樣)    (joint probability of its words)
-> Pr(其實 | 就和 & 隨機森林 & 原理 & 一樣) Pr(就和 & 隨機森林 & 原理 & 一樣)    (chain rule)
-> Pr(其實 | 就和) Pr(就和 & 隨機森林 & 原理 & 一樣)    (2-gram approximation)
-> Pr(其實 | 就和) Pr(就和 | 隨機森林 & 原理 & 一樣) Pr(隨機森林 & 原理 & 一樣)
-> Pr(其實 | 就和) Pr(就和 | 隨機森林) Pr(隨機森林 & 原理 & 一樣)
-> Pr(其實 | 就和) Pr(就和 | 隨機森林) Pr(隨機森林 | 原理) Pr(原理 & 一樣)
-> Pr(其實 | 就和) Pr(就和 | 隨機森林) Pr(隨機森林 | 原理) Pr(原理 | 一樣) Pr(一樣)
Here linguists make a simplification: condition each word only on its neighboring word (the Markov / 2-gram assumption).
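We can sketch this 2-gram idea with made-up counts (the tiny corpus below is invented for illustration; the data-driven version on a real corpus follows):

```python
from collections import Counter

# tiny invented corpus, already tokenized
tokens = ['我們', '在', '喫飯', '我們', '在', '上課', '我們', '在', '喫飯']
pairs = [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]

count_1 = Counter(tokens)
count_2 = Counter(pairs)

def pr_2gram(word1, word2):
    # Pr(word1 | word2) = Count(word1 word2) / Count(word2),
    # matching the right-to-left decomposition above
    return count_2[word1 + word2] / count_1[word2]

print(pr_2gram('我們', '在'))   # Count('我們在') / Count('在')
print(pr_2gram('喫飯', '我們'))
```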
import random
random.choice(range(100))
filename = '/Users/gaominquan/Downloads/sqlResult_1558435.csv'
import pandas as pd
content = pd.read_csv(filename, encoding='gb18030')
content.head()
articles = content['content'].tolist()
len(articles)
articles[0]
import re  # regular expressions

def token(string):
    # we will study regular expressions properly in the next course
    return re.findall(r'\w+', string)
token(articles[0])
import jieba
list(jieba.cut('這個是用來做漢語分詞的'))
from collections import Counter
with_jieba_cut = Counter(jieba.cut(articles[110]))
with_jieba_cut.most_common()[:10]
''.join(token(articles[110]))
articles_clean = [''.join(token(str(a)))for a in articles]
len(articles_clean)
Suppose you have spent a long time on data preprocessing.
In AI problems, roughly 65% of the work is data preprocessing.
We should get into the habit of saving important intermediate results promptly,
writing them to disk.
with open('article_9k.txt', 'w') as f:
    for a in articles_clean:
        f.write(a + '\n')
!ls
import jieba
def cut(string): return jieba.cut(string)
ALL_TOKEN = cut(open('article_9k.txt').read())
TOKEN = []
for i, t in enumerate(ALL_TOKEN):
    if i > 50000: break
    # try increasing this to 200,000
    if i % 1000 == 0: print(i)
    TOKEN.append(t)
len(TOKEN)
from functools import reduce
from operator import add, mul
reduce(add, [1, 2, 3, 4, 5, 8])
[1, 2, 3] + [3, 43, 5]
from collections import Counter
words_count = Counter(TOKEN)
words_count.most_common(100)
frequencies = [f for w, f in words_count.most_common(100)]
x = [i for i in range(100)]
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, frequencies)
An important empirical regularity in NLP (Zipf's law): in a large text corpus, the second most frequent word appears about 1/2 as often as the most frequent word, and in general the n-th most frequent word appears about 1/n as often as the most frequent one.
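A self-contained sanity check of this 1/n relationship on synthetic data (the corpus here is drawn from Zipfian weights, not real text, so it only illustrates the shape of the comparison):

```python
import random
from collections import Counter

random.seed(0)
vocab = [f'w{i}' for i in range(1, 101)]
weights = [1 / i for i in range(1, 101)]  # Zipfian weights: rank i gets 1/i

# draw a synthetic corpus from the Zipf distribution
corpus = random.choices(vocab, weights=weights, k=100_000)
counts = Counter(corpus)

top = counts.most_common(3)
print(top)
# the most frequent word should appear roughly twice as often as the second
print(top[0][1] / top[1][1])
```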
import numpy as np
words_count['我們']
def prob_1(word):
    return words_count[word] / len(TOKEN)
prob_1('我們')
TOKEN = [str(t) for t in TOKEN]
TOKEN_2_GRAM = [''.join(TOKEN[i:i+2]) for i in range(len(TOKEN) - 1)]
TOKEN_2_GRAM[:10]
words_count_2 = Counter(TOKEN_2_GRAM)
def prob_1(word): return words_count[word] / len(TOKEN)

def prob_2(word1, word2):
    # Pr(word1 | word2) = Count(word1 word2) / Count(word2),
    # matching the right-to-left decomposition above
    if word1 + word2 in words_count_2:
        return words_count_2[word1 + word2] / words_count[word2]
    else:  # out-of-vocabulary problem: back off to a small constant
        return 1 / len(words_count)
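The out-of-vocabulary fallback above is ad hoc. A common alternative is add-one (Laplace) smoothing; here is a self-contained sketch on invented toy counts (the names and counts are mine, for illustration only):

```python
from collections import Counter

bigram_counts = Counter({'我們在': 3, '在喫飯': 2})
unigram_counts = Counter({'我們': 3, '在': 3, '喫飯': 2})
vocab_size = len(unigram_counts)

def prob_2_laplace(word1, word2):
    # add 1 to every bigram count, so unseen pairs get a small but
    # non-zero probability instead of a hard-coded constant; we condition
    # on word2 to match the decomposition used in this lesson
    return (bigram_counts[word1 + word2] + 1) / (unigram_counts[word2] + vocab_size)

print(prob_2_laplace('我們', '在'))   # seen pair: (3+1)/(3+3)
print(prob_2_laplace('喫飯', '在'))   # unseen pair: (0+1)/(3+3), still non-zero
```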
prob_2('我們', '在')
prob_2('在', '喫飯')
prob_2('用', '手機')
def get_probablity(sentence):
    words = list(cut(sentence))
    sentence_pro = 1
    for i, word in enumerate(words[:-1]):
        next_ = words[i + 1]
        probability = prob_2(word, next_)
        sentence_pro *= probability
    sentence_pro *= prob_1(words[-1])
    return sentence_pro
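To make the whole pipeline testable without the news corpus, here is a self-contained toy version (the `toy_` names are mine; a whitespace split stands in for jieba, and the three-"sentence" corpus is invented):

```python
from collections import Counter

toy_corpus = '我們 在 喫飯 我們 在 上課 我們 在 喫飯'.split()
toy_counts = Counter(toy_corpus)
toy_counts_2 = Counter(toy_corpus[i] + toy_corpus[i + 1]
                       for i in range(len(toy_corpus) - 1))

def toy_prob_1(word):
    return toy_counts[word] / len(toy_corpus)

def toy_prob_2(word1, word2):
    if word1 + word2 in toy_counts_2:
        return toy_counts_2[word1 + word2] / toy_counts[word2]
    return 1 / len(toy_counts)  # OOV fallback, as above

def toy_sentence_prob(words):
    p = 1
    for i, word in enumerate(words[:-1]):
        p *= toy_prob_2(word, words[i + 1])
    return p * toy_prob_1(words[-1])

p_common = toy_sentence_prob(['我們', '在', '喫飯'])
p_rare = toy_sentence_prob(['我們', '在', '游泳'])
print(p_common, p_rare)
```

Note that `p_rare` collapses to zero because the final unigram factor is unsmoothed for unseen words; the real model above has the same limitation.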
get_probablity('小明今天抽獎抽到一臺蘋果手機')
get_probablity('小明今天抽獎抽到一架波音飛機')
get_probablity('洋蔥奶昔來一杯')
get_probablity('養樂多綠來一杯')
host
need_compared = [
"今天晚上請你喫大餐,我們一起喫日料 明天晚上請你喫大餐,我們一起喫蘋果",
"真事一隻好看的小貓 真是一隻好看的小貓",
"今晚我去喫火鍋 今晚火鍋去喫我",
"洋蔥奶昔來一杯 養樂多綠來一杯"
]
for s in need_compared:
    s1, s2 = s.split()
    p1, p2 = get_probablity(s1), get_probablity(s2)
    better = s1 if p1 > p2 else s2
    print('{} is more possible'.format(better))
    print('-' * 4 + ' {} with probability {}'.format(s1, p1))
    print('-' * 4 + ' {} with probability {}'.format(s2, p2))