Comparing Song Similarity in Python (Very Detailed, There Is Always Something to Learn)

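2019/9/20
I have been studying Signals and Systems lately and wanted a small project to keep myself interested, so I am recording it here.
Below is the preparation I could think of so far; I will learn the rest as I go.
[Figure: rough list of what I need to prepare]
2019/9/21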

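Frequency-Domain Signal Processing

What the complex numbers returned by the FFT mean (after dividing by the number of samples N):

  • The real number at index 0 is the DC component of the time-domain signal.
  • The complex number a+bj at index i describes the sine and cosine components whose period is N/i samples: a carries the cosine component and b carries the sine component.
  • Twice the modulus of the complex number is the amplitude of the cosine wave at the corresponding frequency.
  • The argument of the complex number is the phase of the cosine wave at the corresponding frequency.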
import numpy as np
from scipy.fftpack import fft, ifft
import matplotlib.pyplot as plt
from matplotlib.pylab import mpl
x = np.arange(0, 2*np.pi, 2*np.pi/128)
y = 0.3*np.cos(x) + 0.5*np.cos(2*x+np.pi/4) + 0.8*np.cos(3*x-np.pi/3) + np.sin(4*x) + np.cos(x)
yf = fft(y)/len(y)
print(np.array_str(yf[:5], suppress_small=True))
for ii in range(0, 5):
    print(np.abs(yf[ii]), np.rad2deg(np.angle(yf[ii])))

Running the program above confirms the four points listed earlier.
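As a quick cross-check, here is a minimal sketch that reads the amplitude and phase straight off the normalized bins for the same test signal. It uses `np.fft.fft` instead of `scipy.fftpack` only to keep the example dependency-light, and `bin_amplitude_phase` is a hypothetical helper name, not part of any library.

```python
import numpy as np

N = 128
x = 2 * np.pi * np.arange(N) / N
# Same test signal as above; the two cos(x) terms add up to 1.3*cos(x)
y = (0.3 * np.cos(x) + 0.5 * np.cos(2 * x + np.pi / 4)
     + 0.8 * np.cos(3 * x - np.pi / 3) + np.sin(4 * x) + np.cos(x))

yf = np.fft.fft(y) / N  # normalize so bin values relate directly to amplitudes


def bin_amplitude_phase(bins, i):
    """Hypothetical helper: cosine amplitude and phase (degrees) of bin i >= 1."""
    return 2 * np.abs(bins[i]), np.rad2deg(np.angle(bins[i]))


for i in range(1, 5):
    amp, phase = bin_amplitude_phase(yf, i)
    print(i, round(amp, 3), round(phase, 1))
```

The loop should report amplitudes of 1.3, 0.5, 0.8 and 1.0 with phases of roughly 0, 45, -60 and -90 degrees. The last one comes from sin(4x) = cos(4x - 90°).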

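Synthesizing the time-domain signal

The part that needs the most explanation is how several cosine signals are combined into an arbitrary time-domain signal:

The FFT (divided by N) yields an array A of N complex numbers, where $A_i$ describes the $i$-th sub-signal. The sub-signal with $i=0$ is the DC signal, and $Re(A_0)$ is its amplitude. For $i \ge 1$:
$$2 \times Re(A_i) = Amplitude(signal_{cos})$$
$$-2 \times Im(A_i) = Amplitude(signal_{sin})$$
Synthesis from the first $k$ sub-signals can be written as:
$$2\times \sum_{i=1}^{k}\{Re(A_i)\cos(it)-Im(A_i)\sin(it)\}+Re(A_0)$$
The code is as follows: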

import numpy as np
from scipy.fftpack import fft
import matplotlib.pyplot as plt

def triangle_wave(size):
    # One period of a triangle wave sampled at `size` points on [0, 1)
    x = np.arange(0, 1, 1.0/size)
    y = np.where(x < 0.5, x, 0)
    y = np.where(x >= 0.5, 1-x, y)
    return x, y

def fft_combine(bins, n, loops):
    # Rebuild `loops` periods of the signal from the first n FFT bins
    length = len(bins)*loops
    data = np.zeros(length)
    index = loops*np.arange(0, 2*np.pi, (2*np.pi)/length)

    for k, p in enumerate(bins[:n]):
        if k != 0:
            p *= 2  # every bin except DC counts twice (positive and negative frequency)
        # Synthesis of the time-domain signal
        data += np.real(p)*np.cos(k*index)
        data -= np.imag(p)*np.sin(k*index)
    return index, data

fft_size = 256
# FFT of the triangle wave, normalized by the number of samples
x, y = triangle_wave(fft_size)
fy = fft(y)/fft_size
loops = 4
y = np.tile(y, loops)  # repeat the period so the plot shows `loops` cycles

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
axes[0].plot(np.abs(fy[:20]), "o")
axes[0].set_xlabel("frequency bin")
axes[0].set_ylabel("amplitude")  # linear scale, not dB
axes[1].plot(y, label="original triangle wave", linewidth=2)
for ii in [0, 1, 3, 5, 7, 9]:
    index, data = fft_combine(fy, ii+1, loops)
    axes[1].plot(data, label="N=%s" % ii, alpha=0.6)
axes[1].legend(loc="best")
plt.show()
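I will dig into the theory once I have actually studied Signals and Systems.

Ha, I have now finished the Signals and Systems course, so here are answers to a few questions I ran into at the start.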

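  1. Why is twice the modulus of the complex number the amplitude of the cosine wave at that frequency?
    Answer: if $x(t)$ is a real signal, take its Fourier transform:
    $$f_n = \mathscr{F}\{x(t)\}$$
    The derivation uses Euler's formula, $\cos\theta = (e^{j\theta} + e^{-j\theta})/2$, which splits each cosine into a positive-frequency term and a negative-frequency term. Each term carries half of the cosine's amplitude, while the phase is unchanged.

  2. Why does a discrete signal with period N have a Fourier transform that is also periodic with period N?
    Answer: this is the periodicity property of the Fourier transform of discrete signals. It is listed in the table of Fourier properties in Oppenheim's Signals and Systems.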
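Both answers can be checked numerically. The sketch below evaluates the DFT sum directly at arbitrary indices; `dft_at` is a hypothetical helper written for this illustration, and the cosine parameters are made up.

```python
import numpy as np


def dft_at(x, k):
    """Evaluate the DFT sum of x at an arbitrary integer index k (hypothetical helper)."""
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * k * n / len(x)))


N = 16
n = np.arange(N)
x = 0.7 * np.cos(2 * np.pi * 3 * n / N + 0.4)  # real cosine: amplitude 0.7, phase 0.4 rad

X3 = dft_at(x, 3) / N
Xm3 = dft_at(x, -3) / N

# Question 1: the amplitude is split evenly between bins +3 and -3
print(abs(X3), abs(Xm3), 2 * abs(X3))
# Question 2: shifting the index by N gives the same value
print(np.isclose(dft_at(x, 3), dft_at(x, 3 + N)))
```

Each of the two bins holds 0.35, which is half of the 0.7 amplitude, and the bin at index 3 + N equals the bin at index 3.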

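A side note on the Nyquist rate
In the sampling theorem, the sampling frequency must be greater than $2\omega_m$, where $\omega_m$ is the highest frequency $\omega$ in the spectrum of the original signal. This $2\omega_m$ is called the Nyquist rate. Sampling faster than it keeps the shifted copies of the spectrum from overlapping (aliasing).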
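To see the overlap in practice, here is a small sketch with made-up numbers: a 7 Hz sine is sampled once above and once below its Nyquist rate of 14 Hz. `dominant_frequency` is a hypothetical helper for this illustration.

```python
import numpy as np


def dominant_frequency(signal_hz, sample_rate_hz, duration_s=10):
    """Sample a sine of signal_hz and return the strongest frequency seen in the samples."""
    n = np.arange(int(sample_rate_hz * duration_s))
    samples = np.sin(2 * np.pi * signal_hz * n / sample_rate_hz)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1 / sample_rate_hz)
    return freqs[np.argmax(spectrum)]


print(dominant_frequency(7, 20))  # sampled above the Nyquist rate of 14 Hz
print(dominant_frequency(7, 10))  # sampled below it
```

The first call should report 7 Hz. The second should report 3 Hz: at a 10 Hz sampling rate the 7 Hz tone is indistinguishable from one at 10 - 7 = 3 Hz.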

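Processing audio with pydub and ffmpeg

A note up front:
"RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work". Fix: add ffmpeg's bin directory to the Path environment variable. Note that it must be appended to Path itself, not just created as a separate system variable!!! Then restart.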
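Before running any pydub code, it may help to confirm that Python can actually see the executable. This is a minimal sketch using only the standard library; `find_ffmpeg` is a hypothetical helper name.

```python
import shutil


def find_ffmpeg():
    """Return the full path of ffmpeg (or avconv) found on PATH, or None."""
    return shutil.which("ffmpeg") or shutil.which("avconv")


path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; pydub will not be able to decode MP3 files")
else:
    print("ffmpeg found at", path)
```

If editing Path is not an option, pydub's documentation also describes pointing `AudioSegment.converter` at an explicit ffmpeg path; I have not needed that here.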

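I. Convert MP3 to WAV and split the song into several parts

Notes first:
1. The song is split into parts mainly so that the time order of the features is preserved.
2. WAV: an uncompressed file format.
3. MP3: a compressed file format.
The code is as follows: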

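import os
from pydub import AudioSegment
tail, track = os.path.split(mp3_path)  # mp3_path: path to the source MP3 file
song_name = track.split('.')
wav_dir = os.path.join(tail, 'w_session')
os.makedirs(wav_dir, exist_ok=True)  # the export fails if the folder is missing
wav_path = os.path.join(wav_dir, song_name[0]+'.wav')
sound = AudioSegment.from_file(mp3_path, format='mp3')
sound.export(wav_path, format='wav')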

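Get the WAV file information:

import wave
import numpy as np
w = wave.open(wav_path)
params = w.getparams()
print(params)
# number of channels, sample width (bytes), sampling rate, number of frames
nchannels, sampwidth, framerate, nframes = params[:4]
t = np.arange(0, nframes)*(1/framerate)  # time axis of the file, in seconds
strData = w.readframes(nframes)  # raw audio as a bytes object
waveData = np.frombuffer(strData, dtype=np.int16)  # bytes to int16 (assumes 16-bit samples)
#waveData = waveData*1.0/(max(abs(waveData)))  # normalize the amplitude
# WAV samples are interleaved frame by frame: rows are frames, columns are channels
waveData = np.reshape(waveData, [nframes, nchannels])

Split the song

# chunk: list with the length in seconds (integers) of each of the four parts (not shown here)
for ii in range(nchannels):
    start_time = 0  # restart from the beginning for each channel
    for jj in range(0, 4):
        end_time = start_time + chunk[jj]
        blockData = waveData[start_time*framerate:end_time*framerate, ii]
        start_time = end_time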
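The same splitting idea can be wrapped in a small function. This is only a sketch: `split_by_durations` is a hypothetical helper, the audio is fake data, and it assumes the samples are stored as a (frames, channels) array.

```python
import numpy as np


def split_by_durations(samples, framerate, durations):
    """Split a (frames, channels) array into consecutive parts of the given lengths in seconds."""
    parts = []
    start = 0
    for seconds in durations:
        end = start + int(round(seconds * framerate))
        parts.append(samples[start:end])
        start = end
    return parts


framerate = 8  # tiny made-up rate so the shapes are easy to read
samples = np.arange(10 * framerate * 2).reshape(-1, 2)  # 10 s of fake stereo audio
parts = split_by_durations(samples, framerate, [1, 2, 3, 4])
print([p.shape for p in parts])
```

Slicing both channels at once keeps the parts aligned, so the per-channel loop is only needed later when features are computed.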

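II. Audio feature extraction

Grouped by processing domain

  • Features to extract:
    Time-domain features:
    linear prediction coefficients, zero-crossing rate
    Frequency-domain features:
    Mel coefficients, LPC cepstral coefficients, entropy features, spectral centroid
    Time-frequency features:
    wavelet coefficients
  • TOOLS: pyAudioAnalysis
    Download and installation: see the project's installation instructions.
    My impression is that this toolkit is fairly new, and its GitHub page lists all kinds of issues.

There is also a paper that describes the toolkit in detail.

Some excerpts follow: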
Feature Extraction
Audio Features

  1. the audio signal is first divided into short-term windows (frames) and for each frame all 34 features are calculated. This results in a sequence of short-term feature vectors of 34 elements each. Widely accepted short-term window sizes are 20 to 100 ms.
  2. Typical values of the mid-term segment size can be 1 to 10 seconds.
  3. In cases of long recordings (e.g. music tracks) a long-term averaging of the mid-term features can be applied so that the whole signal is represented by an average vector of mid-term statistics.
  4. Extract mid-term features and long-term averages in order to produce one feature vector per audio signal.
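To make the short-term / long-term idea concrete, here is a minimal sketch for a single feature, the zero-crossing rate, computed with plain NumPy. It is not pyAudioAnalysis code; the function names, the 50 ms window and the two test tones are assumptions chosen for illustration.

```python
import numpy as np


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])


def short_term_zcr(signal, framerate, window_s=0.05):
    """Zero-crossing rate for each non-overlapping short-term window."""
    size = int(round(window_s * framerate))
    n_frames = len(signal) // size
    return np.array([zero_crossing_rate(signal[i * size:(i + 1) * size])
                     for i in range(n_frames)])


framerate = 8000
t = np.arange(framerate) / framerate         # one second of audio
low = np.sin(2 * np.pi * 100 * t + 0.1)      # 100 Hz tone
high = np.sin(2 * np.pi * 1000 * t + 0.1)    # 1000 Hz tone

zcr_low = short_term_zcr(low, framerate)     # one value per 50 ms frame
zcr_high = short_term_zcr(high, framerate)
print(len(zcr_low), zcr_low.mean(), zcr_high.mean())  # long-term averages
```

Each signal yields a sequence of short-term values, and averaging that sequence gives one number per signal, which is the same reduction the excerpt describes for all 34 features.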

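III. Computing the similarity matrix

The paper states: "A similarity matrix is computed based on the cosine distances of the individual feature vectors."
In practice, however, I found that the features have different scales, which makes cosine similarity an inaccurate measure of feature similarity. For example (columns 7, 8 and 9 of the feature matrix):

7                       8    9
2.62281350727428e-10    0    -50.5964425626941
2.29494356256208e-11    0    -50.5964425626941
4.55467645371887e-11    0    -50.5964425626941

So I decided to compute a relative ratio for each feature and then take the average.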
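The scale problem is easy to reproduce with the first two rows of the example. This is a small sketch; `cosine_similarity` is written out by hand here rather than taken from a library.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# First two rows of the example above (columns 7, 8, 9)
v1 = np.array([2.62281350727428e-10, 0.0, -50.5964425626941])
v2 = np.array([2.29494356256208e-11, 0.0, -50.5964425626941])

print(cosine_similarity(v1, v2))  # essentially 1: the large feature dominates
print(v1[0] / v2[0])              # yet the first feature differs by more than 10x
```

The cosine similarity is practically 1 even though one feature differs by an order of magnitude, which is why a per-feature ratio looked more informative to me.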

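def similarity(v1, v2):
#   mean relative similarity for each row (segment) of two feature matrices
    sim = []

    for ii in range(v1.shape[0]):
        temp = []  # collected per row so each mean covers exactly that row's features
        for jj in range(v1.shape[1]):
            # skip features that are zero in both vectors (avoids 0/0)
            if v1[ii, jj] != 0 or v2[ii, jj] != 0:
                temp.append(1 - abs(v1[ii, jj]-v2[ii, jj])
                            / max(abs(v1[ii, jj]), abs(v2[ii, jj])))
        sim.append(np.mean(temp))
    print(sim)
    return sim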
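For larger feature matrices the double loop can be replaced by array operations. The following is a sketch of the same per-row measure, not a drop-in replacement I have benchmarked; `similarity_vectorized` is a hypothetical name and the input matrices are made up.

```python
import numpy as np


def similarity_vectorized(v1, v2):
    """Row-wise mean of 1 - |a - b| / max(|a|, |b|), skipping features that are 0 in both."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = np.maximum(np.abs(v1), np.abs(v2))
    valid = denom > 0
    ratio = np.zeros_like(denom)
    ratio[valid] = 1 - np.abs(v1 - v2)[valid] / denom[valid]
    counts = valid.sum(axis=1)
    # Rows where every feature is zero in both inputs get NaN (nothing to compare)
    return np.where(counts > 0, ratio.sum(axis=1) / np.maximum(counts, 1), np.nan)


a = np.array([[1.0, 2.0, 0.0], [4.0, 0.0, 0.0]])
b = np.array([[1.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
print(similarity_vectorized(a, b))
```

For the first row the two non-zero features score 1 and 0.5, giving 0.75; the second row has a single usable feature scoring 0.5. Note that the measure can go negative when the two values have opposite signs.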

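Mahalanobis distance is another option worth trying (see the reference article).
Some excerpts follow.

  • Use cases:

    1. Measuring how much two random variables X and Y differ when both follow the same distribution with covariance matrix C
    2. Measuring how far X is from the mean vector of a class, to decide which class a sample belongs to. Here Y is the class mean vector.

  • Pros and cons of Mahalanobis distance:
    Pros: independent of scale; removes the influence of correlations between variables
    Cons: features cannot be weighted differently, and weak features may be exaggerated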
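As a rough illustration of the scale independence, here is a minimal NumPy sketch of the distance $\sqrt{(x-y)^T C^{-1} (x-y)}$. The covariance matrix and the two vectors are made-up numbers, and `mahalanobis` is a hand-written helper (SciPy ships its own version in `scipy.spatial.distance`, which takes the inverse covariance instead).

```python
import numpy as np


def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # solve(cov, diff) computes C^{-1} (x - y) without forming the inverse explicitly
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Two features on very different scales (made-up numbers)
cov = np.diag([1e-6, 100.0])
x = np.array([2.6e-3, -50.0])
y = np.array([0.3e-3, -50.0])

print(mahalanobis(x, y, cov))        # the small-scale feature now counts: about 2.3
print(mahalanobis(x, y, np.eye(2)))  # identity covariance: plain Euclidean distance
```

With the identity matrix the distance is the tiny Euclidean value 0.0023, while the real covariance rescales the first feature by its standard deviation. The same rescaling is what can exaggerate weak features, as the excerpt warns.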

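IV. Reducing the run time

Previously I split the song into segments whose lengths form an arithmetic sequence, but computing the features of a single segment took a very long time. I could not wait that long, so I looked for ways to reduce the cost. I came up with two:

  • Pick a random four-second clip inside each of those segments and compute the feature similarity on the clips. Only if it is 0.5 or higher, compute the feature similarity of the full segments.
  • Reduce the original 44.1 kHz sampling rate by a factor of four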
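k = 4  # downsampling factor
#   lower the audio resolution: replace every k consecutive samples by their mean
#   w has shape (nframes, 2)
w_decrease = [[], []]
for kk in (0, 1):
    ii = 0  # restart from the first sample for each channel
    while ii < len(w[:, kk]):
        # the slice is clipped automatically at the end of the array
        w_decrease[kk].append(np.mean(w[ii:ii+k, kk]))
        ii = ii + k
w = np.array(w_decrease).T  # back to (frames, channels); the sampling rate is now framerate / k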
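Since the whole point is speed, the Python loop itself can also be avoided. This is a sketch of the same block averaging with NumPy only; `block_average` is a hypothetical helper and the input is fake data.

```python
import numpy as np


def block_average(w, k):
    """Average every k consecutive frames of a (frames, channels) array.

    The last block may be shorter than k; it is averaged over the frames it has.
    """
    w = np.asarray(w, dtype=float)
    starts = np.arange(0, w.shape[0], k)
    sums = np.add.reduceat(w, starts, axis=0)
    counts = np.diff(np.append(starts, w.shape[0]))
    return sums / counts[:, None]


w = np.arange(20, dtype=float).reshape(10, 2)  # 10 fake stereo frames
small = block_average(w, 4)
print(small.shape)
print(small)
```

Block averaging is only a crude low-pass filter. `scipy.signal.decimate` applies a proper anti-aliasing filter before downsampling and may be worth trying instead, though I have not compared the resulting features.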

V. Complete code

# -*- coding: utf-8 -*-
"""
Created on Fri Jan 10 21:51:38 2020

@author: yoona
"""
import os
import sys
import wave
import numpy as np
#import struct
from pydub import AudioSegment
import matplotlib.pyplot as plt
from pyAudioAnalysis import audioFeatureExtraction as afe
import eyed3
import random
import math

def Features(path, mode):
    x = wave.open(path)
    params = x.getparams()
    print(params)

    if params[0] != 2:
        raise ValueError('number of channels is not 2')

    strData = x.readframes(params[3])
    w = np.frombuffer(strData, dtype=np.int16)
    w = np.reshape(w,[params[3], params[0]])
    fs = params[2]  # sampling rate
    
    k = 4
    w_decrease = [[], []]
    
    if mode == 'second':
    #   lower the audio resolution: average every k consecutive samples
        for kk in (0, 1):
            ii = 0  # restart from the first sample for each channel
            while ii < len(w[:, kk]):
                w_decrease[kk].append(np.mean(w[ii:ii+k, kk]))
                ii = ii + k
        w = np.array(w_decrease).T  # back to shape (frames, channels)
        fs = fs / k  # the sampling rate drops by the same factor
        
    eigen_vector_0 = afe.mtFeatureExtraction(
            w[:, 0], fs,30.0, 30.0, 2, 2)
    eigen_vector_1 = afe.mtFeatureExtraction(
            w[:, 1], fs,30.0, 30.0, 2, 2)
    
    return eigen_vector_0, eigen_vector_1

def read_wave(wav_path):
    w = wave.open(wav_path)
    params = w.getparams()
#    print(params)
#   number of channels, sample width (bytes), sampling rate, number of frames
    nchannels, sampwidth, framerate, nframes = params[:4]
    
#   time axis of the file
    t = np.arange(0, nframes)*(1/framerate)
    strData = w.readframes(nframes)  # raw audio as bytes
    waveData = np.frombuffer(strData, dtype=np.int16)  # bytes to int16
    waveData = waveData*1.0/(max(abs(waveData)))  # normalize the amplitude
    waveData = np.reshape(waveData,[nframes, nchannels])  # one column per channel
    
#    plot the wave
    plt.figure()
    plt.subplot(4,1,1)
    plt.plot(t,waveData[:, 0])
    plt.xlabel("Time(s)")
    plt.ylabel("Amplitude")
    plt.title("Ch-1 wavedata")
    plt.grid(True)  # show grid lines
    plt.subplot(4,1,3)
    plt.plot(t,waveData[:, 1])
    plt.xlabel("Time(s)")
    plt.ylabel("Amplitude")
    plt.title("Ch-2 wavedata")
    plt.grid(True)  # show grid lines
    plt.show()  
    
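def similarity(v1, v2):
#   mean relative similarity for each row (segment) of two feature matrices
    sim = []
    
    for ii in range(v1.shape[0]):
        temp = []  # collected per row so each mean covers exactly that row's features
        for jj in range(v1.shape[1]):
            # skip features that are zero in both vectors (avoids 0/0)
            if v1[ii, jj]!=0 or v2[ii, jj]!=0 :
                temp.append(1 - abs(v1[ii, jj]-v2[ii, jj])
                            / max(abs(v1[ii, jj]), abs(v2[ii, jj])))
        sim.append(np.mean(temp))
    print(sim)
    return sim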

def compute_chunk_features(mp3_path):
# =============================================================================
# Similarity computation, step 1
# =============================================================================
#   get the song duration in seconds
    mp3Info = eyed3.load(mp3_path)
    time = int(mp3Info.info.time_secs)
    print(time)
    tail, track = os.path.split(mp3_path)
    
#   create the two output folders
    dirct_1 = os.path.join(tail, 'wavSession')
    dirct_2 = os.path.join(tail, 'wavBlock')
    if not os.path.exists(dirct_1):
        os.makedirs(dirct_1)
    if not os.path.exists(dirct_2):
        os.makedirs(dirct_2)  

#   get the song name
    song_name = track.split('.')
#   convert the format
    wav_all_path = os.path.join(tail, song_name[0]+'.wav')
    sound = AudioSegment.from_file(mp3_path, format='mp3')
    sound.export(wav_all_path, format='wav')
    read_wave(wav_all_path)
#   split the audio: segment lengths grow by 4 s five times, then shrink by 4 s
    gap = 4
    diff = time/10 - 8
    start_time = 0
    end_time = math.floor(diff)
    vector_0 = np.zeros((10, 68))
    vector_1 = np.zeros((10, 68))
    info = []  # (start, end) of each segment, in seconds
    
    for jj in range(5):
        wav_name = song_name[0]+str(jj)+'.wav'
        wav_path = os.path.join(tail, 'wavSession', wav_name)
#       pick a random four-second clip inside the segment
        rand_start = random.randint(start_time, end_time) 
        blockData = sound[rand_start*1000:(rand_start+gap)*1000]
##       pydub slices are indexed in milliseconds
#        blockData = sound[start_time*1000:end_time*1000]
        blockData.export(wav_path, format='wav')
        eigVector_0, eigVector_1 = Features(wav_path, [])
        print(jj)  # progress marker
#       feature vector of this clip
        vector_0[jj, :] = np.mean(eigVector_0[0], 1)
        vector_1[jj, :] = np.mean(eigVector_1[0], 1)
#       advance to the next segment
        diff = diff + 4
        info.append((start_time, end_time))
        start_time = end_time
        end_time = math.floor(start_time + diff)
        
#   bridge between the two loops: the next segment starts where the last one ended
    end_time = start_time 
    
    for kk in range(5, 10):
#       advance to the next segment
        diff = diff - 4
        start_time = end_time
        end_time = math.floor(start_time + diff)
        info.append((start_time, end_time))  # record the segment that is sampled below
        wav_name = song_name[0]+str(kk)+'.wav'
        wav_path = os.path.join(tail, 'wavSession', wav_name)
        
#       pick a random four-second clip inside the segment
        rand_start = random.randint(start_time, end_time) 
        blockData = sound[rand_start*1000:(rand_start+gap)*1000]        
#        blockData = sound[start_time*1000:end_time*1000]
        blockData.export(wav_path, format='wav')
        eigVector_0, eigVector_1 = Features(wav_path, [])
        print(kk)  # progress marker
#       feature vector of this clip
        
        vector_0[kk, :] = np.mean(eigVector_0[0], 1)
        vector_1[kk, :] = np.mean(eigVector_1[0], 1)
        
    return vector_0, vector_1, info  # feature vectors of the two channels, plus segment times

def Compute_Bolck_Features(info, mp3_path):
# =============================================================================
# Similarity computation, step 2
# =============================================================================
#   get the song duration in seconds
    mp3Info = eyed3.load(mp3_path)
    time = int(mp3Info.info.time_secs)
    print(time)
#   get the song name
    tail, track = os.path.split(mp3_path)
    song_name = track.split('.')
#   load the MP3
    sound = AudioSegment.from_file(mp3_path, format='mp3')
    vector_0 = np.zeros((len(info), 68))
    vector_1 = np.zeros((len(info), 68))
    
    for kk in range(len(info)):
        #   features of the complete segment
        wav_name = song_name[0]+str(kk)+'.wav'
        wav_path = os.path.join(tail, 'wavBlock', wav_name)
        #   cut out the complete segment (pydub slices are in milliseconds)
        blockData = sound[info[kk][0]*1000:info[kk][1]*1000]        
        blockData.export(wav_path, format='wav')
        eigVector_0, eigVector_1 = Features(wav_path, 'second')
        print(kk)  # progress marker
        #   feature vector of this segment
        vector_0[kk, :] = np.mean(eigVector_0[0], 1)
        vector_1[kk, :] = np.mean(eigVector_1[0], 1)
    return vector_0, vector_1

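def file_exists(file_path):
    # splitext returns (root, ext); compare only the extension
    if os.path.splitext(file_path)[1] == '.mp3':
        if os.path.isfile(file_path):
            return file_path
        else:
            raise TypeError('file does not exist')
    else:
        raise TypeError('wrong file format: the extension is not .mp3')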

if __name__ == '__main__':

#for path, dirs, files in os.walk('C:/Users/yoona/Desktop/music_test/'):
#    for f in files:
#        if not f.endswith('.mp3'):
#            continue
# join the path components
#path = r'C:\Users\yoona\Desktop\musictest'
#f = 'CARTA - Aranya (Jungle Festival Anthem).mp3'
#mp3_path = os.path.join(path, f)
# =============================================================================
# sa_b: a is the song number, b is the channel number of that song
# =============================================================================
#s1_0, s1_1, info1= compute_chunk_features(mp3_path) 
#    path_1 = file_exists(sys.argv[1])
#    path_2 = file_exists(sys.argv[2])
    
    path_1 = r'C:\Users\yoona\Desktop\music\薛之謙 - 別.mp3'
    path_2 = r'C:\Users\yoona\Desktop\music\薛之謙 - 最好.mp3'
    
#   step 1: compare random four-second clips
    s1_1, s1_2, info1 = compute_chunk_features(path_1)
    s2_1, s2_2, info2 = compute_chunk_features(path_2)
    
    sim_1 = similarity(s1_1, s2_1)  # channel 1
    sim_2 = similarity(s1_2, s2_2)  # channel 2
    
    info1_new = []
    info2_new = []
    
#   step 2: keep only the segments whose clips were similar enough
    for i, element in enumerate(sim_1):
        if element >= 0.5:
            info1_new.append(info1[i])
            
#   if nothing passed the threshold, fall back to the most similar segment
    if not info1_new:
        pos = np.argmax(sim_1)
        info1_new.append(info1[pos])
    s1_1, s1_2 = Compute_Bolck_Features(info1_new, path_1)
    
    for i, element in enumerate(sim_2):
        if element >= 0.5:
            info2_new.append(info2[i])

    if not info2_new:
        pos = np.argmax(sim_2)
        info2_new.append(info2[pos])
    s2_1, s2_2 = Compute_Bolck_Features(info2_new, path_2)
    
    sim_1 = similarity(s1_1, s2_1)  # channel 1
    sim_2 = similarity(s1_2, s2_2)  # channel 2

VI. Results analysis

Experiment group 1:
A: 薛之謙 - 最好.mp3
B: 薛之謙 - 別.mp3
Experiment group 2:
A: Karim Mika - Superficial Love.mp3
B: Burgess/JESSIA - Eclipse.mp3
Experiment group 3:
A: CARTA - Aranya (Jungle Festival Anthem).mp3
B: 薛之謙 - 別.mp3
