A General Architecture for High-Concurrency, Real-Time Search over 100 Million Records with the Sphinx Search Engine

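I. Market Share

1. Introduction

Sphinx

Advantages:

  1. Sphinx is an open-source full-text search engine written in C++ and built around SQL data sources; queries over 10 million records return in 0.x seconds (hundreds of milliseconds)
  2. Started in 2001, so it has had nearly 20 years of refinement in the market (this article uses 3.0.3, the latest version at the time of writing)
  3. Ranked 5th among search engines by market share
  4. Alibaba Cloud RDS includes a MySQL storage engine, SphinxSE, built as a companion to Sphinx; it supports SQL JOIN
  5. Provides SphinxQL, so the search engine can be queried the same way as a SQL database
  6. The official PHP documentation currently covers 4 search engine extensions, and Sphinx is one of them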

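II. Basic Concepts

1. Search engine

A search engine is a system that uses dedicated programs to collect information from the Internet according to some strategy, organizes and processes that information, and then offers a retrieval service that shows users the information relevant to their queries. Search engines include full-text indexes, directory indexes, meta search engines, vertical search engines, aggregated search engines, portal search engines, and free link lists.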

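2. Data sources

A data source is where the documents come from. Sphinx can connect directly to several mainstream storage products, for example: mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc... The SQL query that defines a data source may contain JOINs.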

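3. Tokenization

Tokenization splits each submitted document into terms. This article uses unigram (single-character) tokenization, not dictionary-based Chinese word segmentation such as Pangu. Unigram tokenization: 我愛中國 is split into 我 愛 中 國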
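The unigram rule can be illustrated with a small Python sketch. It is a toy model only, not Sphinx's tokenizer: it assumes the character range U+3000..U+2FA1F that the configuration later in this article passes to ngram_chars, and the function name unigram_tokens is made up for the example.

```python
# Toy model of unigram tokenization; not Sphinx's actual implementation.
# The range mirrors the article's setting: ngram_chars = U+3000..U+2FA1F.
NGRAM_FIRST, NGRAM_LAST = 0x3000, 0x2FA1F


def unigram_tokens(text):
    """Emit every character in the n-gram range as its own token.

    Runs of other letters and digits stay together as one token;
    anything else (spaces, ASCII punctuation) acts as a separator.
    """
    tokens = []
    word = []  # pending run of letters/digits outside the n-gram range
    for ch in text:
        if NGRAM_FIRST <= ord(ch) <= NGRAM_LAST:
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(unigram_tokens("我愛中國"))  # ['我', '愛', '中', '國']
```

Because every character becomes a term, no dictionary is needed, at the cost of very large per-term document counts on a big corpus.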

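4. Indexes

  1. Main index: type=plain; the SQL query decides which rows of the data source it covers
  2. Delta index: type=plain; the SQL query decides which rows of the data source it covers
  3. Real-time index: type=rt; updated in memory through SQL-like CRUD statements
  4. Distributed index: type=distributed; combines the three types above and can also stitch together data across servers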

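5. Ghost data

Scenario

The main index contains an article whose text is: 我要吃飯

The article is later changed to: 我要喝酒, and the delta index is rebuilt

Searching for the new text 喝酒 now finds the article through the delta index, but searching for the old text 吃飯 still finds it as well, because the main index keeps the stale copy.

How can document updates reach search results promptly when the main index is too large to rebuild often?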
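The effect is easy to reproduce in a toy Python model. The sketch below is an illustration under simplifying assumptions, not Sphinx code: each index is a plain dict, matching is a substring test, and the kill list is just a set of document ids that the delta index declares stale in the main index (the role that sql_query_killlist plays in the configuration later in this article).

```python
# Toy model of ghost data; dicts stand in for indexes.
main_index = {1: "我要吃飯"}   # built before the edit
delta_index = {1: "我要喝酒"}  # rebuilt after the edit
kill_list = {1}                # ids whose main-index copy is stale


def search(index, term):
    """Return the ids of documents whose text contains the term."""
    return {doc_id for doc_id, text in index.items() if term in text}


def search_all(term, use_kill_list):
    """Search main and delta together, optionally suppressing stale main hits."""
    main_hits = search(main_index, term)
    if use_kill_list:
        main_hits -= kill_list
    return main_hits | search(delta_index, term)


print(search_all("吃飯", use_kill_list=False))  # {1}: the ghost hit
print(search_all("吃飯", use_kill_list=True))   # set(): ghost suppressed
print(search_all("喝酒", use_kill_list=True))   # {1}: new text still found
```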

III. Hands-on Walkthrough

1. Prepare the data source

2 functions

DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `rand_num`(`start_number` INT(11) UNSIGNED, `end_number` INT(11) UNSIGNED) RETURNS int(11)
BEGIN
	DECLARE i int default 0;
	set i = FLOOR(start_number+RAND() * (end_number-start_number+1));
	return i;
END$$
DELIMITER ;

DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `rand_string`(`number` INT(11) UNSIGNED) RETURNS varchar(1024) CHARSET utf8
BEGIN
    DECLARE chars_str varchar(1024) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789【買2免1】榮誠月餅納福吉祥4口味月餅520g/袋中秋傳統糕點點心配送範圍送貨範圍僅限常州、揚州、蘇州、鹽城、徐州、宿遷、淮安、泰州、無錫、連雲港、南通、鎮江、南京地區(生鮮類別僅限部分地區)支付方式檢測到您當前處於非安全網絡環境,部分商品信息可能不準確,請在交易支付頁面再次確認商品價格信息哈啊';
    DECLARE return_str varchar(1024) DEFAULT '';
    DECLARE i int DEFAULT 0;
    WHILE i < number DO
        set return_str = CONCAT(return_str,SUBSTRING(chars_str,FLOOR(1+RAND()*200),1));
        set i=i+1;
    END while;
    RETURN return_str;
END$$
DELIMITER ;

1 stored procedure

DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `insert_main`(IN `number` INT(10) UNSIGNED)
BEGIN
	DECLARE i int default 0;
	# Disable autocommit so that all inserts are committed once at the end
	set autocommit =0;
	# Start the loop
	REPEAT
		set i = i+1;
		insert into main values(null,rand_num(0,999999999),rand_string(rand_num(0,1024)));
	
	UNTIL i=number
	END REPEAT;
    commit;
END$$
DELIMITER ;

3 tables

CREATE TABLE `add` (
 `type` int(10) unsigned NOT NULL,
 `id` int(10) unsigned NOT NULL,
 PRIMARY KEY (`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `change` (
 `type` int(10) unsigned NOT NULL,
 `id` int(10) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `main` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `type` int(10) unsigned NOT NULL DEFAULT '0',
 `beizhu` text NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Generate 100 million rows of test data: 23.1 GiB

mysql -uroot -p123456; 
use test;
call insert_main(100000000);

mysql> SELECT COUNT(*) FROM `main`;
+-----------+
| COUNT(*)  |
+-----------+
| 100000000 |
+-----------+
1 row in set (4 min 38.74 sec)

2. Install Sphinx

wget -P ~/ http://sphinxsearch.com/files/sphinx-3.0.3-facc3fb-linux-amd64.tar.gz
mkdir ~/sphinx
cd ~/sphinx
tar -xzvf ~/sphinx-3.0.3-facc3fb-linux-amd64.tar.gz -C ./ --strip-components 1
mkdir log/ data/

3. Search configuration

sudo vim ~/sphinx/etc/sphinx.conf

1/8. Main data source

source main
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        = 123456
    sql_db            = test
    sql_port        = 3306
    sql_query_pre    = SET NAMES utf8
    sql_query_pre = REPLACE INTO `add` SELECT 1,MAX(id) FROM `main`
    sql_query_pre = TRUNCATE `change`
    sql_query = SELECT `id`, `type`, `beizhu` FROM `main` WHERE `id`<=( SELECT `id` FROM `add` WHERE `type`=1) 
    sql_attr_uint        = type
}

2/8. Delta data source

source zengliang:main
{
    sql_query_pre    = SET NAMES utf8
    sql_query_pre = 
    sql_query_pre = 
    sql_query = SELECT `id`, `type`, `beizhu` FROM `main` WHERE `id`>( SELECT `id` FROM `add` WHERE `type`=1) UNION SELECT `id`, `type`, `beizhu` FROM `main` WHERE `id` IN(SELECT `id` FROM `change` WHERE `type`=1)
    sql_query_killlist = SELECT `id` FROM `change` WHERE `type`=1
}
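How the two sources divide the rows can be traced with a short Python sketch that replays their queries. Several things in it are assumptions for illustration: an in-memory SQLite database stands in for MySQL, TRUNCATE is replaced by DELETE because SQLite has no TRUNCATE, the table definitions are simplified, and the step where an application records an edited id in `change` with type 1 is a guess at the intended usage, since the article does not show that code.

```python
import sqlite3

# SQLite stands in for MySQL; table definitions are simplified.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE `main` (id INTEGER PRIMARY KEY, type INTEGER NOT NULL DEFAULT 0, beizhu TEXT NOT NULL);
CREATE TABLE `add` (type INTEGER PRIMARY KEY, id INTEGER NOT NULL);
CREATE TABLE `change` (type INTEGER NOT NULL, id INTEGER NOT NULL);
INSERT INTO `main` VALUES (1, 0, '我要吃飯'), (2, 0, '我是中國人');
""")


def rebuild_main(conn):
    """Replay the main source: save the checkpoint, clear edits, select old rows."""
    conn.execute("REPLACE INTO `add` SELECT 1, MAX(id) FROM `main`")
    conn.execute("DELETE FROM `change`")  # TRUNCATE `change` in the MySQL version
    rows = conn.execute(
        "SELECT id FROM `main` WHERE id <= (SELECT id FROM `add` WHERE type = 1)")
    return sorted(row[0] for row in rows)


def rebuild_delta(conn):
    """Replay the delta source: rows after the checkpoint plus edited rows."""
    rows = conn.execute(
        "SELECT id, beizhu FROM `main` WHERE id > (SELECT id FROM `add` WHERE type = 1) "
        "UNION "
        "SELECT id, beizhu FROM `main` WHERE id IN (SELECT id FROM `change` WHERE type = 1)")
    docs = dict(rows)
    kill = sorted(row[0] for row in conn.execute("SELECT id FROM `change` WHERE type = 1"))
    return docs, kill


main_ids = rebuild_main(db)

# Assumed application behaviour: an edit updates the row and logs its id in `change`.
db.execute("UPDATE `main` SET beizhu = '我要喝酒' WHERE id = 1")
db.execute("INSERT INTO `change` VALUES (1, 1)")
db.execute("INSERT INTO `main` (type, beizhu) VALUES (0, '新文章')")  # becomes id 3

delta_docs, kill_list = rebuild_delta(db)
print(main_ids)    # [1, 2]
print(delta_docs)  # {1: '我要喝酒', 3: '新文章'}
print(kill_list)   # [1]
```

The delta index therefore carries both brand-new rows and fresh copies of edited rows, while the kill list names the ids whose main-index copies must be ignored.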

3/8. Main index

index main
{
    source            = main
    path            = /home/letwang/sphinx/data/main
    min_infix_len        = 2
    ngram_len        = 1
    ngram_chars        = U+3000..U+2FA1F
    kbatch = main
}

4/8. Delta index

index zengliang:main
{
    source = zengliang
    path = /home/letwang/sphinx/data/zengliang
}

5/8. Real-time index

index shishi
{
    type            = rt
    rt_mem_limit        = 128M
    rt_attr_uint = type
    rt_field = beizhu
    path            = /home/letwang/sphinx/data/shishi
    min_infix_len        = 2
    ngram_len        = 1
    ngram_chars        = U+3000..U+2FA1F
}

6/8. Distributed index

index fenbushi
{
    type = distributed
    agent =127.0.0.1:9312:main           #local = main
    agent =127.0.0.1:9312:zengliang   #local = zengliang
    agent =127.0.0.1:9312:shishi   #local = shishi
}

7/8. Indexer

indexer
{
    mem_limit        = 1024M
}

8/8. Daemon (searchd)

searchd
{
    listen            = 9312
    listen            = 9306:mysql41
    log            = /home/letwang/sphinx/log/searchd.log
    query_log        = /home/letwang/sphinx/log/query.log
    read_timeout        = 5
    max_children        = 30
    pid_file        = /home/letwang/sphinx/log/searchd.pid
    seamless_rotate        = 1
    preopen_indexes        = 1
    unlink_old        = 1
    workers            = threads
    dist_threads = 4
    binlog_path        = /home/letwang/sphinx/data
}

4. Rebuild the full index

~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf --all --rotate
    
Sphinx 3.0.3 (commit facc3fb)

using config file '/home/letwang/sphinx/etc/sphinx.conf'...
indexing index 'main'...
collected 100000000 docs, 17421.4 MB
sorted 6623.4 Mhits, 100.0% done
total 100000000 docs, 17.42 Gb
total 2819.8 sec, 6.178 Mb/sec, 35464 docs/sec
indexing index 'zengliang'...
collected 0 docs, 0.0 MB
total 0 docs, 0.0 Kb
total 0.0 sec, 0.0 Kb/sec, 0 docs/sec
skipping non-plain index 'shishi'...
skipping non-plain index 'fenbushi'...

5. Start Sphinx

~/sphinx/bin/searchd -c ~/sphinx/etc/sphinx.conf

Sphinx 3.0.3 (commit facc3fb)

using config file '/home/letwang/sphinx/etc/sphinx.conf'...
listening on all interfaces, port=9312
listening on all interfaces, port=9306
precaching index 'main'
rotating index 'main': success
precaching index 'zengliang'
rotating index 'zengliang': success
precaching index 'shishi'
precached 3 indexes in 0.130 sec

Stop the service:
~/sphinx/bin/searchd -c ~/sphinx/etc/sphinx.conf --stopwait

6. Check the search engine status with SphinxQL

➜  ~ mysql -h127.0.0.1 -P9306
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 3.0.3 (commit facc3fb)
mysql> show databases;
Empty set (0.00 sec)
mysql> show tables;
+-----------+-------------+
| Index     | Type        |
+-----------+-------------+
| fenbushi  | distributed |
| main      | local       |
| shishi    | rt          |
| zengliang | local       |
+-----------+-------------+
4 rows in set (0.00 sec)

7. Build the delta index

➜  ~ mysql -uroot -p123456;
mysql> call insert_main(1);

~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf zengliang --rotate

Sphinx 3.0.3 (commit facc3fb)
Copyright (c) 2001-2018, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/home/letwang/sphinx/etc/sphinx.conf'...
indexing index 'zengliang'...
collected 1 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 1 docs, 0.5 Kb
total 0.1 sec, 4.8 Kb/sec, 10 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=11713).

8. Merge the delta index into the main index (optional)

mysql> select * from zengliang;
+-----------+-----------+
| id        | type      |
+-----------+-----------+
| 100000001 | 172620683 |
+-----------+-----------+
1 row in set (0.00 sec)

~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf --merge main zengliang --rotate

Sphinx 3.0.3 (commit facc3fb)
using config file '/home/letwang/sphinx/etc/sphinx.conf'...
merging index 'zengliang' into index 'main'...
merged 7233.8 Kwords
merged in 1590.479 sec
rotating indices: successfully sent SIGHUP to searchd (pid=7718).

9. Use the real-time index

➜  ~ mysql -h127.0.0.1 -P9306

mysql> DESC shishi;
+--------+--------+------------+------+
| Field  | Type   | Properties | Key  |
+--------+--------+------------+------+
| id     | bigint |            |      |
| beizhu | field  | indexed    |      |
| type   | uint   |            |      |
+--------+--------+------------+------+
3 rows in set (0.00 sec)

mysql> INSERT INTO `shishi` values (1, '我是中國人', 11);
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO `shishi` values (2, '我要吃飯', 22);
Query OK, 1 row affected (0.01 sec)

mysql> select * from shishi WHERE MATCH('"*我*"');
+------+------+
| id   | type |
+------+------+
|    1 |   11 |
|    2 |   22 |
+------+------+
2 rows in set (0.00 sec)

Tip: if you are feeling bold, you can even move the main index's data into the real-time index
mysql> TRUNCATE RTINDEX shishi;
mysql> ATTACH INDEX main TO RTINDEX shishi;

10. Search the distributed index

mysql> SELECT * FROM `fenbushi` WHERE MATCH('"*鮮中交貨淮州*"') LIMIT 10;
| id        | type      |
| 100000001 | 172620683 |
1 row in set, 1 warning (1.01 sec)
mysql> select count(*) from fenbushi;
| count(*)  |
| 100000003 |
1 row in set (1.01 sec)
mysql> select count(*) from main;
| count(*)  |
| 100000000 |
1 row in set (0.95 sec)
mysql> select count(*) from zengliang;
| count(*) |
|        1 |
1 row in set (0.00 sec)
mysql> select count(*) from shishi;
| count(*) |
|        2 |
1 row in set (0.00 sec)

11. Scheduled tasks

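crontab -e
 */1 * * * * ~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf zengliang --rotate
 0 */12 * * * ~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf --merge main zengliang --rotate
 30 1 * * * ~/sphinx/bin/indexer -c ~/sphinx/etc/sphinx.conf --all --rotate
 
Rebuild the delta index every minute
Merge the delta index into the main index every 720 minutes (the cron minute field only goes up to 59, so this is written as every 12 hours)
Rebuild all indexes every day at 1:30
The indexer is a compiled binary, so it is invoked directly rather than through /bin/sh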
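The merge measured earlier in this article took 1590.479 seconds, far longer than the one-minute interval of the delta job, so scheduled runs can overlap. Whether overlapping indexer runs cause trouble is not something this article establishes; if you would rather be safe, one option is a small wrapper that skips a run while another is still in progress. The Python sketch below is a hypothetical helper, not part of Sphinx: the lock file path and the function name are made up, and it relies on fcntl.flock, which is available on Linux and other Unix systems.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Hypothetical lock file location; pick any path every cron job can write to.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "sphinx-indexer.lock")


def run_exclusively(command, lock_path=LOCK_PATH):
    """Run the command unless another run holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    The lock is released when the lock file is closed.
    """
    with open(lock_path, "w") as lock_file:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # someone else is indexing; skip this run
        return subprocess.run(command).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    exit_code = run_exclusively(sys.argv[1:])
    sys.exit(0 if exit_code is None else exit_code)
```

Saved under a name of your choice, the script would take the indexer command line as its arguments in each crontab entry, with the same lock path shared by all three jobs.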

12. Run searches

Search the main index
mysql> SELECT * FROM `main` WHERE MATCH('"*僅非確息類*"');
1 row in set (0.95 sec)

Search the delta index
mysql> SELECT * FROM `zengliang` WHERE MATCH('"*僅非確息類*"');
0 row in set (0.01 sec)

Search the real-time index
mysql> SELECT * FROM `shishi` WHERE MATCH('"*僅非確息類*"');
0 row in set (0.02 sec)

Search the distributed index
mysql> SELECT * FROM `fenbushi` WHERE MATCH('"*僅非確息類*"');
1 row in set (0.80 sec)

Debug a search
mysql> SHOW META;
| Variable_name | Value    |
| total         | 1        |
| total_found   | 1        |
| time          | 0.80   |
| keyword[0]    | 僅       |
| docs[0]       | 24152970 |
| hits[0]       | 74754214 |
| keyword[1]    | 非       |
| docs[1]       | 16617532 |
| hits[1]       | 37394418 |
| keyword[2]    | 確       |
| docs[2]       | 23187798 |
| hits[2]       | 49207648 |
| keyword[3]    | 息       |
| docs[3]       | 23188209 |
| hits[3]       | 49235777 |
| keyword[4]    | 類       |
| docs[4]       | 16628887 |
| hits[4]       | 37414147 |
18 rows in set (0.00 sec)

13. Summary

Performance metrics

total 100000000 docs, 17.42 Gb

Ubuntu 14.04 64bit
Intel® Core™ i5-6500 CPU @ 3.20GHz × 4 
Intel® HD Graphics 530 (Skylake GT2) 
2*4G 2133 MHz
ATA Disk Seagate 976.0 GB

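Attribute filtering: 300-400 ms
Full-text search: about 1 second

General architecture for high-concurrency, real-time search over 100 million records with Sphinx

  1. Clients query the distributed index, which already covers the main index, the delta index, and the real-time index
  2. A scheduled task rebuilds the delta index every minute, which resolves the ghost data problem and gives near-real-time search
  3. When a user changes data, the change is also written to the real-time index, which gives real-time search; the real-time index does not lose data when searchd restarts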
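On the client side, point 1 means every search is a SphinxQL statement against fenbushi sent over the MySQL protocol on port 9306. The Python sketch below only builds that statement in the same form as the queries shown earlier; it is a hypothetical helper, and the function name, the whitelist, and the decision to reject any keyword that is not purely letters and digits (instead of escaping special characters) are assumptions made to keep the example safe and short. Sending the statement requires a MySQL client library, which is left out.

```python
# Hypothetical helper: builds the quoted infix query used in this article.
SEARCHABLE_INDEXES = {"main", "zengliang", "shishi", "fenbushi"}


def build_infix_query(keyword, index="fenbushi", limit=10):
    """Return a SphinxQL SELECT that matches the keyword as a quoted infix phrase."""
    if index not in SEARCHABLE_INDEXES:
        raise ValueError("unknown index: %r" % (index,))
    # Rejecting anything but letters and digits sidesteps quoting and
    # query-operator escaping; CJK characters count as letters.
    if not keyword or not keyword.isalnum():
        raise ValueError("keyword must contain only letters and digits")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return "SELECT * FROM `%s` WHERE MATCH('\"*%s*\"') LIMIT %d" % (index, keyword, limit)


print(build_infix_query("僅非確息類"))
# SELECT * FROM `fenbushi` WHERE MATCH('"*僅非確息類*"') LIMIT 10
```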

IV. Appendix

Links to the sources used in the slides

  1. https://baike.baidu.com/item/Sphinx/14627
  2. http://sphinxsearch.com/
  3. http://sphinxsearch.com/wiki/doku.php?id=third_party
  4. http://php.net/manual/zh/book.sphinx.php
  5. https://db-engines.com/en/ranking/search+engine