Using Regular Expressions in C on Linux

  The required header is <regex.h>: #include <regex.h>

Compiling a regular expression: regcomp()

Function prototype:

int regcomp(regex_t *preg, const char *regex, int cflags);
  • regex_t is a structure type that holds the compiled regular expression. In glibc it is defined as follows:
  struct re_pattern_buffer
  {
    /* Space that holds the compiled pattern.  It is declared as
       `unsigned char *' because its elements are sometimes used as
       array indexes.  */
    unsigned char *__REPB_PREFIX(buffer);

    /* Number of bytes to which `buffer' points.  */
    unsigned long int __REPB_PREFIX(allocated);

    /* Number of bytes actually used in `buffer'.  */
    unsigned long int __REPB_PREFIX(used);

    /* Syntax setting with which the pattern was compiled.  */
    reg_syntax_t __REPB_PREFIX(syntax);

    /* Pointer to a fastmap, if any, otherwise zero.  re_search uses the
       fastmap, if there is one, to skip over impossible starting points
       for matches.  */
    char *__REPB_PREFIX(fastmap);

    /* Either a translate table to apply to all characters before
       comparing them, or zero for no translation.  The translation is
       applied to a pattern when it is compiled and to a string when it
       is matched.  */
    __RE_TRANSLATE_TYPE __REPB_PREFIX(translate);

    /* Number of subexpressions found by the compiler.  */
    size_t re_nsub;

    /* Zero if this pattern cannot match the empty string, one else.
       Well, in truth it's used only in `re_search_2', to see whether or
       not we should use the fastmap, so we don't set this absolutely
       perfectly; see `re_compile_fastmap' (the `duplicate' case).  */
    unsigned __REPB_PREFIX(can_be_null) : 1;

    /* If REGS_UNALLOCATED, allocate space in the `regs' structure
       for `max (RE_NREGS, re_nsub + 1)' groups.
       If REGS_REALLOCATE, reallocate space if necessary.
       If REGS_FIXED, use what's there.  */
  #ifdef __USE_GNU
  # define REGS_UNALLOCATED 0
  # define REGS_REALLOCATE 1
  # define REGS_FIXED 2
  #endif
    unsigned __REPB_PREFIX(regs_allocated) : 2;

    /* Set to zero when `regex_compile' compiles a pattern; set to one
       by `re_compile_fastmap' if it updates the fastmap.  */
    unsigned __REPB_PREFIX(fastmap_accurate) : 1;

    /* If set, `re_match_2' does not return information about
       subexpressions.  */
    unsigned __REPB_PREFIX(no_sub) : 1;

    /* If set, a beginning-of-line anchor doesn't match at the beginning
       of the string.  */
    unsigned __REPB_PREFIX(not_bol) : 1;

    /* Similarly for an end-of-line anchor.  */
    unsigned __REPB_PREFIX(not_eol) : 1;

    /* If true, an anchor at a newline matches.  */
    unsigned __REPB_PREFIX(newline_anchor) : 1;
  };

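  • preg is a pointer to a regex_t structure, which receives the compiled regular expression.
  • regex is a pointer to the regular expression string we wrote.
  • cflags controls how the expression is compiled. It is 0 or the bitwise OR of one or more of the following values:

REG_EXTENDED: use POSIX extended regular expression syntax (the default is basic syntax)
REG_ICASE: ignore case when matching
REG_NOSUB: do not report match positions; regexec() then ignores its nmatch and pmatch arguments
REG_NEWLINE: match-any-character operators no longer match a newline, and ^ and $ also match immediately after and before a newline inside the string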

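Matching a regular expression: regexec()

Function prototype:

int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
  • preg is the compiled regular expression.
  • string is the target text to search.
  • nmatch is the length of the regmatch_t array pmatch.
  • pmatch is an array of regmatch_t structures that receives the match results: pmatch[0] describes the whole match and pmatch[i] describes the i-th parenthesized subexpression. The structure is defined as follows: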
typedef struct {
    regoff_t rm_so;
    regoff_t rm_eo;
} regmatch_t;

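rm_so is the offset in the target text at which the matched text starts.
rm_eo is the offset in the target text of the first character after the matched text, so the match length is rm_eo - rm_so.
  • eflags is 0 or the bitwise OR of the following two flags:
REG_NOTBOL: the match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
REG_NOTEOL: the match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).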

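Freeing a regular expression: regfree()

Function prototype:

void regfree(regex_t *preg);

Call this function when you have finished with a compiled regular expression, or before compiling a different expression into the same regex_t. It releases the memory that regcomp() allocated for the contents of the structure.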

Test case

The following program reads a text file and extracts every occurrence of the target pattern from it.

#include <fcntl.h>
#include <regex.h>
#include <stdio.h>
#include <unistd.h>

/* Holds the whole input file; static so it does not live on the stack. */
static char str[1024 * 1024];

int main(int argc, char **argv)
{
    /* Capture group 1 is the "book/<1-6 digits>" part of the link.
       [[:space:]] is the POSIX spelling of the GNU-only "\s" escape. */
    const char *pattern = "href=\"[[:space:]]*/(book/[0-9]{1,6})[[:space:]]*";
    regmatch_t pmatch[2];           /* [0] whole match, [1] group 1 */
    const size_t nmatch = 2;
    regex_t reg;
    char buffer[512];               /* receives the regerror() message */
    size_t count = 0;
    ssize_t n;
    int t;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        fprintf(stderr, "file: %s open error\n", argv[1]);
        return 1;
    }
    /* Read the file, never writing past the buffer and leaving room for '\0'. */
    while (count < sizeof(str) - 1 &&
           (n = read(fd, str + count, sizeof(str) - 1 - count)) != 0)
    {
        if (n == -1)
        {
            fprintf(stderr, "file read error\n");
            close(fd);
            return 1;
        }
        count += (size_t)n;
    }
    close(fd);
    str[count] = '\0';
    printf("\nfile read over! begin URL analysis now...\n");

    if ((t = regcomp(&reg, pattern, REG_EXTENDED)) != 0)
    {
        regerror(t, &reg, buffer, sizeof buffer);
        fprintf(stderr, "grep: %s (%s)\n", buffer, pattern);
        return 1;
    }

    /* Offsets in pmatch are relative to p, so advance p past each match. */
    const char *p = str;
    while (regexec(&reg, p, nmatch, pmatch, 0) == 0)
    {
        int len = (int)(pmatch[1].rm_eo - pmatch[1].rm_so);
        printf("%.*s\n", len, p + pmatch[1].rm_so);
        p += pmatch[0].rm_eo;
    }
    regfree(&reg);
    return 0;
}

Test text:
Link: https://pan.baidu.com/s/1qASJOJ-XyBdViElONmG5xw Extraction code: tr4t
