The DOT Language

緣起

在學習著名的Graphviz的工具中dot時,看到這篇語言描述,不長,就翻譯了一下。翻譯方法依然是帶監督的機器學習,可惜的就是這個監督是不可反饋的。

正文

1. Introduction

The following is an abstract grammar defining the DOT language. Terminals are shown in bold font and nonterminals in italics. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.

下面是一個抽象的語法定義了DOT的語言。 終結符以粗體字體表示和非終結符用斜體表示,文字字符在單引號中給出,圓括號表示在需要的時候進行分組,方括號包含可選項目,豎線|分隔可選項。備註:下面的這種定義語言語法的表達式稱爲擴展巴科斯範式Extended Backus–Naur Form,EBNF),關於EBNF的介紹可以參考編譯原理相關的書籍(國外的龍書,虎書,鯨書,國內的看陳火旺或馮博琴的),當然,如果僅僅是瞭解的話,可以看看wikipedia中的介紹(Extended Backus–Naur Form)。

graph

:

[ strict ] (graph | digraph) [ ID ] '{' stmt_list '}'

stmt_list

:

[ stmt [ ';' ] [ stmt_list ] ]

stmt

:

node_stmt

 

|

edge_stmt

 

|

attr_stmt

 

|

ID '=' ID

 

|

subgraph

attr_stmt

:

(graph | node | edge) attr_list

attr_list

:

'[' [ a_list ] ']' [ attr_list ]

a_list

:

ID [ '=' ID ] [ ',' ] [ a_list ]

edge_stmt

:

(node_id | subgraph) edgeRHS [ attr_list ]

edgeRHS

:

edgeop (node_id | subgraph) [ edgeRHS ]

node_stmt

:

node_id [ attr_list ]

node_id

:

ID [ port ]

port

:

':' ID [ ':' compass_pt ]

 

|

':' compass_pt

subgraph

:

[ subgraph [ ID ] ] '{' stmt_list '}'

compass_pt

:

(n | ne | e | se | s | sw | w | nw | c | _)

The keywords node, edge, graph, digraph, subgraph, and strict are case-independent. Note also that the allowed compass point values are not keywords, so these strings can be used elsewhere as ordinary identifiers and, conversely, the parser will actually accept any identifier。

關鍵字node,edge,graph,digraph,subgraph以及strict說是case-independent。需要注意的是允許compass點值不是關鍵字,因此這些字符串可在其他地方用作普通的標識符,並且相反地,解析器實際上接受任何標識符。(這裏不是很明白在說什麼)

An ID is one of the following:

  • l Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores ('_') or digits ([0-9]), not beginning with a digit;
  • l ID標識符可爲下列之一:字母表([\ 377 A-ZA-Z \ 200])中的任何字符串,下劃線(“_”)或數字([0-9]),但不可以數字開頭; 注:這裏的定義和C/C++中關於標識符的定義類似,實際上dot語言就是使用的一種類似C的語法,註釋,標識符都很相似,而且dot語言(GraphViz)就是由CC++的發明者所在的AT&T開發的。
  • l a numeral [-]?(.[0-9]+ | [0-9]+(.[0-9]*)? );
  • l 數字[-]?(.[0-9]+ | [0-9]+(.[0-9]*)? );
  • l any double-quoted string ("...") possibly containing escaped quotes (\");
  • l 可能包含轉義引號(\”)的任意雙引號引用的字符串
  • l HTML字符串(<...>an HTML string (<...>).

(備註:上面的內容中包含正則表達式,?表示01個,+表示一個或多個,*表示0個或多個)

An ID is just a string; the lack of quote characters in the first two forms is just for simplicity. There is no semantic difference between abc_2 and "abc_2", or between 2.34 and "2.34". Obviously, to use a keyword as an ID, it must be quoted. Note that, in HTML strings, angle brackets must occur in matched pairs, and unescaped newlines are allowed. In addition, the content must be legal XML, so that the special XML escape sequences for ", &, <, and > may be necessary in order to embed these characters in attribute values or raw text.

ID只是一個字符串第一二種不使用引號形式僅僅是爲了簡單起見。還有就是,abc_2“abc_2”2.34“2.34”之間沒有語義差異 。很顯然,使用關鍵字作爲ID時,則必須使用引號。需要注意的是,在HTML中的字符串,尖括號必須成對出現,並且允許未轉義的換行符。此外,其內容必須是合法的XML此時特殊的XML轉義序列“,&,<>可能是必要的,以便在屬性值或原始文本中嵌入這些字符。

Both quoted strings and HTML strings are scanned as a unit, so any embedded comments will be treated as part of the strings.

引號字符串和HTML字符串都被視爲一個單元,所以任何嵌入的註釋將被當作字符串的一部分。

An edgeop is -> in directed graphs and -- in undirected graphs.

edgop->時表示有向圖,--表示無向圖。

The language supports C++-style comments: /* */ and //. In addition, a line beginning with a '#' character is considered a line output from a C preprocessor (e.g., # 34 to indicate line 34 ) and discarded.

dot語言支持C++風格的註釋:/**///此外,以''字符開頭的行被看作C預處理器的輸出(例如,#34表示第34行),dot將忽視它

Semicolons aid readability but are not required except in the rare case that a named subgraph with no body immediately preceeds an anonymous subgraph, since the precedence rules cause this sequence to be parsed as a subgraph with a heading and a body. Also, any amount of whitespace may be inserted between terminals.

分號可以提高可讀性,但不是必需的,除非在極少數情況下,一個沒有身體的命名子圖立刻被處理爲匿名子圖,因爲優先級規則導致該序列被解析爲一個帶有標題和身體的子圖。此外,在終結符之間可以插入任意數量的空白符。

As another aid for readability, dot allows single logical lines to span multiple physical lines using the standard C convention of a backslash immediately preceding a newline character. In addition, double-quoted strings can be concatenated using a '+' operator. As HTML strings can contain newline characters, they do not support the concatenation operator.

另一個增強可讀性的是dot允許單個邏輯行使用標準C\加換行來跨越多個物理行。此外,雙引號字符串可以使用“+”操作符來串連。由於HTML字符串可以包含換行符,所以HTML不支持連接運算符。

2. 子圖和羣集 Subgraphs and Clusters

Subgraphs play three roles in Graphviz. First, a subgraph can be used to represent graph structure, indicating that certain nodes and edges should be grouped together. This is the usual role for subgraphs and typically specifies semantic information about the graph components.

子圖在Graphviz中發揮的三個作用。首先,將子圖可用於表示某些節點和邊應該被組合在一起的圖形結構。這是子圖通常被指定關於圖形元件的語義信息。

In the second role, a subgraph can provide a context for setting attributes. For example, a subgraph could specify that blue is the default color for all nodes defined in it. In the context of graph drawing, a more interesting example is:

在第二個角色,子圖可以提供設置屬性的上下文。例如,子圖可以對其中定義的所有節點將藍色用作默認顏色。在圖形繪製的上下文中,一個更有趣的例子是:


subgraph { 
rank = same; A; B; C; 
} 


This (anonymous) subgraph specifies that the nodes A, B and C should all be placed on the same rank if drawn using dot.

這(匿名)子圖指定dot繪製的節點ABC都應該放在在同一等級。

The third role for subgraphs directly involves how the graph will be laid out by certain layout engines. If the name of the subgraph begins with cluster, Graphviz notes the subgraph as a special cluster subgraph. If supported, the layout engine will do the layout so that the nodes belonging to the cluster are drawn together, with the entire drawing of the cluster contained within a bounding rectangle. Note that, for good and bad, cluster subgraphs are not part of the DOT language, but solely a syntactic convention adhered to by certain of the layout engines.

子圖的第三個角色直接涉及圖形將如何被某些佈局引擎佈局。如果子圖命令以cluster開頭,Graphviz將子圖作爲一種特殊的集羣子圖(cluster subgraph)。如果佈局引擎支持集羣子圖,引擎在執行的佈局時,將屬於該集羣中的節點繪製在一起並將其包含在一個矩形邊框中。注意集羣子圖不是DOT語言的一部分,僅僅是該語法約定被某些佈局引擎的堅持。

3. 詞法和語義註釋 Lexical and Semantic Notes

If a default attribute is defined using a node, edge, or graph statement, or by an attribute assignment not attached to a node or edge, any object of the appropriate type defined afterwards will inherit this attribute value. This holds until the default attribute is set to a new value, from which point the new value is used. Objects defined before a default attribute is set will have an empty string value attached to the attribute once the default attribute definition is made.

如果在節點,邊,圖形語句或沒有附加到任何到節點或邊上屬性賦值語句中定義了默認屬性,此後,任何適當的類型的對象都將繼承這個屬性值。在被設置爲一個新的值前,默認屬性將一直保持。對象在默認屬性被設置前定義時,對象的該屬性爲空的字符串值,在默認屬性設置後,被覆蓋爲新值

Note, in particular, that a subgraph receives the attribute settings of its parent graph at the time of its definition. This can be useful; for example, one can assign a font to the root graph and all subgraphs will also use the font. For some attributes, however, this property is undesirable. If one attaches a label to the root graph, it is probably not the desired effect to have the label used by all subgraphs. Rather than listing the graph attribute at the top of the graph, and the resetting the attribute as needed in the subgraphs, one can simple defer the attribute definition if the graph until the appropriate subgraphs have been defined.

請注意,特別是子圖在定義時接收父圖的屬性設置。這很有用的例如,在根圖中指定字體,所有子圖也將使用該字體。然而,對於某些屬性,這是不可取的。如果一個標籤附加到根圖中,沒必要所有子圖都使用這個標籤。與其在圖的頂部列出的圖表屬性,不如根據需要在子圖中設置屬性。簡單的延遲屬性定義,直到適當的子圖中定義。

If an edge belongs to a cluster, its endpoints belong to that cluster. Thus, where you put an edge can effect a layout, as clusters are sometimes laid out recursively.

如果邊屬於羣集,其端點也屬於該集羣。因此,由於集羣有時會遞歸佈局,將邊緣放到其中可以影響佈局。

There are certain restrictions on subgraphs and clusters. First, at present, the names of a graph and it subgraphs share the same namespace. Thus, each subgraph must have a unique name. Second, although nodes can belong to any number of subgraphs, it is assumed clusters form a strict hierarchy when viewed as subsets of nodes and edges.

在子圖和集羣中存在一些限制。 首先,目前,圖的名稱和它的子圖共享同一命名空間。因此,每個子圖必須有一個唯一的名稱。第二,雖然節點可以屬於任意數量的子圖,當集羣被看作節點和邊的子集,假設集羣形成嚴格的層次結構。

4. 字符編碼 Character encodings

The DOT language assumes at least the ascii character set. Quoted strings, both ordinary and HTML-like, may contain non-ascii characters. In most cases, these strings are uninterpreted: they simply serve as unique identifiers or values passed through untouched. Labels, however, are meant to be displayed, which requires that the software be able to compute the size of the text and determine the appropriate glyphs. For this, it needs to know what character encoding is used.

DOT的語言假定至少是ASCII字符集。帶引號的字符串(包括普通和類似HTML的字符串)可能包含非ASCII字符。在大多數情況下,這些字符串是不解釋:他們只是充當唯一標識符或傳遞中不變的值。但是,標籤必須要顯示,這就要求該軟件能夠計算的文本的大小,並確定適當的字形。爲此,它需要知道使用的字符編碼。

By default, DOT assumes the UTF-8 character encoding. It also accepts the Latin1 (ISO-8859-1) character set, assuming the input graph uses the charset attribute to specify this. For graphs using other character sets, there are usually programs, such as iconv, which will translate from one character set to another.

默認情況下,dot假設使用UTF-8字符編碼。在輸入圖中使用的字符集屬性指定時,dot也接受了拉丁文(ISO-8859-1)字符集。使用其他字符集的圖,存在一些如iconv的軟件,可以將轉換字符集。

 Another way to avoid non-ascii characters in labels is to use HTML entities for special characters. During label evaluation, these entities are translated into the underlying character. This table shows the supported entities, with their Unicode value, a typical glyph, and the HTML entity name. Thus, to include a lower-case Greek beta into a string, one can use the ascii sequence β. In general, one should only use entities that are allowed in the output character set, and for which there is a glyph in the font.

另一種方式避免在標籤中使用非ASCII字符是使用HTML實體的特殊字符。在標籤的計算中,這些實體被翻譯成、底層字符。 下顯示了支持的實體,表中展示了Unicode值,典型的字形和HTML實體名稱。因此,爲包括小寫希臘的β成一個字符串,可以使用ASCII字符序列&beta。一般情況下,應該只使用被允許在輸出字符集中以及在在字體的字形中的。

 

In quoted strings in DOT, the only escaped character is double-quote ("). That is, in quoted strings, the dyad \" is converted to "; all other characters are left unchanged. In particular, \\ remains \\. Layout engines may apply additional escape sequences.

在dot帶引號的字符串,唯一的轉義字符是雙引號(),也就是說,在帶引號的字符串,\“轉換爲”;所有其它字符都保持不變。特別是,\\依然是\\。佈局引擎可能會應用額外轉義序列。

後記

Dot語言來自http://hughesbennett.co.uk/Graphviz的引用文檔,除了dot語言外,還有一系列的文檔,例如結點和邊的屬性結點的形狀顏色的名字,此外hughesbennet網站本身很有趣,國外作者還真是熱心加性情。後來發現,hughesbennet上的dot語言及其相關屬性的文檔來自GraphViz官方,但是,官方的文檔的排版不如hughesbennet中的簡潔大方。

Graphviz的中文處理問題:

1.保存爲UTF8格式

2.設置中文字體,一些中文字體的對應的名字如下:

黑體:SimHei 宋體:SimSun 新宋體:NSimSun 仿宋:FangSong 楷體:KaiTi

仿宋_GB2312FangSong_GB2312 楷體_GB2312KaiTi_GB2312

此外,Ubuntu中系統安裝了那些字體可以通過命令fc-list查看。

參考文獻:

維基百科Graphviz:http://en.wikipedia.org/wiki/Graphviz

Graphviz項目的主頁:http://graphviz.org/ 

官方文檔地址http://www.graphviz.org/Documentation.php

Graphvizhttp://hughesbennett.co.uk/Graphviz

Graphviz中文亂碼問題

使用 Graphviz 生成自動化系統圖:http://www.ibm.com/developerworks/cn/aix/library/au-aix-graphviz/

 

發佈了66 篇原創文章 · 獲贊 18 · 訪問量 28萬+
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章