Java XML escaping methods and the handling of Chinese characters

The simplest and most convenient way to escape XML is to call the escapeXml method of StringEscapeUtils from Apache's commons.lang jar. However, commons lang 2.x and commons lang 3.x handle this method differently.

In commons lang 2.x, StringEscapeUtils.escapeXml escapes the XML characters ", &, <, > and ', and it also escapes every character whose Unicode value is greater than 0x7F.

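For this purpose an XML Entities object is created, and the characters defined in BASIC_ARRAY and APOS_ARRAY are added to it. Whenever one of these characters is encountered, it is escaped.

BASIC_ARRAY is defined as: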

private static final String[][] BASIC_ARRAY = {{"quot", "34"}, // " - double-quote
        {"amp", "38"}, // & - ampersand
        {"lt", "60"}, // < - less-than
        {"gt", "62"}, // > - greater-than
    };

APOS_ARRAY is defined as:

private static final String[][] APOS_ARRAY = {{"apos", "39"}, // XML apostrophe
    };

These characters are therefore escaped. escapeXml delegates the actual work to the Entities.XML.escape method:

public void escape(Writer writer, String str) throws IOException {
        int len = str.length();
        for (int i = 0; i < len; i++) {
            char c = str.charAt(i);
            String entityName = this.entityName(c);
            if (entityName == null) {
                if (c > 0x7F) {
                    writer.write("&#");
                    writer.write(Integer.toString(c, 10));
                    writer.write(';');
                } else {
                    writer.write(c);
                }
            } else {
                writer.write('&');
                writer.write(entityName);
                writer.write(';');
            }
        }
    }

As the code shows, characters whose Unicode value is greater than 0x7F are escaped as well, so this method also escapes Chinese characters.

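If you do not want Chinese characters to be escaped, you can either adapt the code above yourself and drop the escaping of characters greater than 0x7F, or use the escapeXml family of methods in commons lang3. In commons lang3 these methods were redesigned around the strategy pattern. The relevant methods are escapeXml, escapeXml10 and escapeXml11.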
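As a rough illustration of the first option, here is a minimal sketch that uses only the JDK. The class name PlainXmlEscaper and the escapeNonAscii flag are hypothetical and are not part of commons lang. With the flag off, only the five special characters are escaped; with it on, the loop imitates the 2.x behaviour shown above.

```java
final class PlainXmlEscaper {
    private PlainXmlEscaper() { }

    /**
     * Escapes the five XML special characters. When escapeNonAscii is true,
     * chars above 0x7F are also written as decimal numeric entities,
     * imitating the commons lang 2.x loop; when false they are copied as-is.
     */
    static String escape(String str, boolean escapeNonAscii) {
        if (str == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(str.length() + 16);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            switch (c) {
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (escapeNonAscii && c > 0x7F) {
                        sb.append("&#").append((int) c).append(';');
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Calling PlainXmlEscaper.escape(text, false) leaves Chinese text untouched, while passing true turns U+4E2D into &#20013;.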

Of these, escapeXml is deprecated. It escapes only the five XML characters ", &, <, > and '. It registers two translators on ESCAPE_XML: new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()).

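In addition to escaping the five characters above, escapeXml10 replaces certain control characters, such as \b (\u0008), with the empty string. Note that \t, \n and \r are valid in XML 1.0 and are left alone, as the table below shows. The reason is that XML 1.0 is a plain-text format and cannot represent these control characters. Unpaired surrogate code points cannot be represented either, so they are removed as well. The translators registered for escapeXml10 therefore include, besides new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()):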

new LookupTranslator(
            new String[][] {
                    { "\u0000", "" }, { "\u0001", "" }, { "\u0002", "" }, { "\u0003", "" }, { "\u0004", "" }, { "\u0005", "" }, { "\u0006", "" }, { "\u0007", "" }, { "\u0008", "" },
                    { "\u000b", "" }, { "\u000c", "" }, { "\u000e", "" }, { "\u000f", "" }, { "\u0010", "" }, { "\u0011", "" }, { "\u0012", "" }, { "\u0013", "" }, { "\u0014", "" },
                    { "\u0015", "" }, { "\u0016", "" }, { "\u0017", "" }, { "\u0018", "" }, { "\u0019", "" }, { "\u001a", "" }, { "\u001b", "" }, { "\u001c", "" }, { "\u001d", "" },
                    { "\u001e", "" }, { "\u001f", "" }, { "\ufffe", "" }, { "\uffff", "" }
            }),
    and
    new UnicodeUnpairedSurrogateRemover().

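The first handles control characters. The second handles unpaired surrogates by removing code units in [#xD800, #xDFFF] that are not part of a valid pair. In other words, escapeXml10 removes every code point outside the following ranges:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]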
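The range rule can be expressed directly as a predicate over code points. The following is only a sketch using the JDK; the class Xml10Chars and its methods are hypothetical names, not commons lang3 API, and it covers only the removal step (not the numeric escaping of other ranges).

```java
final class Xml10Chars {
    private Xml10Chars() { }

    /** True if the code point lies in one of the ranges listed above. */
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that is not allowed, walking by code point. */
    static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // For an unpaired surrogate, codePointAt returns the surrogate
            // value itself, which falls outside the allowed ranges.
            int cp = s.codePointAt(i);
            if (isAllowed(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

A valid surrogate pair is read as a single code point at or above #x10000 and is kept, whereas a lone high or low surrogate is dropped.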

escapeXml10 also registers the two translators NumericEntityEscaper.between(0x7f, 0x84) and NumericEntityEscaper.between(0x86, 0x9f), which escape the characters in the two ranges [#x7F-#x84] | [#x86-#x9F] as numeric entities.

For escapeXml11, XML 1.1 can represent some control characters, so the control-character translator differs from the one used by escapeXml10:

new LookupTranslator(
    new String[][] {
            { "\u0000", "" },
            { "\u000b", "&#11;" },
            { "\u000c", "&#12;" },
            { "\ufffe", "" },
            { "\uffff", "" }
})

escapeXml11 removes every code point outside the following ranges:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

escapeXml11 also registers

NumericEntityEscaper.between(0x1, 0x8),
NumericEntityEscaper.between(0xe, 0x1f),
NumericEntityEscaper.between(0x7f, 0x84),
NumericEntityEscaper.between(0x86, 0x9f),

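as four translators. As a result, code points in the ranges [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F] are escaped (in the library source, #xB and #xC are mapped to &#11; and &#12; by the lookup table rather than by these four translators).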
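To make the effect of these range translators concrete, here is a small JDK-only sketch of range-based numeric escaping. NumericRangeEscaper and XML11_RANGES are hypothetical names chosen for illustration; the sketch only mimics the idea of NumericEntityEscaper.between and is not the library's implementation.

```java
final class NumericRangeEscaper {
    private NumericRangeEscaper() { }

    /** The four inclusive ranges registered for escapeXml11, as listed above. */
    static final int[][] XML11_RANGES = {
        {0x1, 0x8}, {0xE, 0x1F}, {0x7F, 0x84}, {0x86, 0x9F}
    };

    /** Writes code points inside any range as decimal entities such as &#127;. */
    static String escape(String s, int[][] ranges) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (inAnyRange(cp, ranges)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    private static boolean inAnyRange(int cp, int[][] ranges) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Characters outside the ranges, including Chinese text, pass through unchanged.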

These three are the main functions. The following describes roughly how they work.

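Each of the three functions uses a different translator, but all of them are instances of AggregateTranslator. As the name suggests, this is a composite translator whose job is to invoke the group of translators registered with it. Every translator extends the abstract class CharSequenceTranslator, and the escape methods all directly call the following method of CharSequenceTranslator: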

/**
     * Helper for non-Writer usage. 
     * @param input CharSequence to be translated
     * @return String output of translation
     */
    public final String translate(final CharSequence input) {
        if (input == null) {
            return null;
        }
        try {
            final StringWriter writer = new StringWriter(input.length() * 2);
            translate(input, writer);
            return writer.toString();
        } catch (final IOException ioe) {
            // this should never ever happen while writing to a StringWriter
            throw new RuntimeException(ioe);
        }
    }

This method in turn calls:

      /**
     * Translate an input onto a Writer. This is intentionally final as its algorithm is 
     * tightly coupled with the abstract method of this class. 
      *
     * @param input CharSequence that is being translated
     * @param out Writer to translate the text to
     * @throws IOException if and only if the Writer produces an IOException
     */
    public final void translate(final CharSequence input, final Writer out) throws IOException {
        if (out == null) {
            throw new IllegalArgumentException("The Writer must not be null");
        }
        if (input == null) {
            return;
        }
        int pos = 0;
        final int len = input.length();
        while (pos < len) {
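            // Translate starting at pos and return the number of code points consumed.
            // Note that this is a count of code points, not of chars (code units).
            // This method is abstract in CharSequenceTranslator and must be implemented by each subclass;
            // by contract, each subclass has to handle surrogate pairs itself.
            // For background on surrogate pairs, see my other post on the length concepts of Java char and String.
            final int consumed = translate(input, pos, out);
            if (consumed == 0) {   // no translator had anything to escape at this position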
                // inlined implementation of Character.toChars(Character.codePointAt(input, pos))
                // avoids allocating temp char arrays and duplicate checks
                char c1 = input.charAt(pos);
                out.write(c1);
                pos++;
                // if the current position holds a surrogate pair, both halves of the supplementary character must be written out together
                if (Character.isHighSurrogate(c1) && pos < len) {
                    char c2 = input.charAt(pos);
                    if (Character.isLowSurrogate(c2)) {
                      out.write(c2);
                      pos++;
                    }
                }
                continue;
            }
            // contract with translators is that they have to understand codepoints
            // and they just took care of a surrogate pair
            // consumed is a count of code points, so look up how many code units each one occupies and advance pos to the next code point to process
            for (int pt = 0; pt < consumed; pt++) {
                pos += Character.charCount(Character.codePointAt(input, pos));
            }
        }
    }

That method in turn calls:

	/**
	* Translate a set of codepoints, represented by an int index into a CharSequence, 
	* into another set of codepoints. The number of codepoints consumed must be returned, 
	* and the only IOExceptions thrown must be from interacting with the Writer so that 
	* the top level API may reliably ignore StringWriter IOExceptions. 
	*
	* @param input CharSequence that is being translated
	* @param index int representing the current point of translation
	* @param out Writer to translate the text to
	* @return int count of codepoints consumed
	* @throws IOException if and only if the Writer produces an IOException
	*/
	public abstract int translate(CharSequence input, int index, Writer out) throws IOException;

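This is an abstract method that every subclass must implement. AggregateTranslator's translate method simply calls the translate methods of the translators aggregated in it.

AggregateTranslator's translate method is as follows: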

	/**
	* The first translator to consume codepoints from the input is the 'winner'. 
	* Execution stops with the number of consumed codepoints being returned. 
	* {@inheritDoc}
	*/
	@Override
	public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
		for (final CharSequenceTranslator translator : translators) {
		    final int consumed = translator.translate(input, index, out);
		    if(consumed != 0) {
		        return consumed;
		    }
		}
		return 0;
    	}
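The "first translator that consumes input wins" rule can be tried out without the library. The sketch below uses only the JDK; MiniTranslator and MiniAggregate are hypothetical stand-ins for CharSequenceTranslator and AggregateTranslator, simplified for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

interface MiniTranslator {
    /** Returns the number of code points consumed at index, or 0 if not handled. */
    int translate(CharSequence input, int index, Writer out) throws IOException;
}

final class MiniAggregate implements MiniTranslator {
    private final List<MiniTranslator> translators;

    MiniAggregate(MiniTranslator... translators) {
        this.translators = Arrays.asList(translators);
    }

    @Override
    public int translate(CharSequence input, int index, Writer out) throws IOException {
        // The first translator that consumes anything wins; later ones are not asked.
        for (MiniTranslator t : translators) {
            int consumed = t.translate(input, index, out);
            if (consumed != 0) {
                return consumed;
            }
        }
        return 0;
    }

    /** Simplified driver loop: copy unhandled code points, skip consumed ones. */
    String translateAll(CharSequence input) {
        StringWriter out = new StringWriter();
        try {
            int pos = 0;
            while (pos < input.length()) {
                int consumed = translate(input, pos, out);
                if (consumed == 0) {
                    int cp = Character.codePointAt(input, pos);
                    out.write(Character.toChars(cp));
                    pos += Character.charCount(cp);
                    continue;
                }
                for (int pt = 0; pt < consumed; pt++) {
                    pos += Character.charCount(Character.codePointAt(input, pos));
                }
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected with a StringWriter
        }
        return out.toString();
    }
}
```

A translator can then be supplied as a lambda, for example (in, i, out) -> { if (in.charAt(i) == '&') { out.write("&amp;"); return 1; } return 0; }, and the order in which translators are passed to the constructor decides which one handles a character that several of them recognise.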

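In addition, let us look at the implementation of the frequently used LookupTranslator.
Its constructor iterates over the character mapping table passed in and converts the two-dimensional array into a map stored in lookupMap, which makes later lookups easier. It also collects the first character of each key into prefixSet, and records the lengths of the longest and shortest keys in the variables longest and shortest.
Its implementation of translate is as follows: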

@Override
    public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
        // compare starting at position index of input; return as soon as a match is found
	  // check if translation exists for the input at position index
        if (prefixSet.contains(input.charAt(index))) {
            int max = longest;
            if (index + longest > input.length()) {
                max = input.length() - index;
            }
            // try to match the longest key first
            // implement greedy algorithm by trying maximum match first
            for (int i = max; i >= shortest; i--) {
                final CharSequence subSeq = input.subSequence(index, index + i);
                final String result = lookupMap.get(subSeq.toString());

                if (result != null) {
                    out.write(result);
                    return i;
                }
            }
        }
        return 0;
    }
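The constructor described before the listing is not shown in the post, so here is a sketch of what it sets up, together with the greedy longest-first match. LookupTableSketch and its method names are hypothetical; only the field names lookupMap, prefixSet, shortest and longest follow the description above. Keys are assumed to be non-empty.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LookupTableSketch {
    private final Map<String, String> lookupMap = new HashMap<>();
    private final Set<Character> prefixSet = new HashSet<>();
    private int shortest = Integer.MAX_VALUE;
    private int longest = 0;

    /** Each row is {key, replacement}; keys are assumed to be non-empty. */
    LookupTableSketch(String[][] table) {
        for (String[] pair : table) {
            lookupMap.put(pair[0], pair[1]);
            prefixSet.add(pair[0].charAt(0));
            shortest = Math.min(shortest, pair[0].length());
            longest = Math.max(longest, pair[0].length());
        }
    }

    /** Returns the longest key that matches at index, or null if none does. */
    String longestKeyAt(CharSequence input, int index) {
        if (!prefixSet.contains(input.charAt(index))) {
            return null; // cheap rejection: no key starts with this char
        }
        int max = Math.min(longest, input.length() - index);
        for (int len = max; len >= shortest; len--) {
            String candidate = input.subSequence(index, index + len).toString();
            if (lookupMap.containsKey(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    String replacementFor(String key) {
        return lookupMap.get(key);
    }
}
```

With a table containing both "<" and "<<", the input "a<<b" matches "<<" at index 1 because longer candidates are tried first.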

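That is the concrete implementation. However, I think this function has a problem: it returns the length in chars rather than the number of code points. If a key in the lookup table contains supplementary characters, then this part of CharSequenceTranslator's translate function: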

	// contract with translators is that they have to understand codepoints
	// and they just took care of a surrogate pair
	for (int pt = 0; pt < consumed; pt++) {
		pos += Character.charCount(Character.codePointAt(input, pos));
	}

would presumably misbehave. This is something to watch out for.
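The suspected problem can be reproduced in isolation. The sketch below copies only the advancing loop into a hypothetical helper, ConsumedCountDemo.advance, and compares what happens when a key made of one supplementary character (two chars, one code point) is reported as 2 consumed versus 1 consumed.

```java
final class ConsumedCountDemo {
    private ConsumedCountDemo() { }

    /**
     * Advances pos the same way the loop above does, treating consumed
     * as a number of code points. Returns the new position.
     */
    static int advance(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    public static void main(String[] args) {
        String key = "\uD83D\uDE00";      // one code point, two chars
        String input = key + "ab";

        // Reporting the char length (2) skips past 'a' as well.
        System.out.println(advance(input, 0, key.length()));
        // Reporting the code point count (1) stops right after the key.
        System.out.println(advance(input, 0, key.codePointCount(0, key.length())));
    }
}
```

In the first call the position ends up one char too far, so the character following the key would be skipped without being written. If the key sat at the very end of the input, the extra iteration would call Character.codePointAt past the end and throw an IndexOutOfBoundsException.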
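With that, the way the escapeXml family of functions works should be clear. In essence they create a CharSequenceTranslator and call its translate function to do the escaping. We can also assemble our own CharSequenceTranslator to suit our needs instead of calling the predefined escapeXml functions.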
