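Problem description
In the past three months the notification application has hit a DNS cache problem twice. After the first occurrence a restart made it go away and it did not come back, so we did not pay much attention to it. It recently happened again, so a thorough investigation is clearly needed. The error log is as follows: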
2018-03-16 18:53:59,501 ERROR [DefaultMessageListenerContainer-1] (com.bill99.asap.service.CryptoClient.seal(CryptoClient.java:34))- null
java.lang.NullPointerException
at java.net.InetAddress$Cache.put(InetAddress.java:779) ~[?:1.7.0_79]
at java.net.InetAddress.cacheAddresses(InetAddress.java:858) ~[?:1.7.0_79]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1334) ~[?:1.7.0_79]
at java.net.InetAddress.getAllByName0(InetAddress.java:1248) ~[?:1.7.0_79]
at java.net.InetAddress.getAllByName(InetAddress.java:1164) ~[?:1.7.0_79]
at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[?:1.7.0_79]
at java.net.InetAddress.getByName(InetAddress.java:1048) ~[?:1.7.0_79]
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220) ~[?:1.7.0_79]
at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.7.0_79]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[?:1.7.0_79]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ~[?:1.7.0_79]
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) ~[?:1.7.0_79]
at sun.net.www.http.HttpClient.New(HttpClient.java:308) ~[?:1.7.0_79]
at sun.net.www.http.HttpClient.New(HttpClient.java:326) ~[?:1.7.0_79]
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:997) ~[?:1.7.0_79]
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933) ~[?:1.7.0_79]
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:851) ~[?:1.7.0_79]
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1092) ~[?:1.7.0_79]
at org.springframework.ws.transport.http.HttpUrlConnection.getRequestOutputStream(HttpUrlConnection.java:81) ~[spring-ws-core.jar:1.5.6]
at org.springframework.ws.transport.AbstractSenderConnection$RequestTransportOutputStream.createOutputStream(AbstractSenderConnection.java:101) ~[spring-ws-core.jar:1.5.6]
at org.springframework.ws.transport.TransportOutputStream.getOutputStream(TransportOutputStream.java:41) ~[spring-ws-core.jar:1.5.6]
at org.springframework.ws.transport.TransportOutputStream.write(TransportOutputStream.java:60) ~[spring-ws-core.jar:1.5.6]
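Once this exception appears, it is thrown continuously in large numbers, and the node becomes unavailable.
Analysis
1. Since InetAddress$Cache.put throws the NullPointerException, let's look at its source code: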
if (policy != InetAddressCachePolicy.FOREVER) {
// As we iterate in insertion order we can
// terminate when a non-expired entry is found.
LinkedList<String> expired = new LinkedList<>();
long now = System.currentTimeMillis();
for (String key : cache.keySet()) {
CacheEntry entry = cache.get(key);
if (entry.expiration >= 0 && entry.expiration < now) {
expired.add(key);
} else {
break;
}
}
for (String key : expired) {
cache.remove(key);
}
}
The NullPointerException is thrown at entry.expiration, which means the entry fetched from cache is null. Here is where the cache is written:
CacheEntry entry = new CacheEntry(addresses, expiration);
cache.put(host, entry);
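Every write creates a new CacheEntry and puts it into cache, so null is never written. The first guess is therefore a multithreading problem: a remove happens while cache.keySet() is being iterated, so cache.get(key) returns null. The source shows two places that remove from cache, the put method and the get method. The put code is shown above: it checks each entry for expiry while iterating and then removes the expired ones in one pass. The other place is the get method: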
public CacheEntry get(String host) {
int policy = getPolicy();
if (policy == InetAddressCachePolicy.NEVER) {
return null;
}
CacheEntry entry = cache.get(host);
// check if entry has expired
if (entry != null && policy != InetAddressCachePolicy.FOREVER) {
if (entry.expiration >= 0 && entry.expiration < System.currentTimeMillis()) {
cache.remove(host);
entry = null;
}
}
return entry;
}
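Like put, get checks expiry on each call and removes the entry if it has expired.
So a multithreading problem could arise in roughly two ways: (1) put and get are called concurrently, or (2) several threads call put at the same time. Looking further at where put and get are called, there are three places: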
private static void cacheInitIfNeeded() {
assert Thread.holdsLock(addressCache);
if (addressCacheInit) {
return;
}
unknown_array = new InetAddress[1];
unknown_array[0] = impl.anyLocalAddress();
addressCache.put(impl.anyLocalAddress().getHostName(),
unknown_array);
addressCacheInit = true;
}
/*
* Cache the given hostname and addresses.
*/
private static void cacheAddresses(String hostname,
InetAddress[] addresses,
boolean success) {
hostname = hostname.toLowerCase();
synchronized (addressCache) {
cacheInitIfNeeded();
if (success) {
addressCache.put(hostname, addresses);
} else {
negativeCache.put(hostname, addresses);
}
}
}
/*
* Lookup hostname in cache (positive & negative cache). If
* found return addresses, null if not found.
*/
private static InetAddress[] getCachedAddresses(String hostname) {
hostname = hostname.toLowerCase();
// search both positive & negative caches
synchronized (addressCache) {
cacheInitIfNeeded();
CacheEntry entry = addressCache.get(hostname);
if (entry == null) {
entry = negativeCache.get(hostname);
}
if (entry != null) {
return entry.addresses;
}
}
// not found
return null;
}
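cacheInitIfNeeded is called only from cacheAddresses and getCachedAddresses, to check whether the cache has been initialized. Both of those methods hold the addressCache object lock, so there is no multithreading problem on this path.
2. Guess: external code accesses addressCache directly instead of going through the provided methods
The source shows that addressCache is a private field with no public accessor: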
private static Cache addressCache = new Cache(Cache.Type.Positive);
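So business code should not be able to use it directly, except through reflection. A quick search of the whole codebase for the keyword "addressCache" turned up code similar to this: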
static{
Class clazz = java.net.InetAddress.class;
final Field cacheField = clazz.getDeclaredField("addressCache");
cacheField.setAccessible(true);
final Object o = cacheField.get(clazz);
Class clazz2 = o.getClass();
final Field cacheMapField = clazz2.getDeclaredField("cache");
cacheMapField.setAccessible(true);
final Map cacheMap = (Map)cacheMapField.get(o);
}
It obtains the addressCache object through reflection, then obtains its cache field (cache is a LinkedHashMap), and provides a method similar to the following:
public void remove(String host){
cacheMap.remove(host);
}
This exposes a method that clears cache entries without taking any lock, which causes the multithreading problem: remove runs while cache.keySet() is being iterated.
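But this does not quite match the symptoms. If a remove happens to coincide with the cache.keySet() iteration, that call throws, but the next call will most likely succeed. It would not produce a continuous stream of exceptions and an unavailable node.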
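The interleaving described above can be replayed on a single thread to see why it is only a transient failure. This is a minimal sketch, not the JDK code: InterleavingDemo, its CacheEntry class, and the host names are hypothetical stand-ins, and the "other thread" is simulated by calling remove between the iterator step and the lookup.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the JDK 7 cache internals; names are illustrative only.
class InterleavingDemo {
    static class CacheEntry {
        final long expiration;
        CacheEntry(long expiration) { this.expiration = expiration; }
    }

    // Replays, on one thread, the interleaving "the iterator hands out a key,
    // another thread removes it, then the first thread calls get(key)".
    // Returns true if the lookup came back null, i.e. the expiry loop would
    // have thrown a NullPointerException on entry.expiration.
    static boolean lookupAfterRacingRemove(Map<String, CacheEntry> cache) {
        Iterator<String> it = cache.keySet().iterator();
        if (!it.hasNext()) {
            return false;
        }
        String key = it.next();             // step 1: the put() loop picks a key
        cache.remove(key);                  // step 2: simulated unsynchronized remove
        CacheEntry entry = cache.get(key);  // step 3: the put() loop looks the key up
        return entry == null;
    }

    // The same scan with no interference; true if any lookup returns null.
    static boolean scanFindsNull(Map<String, CacheEntry> cache) {
        for (String key : cache.keySet()) {
            if (cache.get(key) == null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, CacheEntry> cache = new LinkedHashMap<>();
        cache.put("a.example", new CacheEntry(1L));
        cache.put("b.example", new CacheEntry(2L));
        System.out.println("racing remove saw null: " + lookupAfterRacingRemove(cache));
        System.out.println("next scan saw null: " + scanFindsNull(cache));
    }
}
```

The first lookup returns null, but the removed key is gone afterwards, so the following scan is clean. That is consistent with the reasoning that this race alone cannot explain a node that keeps failing.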
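3. Guess: addressCache contains a key whose value comes back as null
In that case the entry stays in addressCache, every address lookup throws a NullPointerException, and the entry is never cleaned up. This can be tested with the following code: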
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Random;

public class TEst {
public static void main(String[] args) throws IOException, InterruptedException {
final LinkedHashMap<Integer, HH> map = new LinkedHashMap<>();
new Thread(new Runnable() {
@Override
public void run() {
for (int i = 0; i < 2000; i++) {
map.put(new Random().nextInt(1000), new HH(new Random(100).nextInt()));
}
}
}).start();
for (int i = 0; i < 100; i++) {
new Thread(new Runnable() {
@Override
public void run() {
for (int i = 0; i < 500; i++) {
map.remove(new Random().nextInt(1000));
}
}
}).start();
}
Thread.sleep(2000);
System.out.println("size=" + map.keySet().size() + "," + map.keySet());
for (Integer s : map.keySet()) {
System.out.println(map.get(s));
}
}
}
class HH {
private int k;
public HH(int k) {
this.k = k;
}
public int getK() {
return k;
}
public void setK(int k) {
this.k = k;
}
}
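This simulates a single thread doing put while many business threads call remove concurrently. Run it and check the output (it may take several runs to reproduce):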
size=0,[121, 517, 208]
null
null
null
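The guessed situation does occur. The size field of HashMap is not thread-safe, so under concurrent modification it can end up as 0, which makes every get return null. HashMap has many other multithreading problems as well, since it was never designed for concurrent use. That roughly explains the cause.
Solution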
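To see why such a state takes the node down for good, the expiry sweep from put can be run against a map in which a key is visible during iteration but looks up to null. The sketch below is an illustration under assumptions: StuckCacheDemo and its CacheEntry are hypothetical stand-ins, and the corrupted state is approximated by mapping a key to null rather than by actually corrupting a HashMap.

```java
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in classes; not the JDK implementation.
class StuckCacheDemo {
    static class CacheEntry {
        final long expiration;
        CacheEntry(long expiration) { this.expiration = expiration; }
    }

    // Same shape as the expiry sweep in the JDK 7 Cache.put shown earlier.
    static void sweepExpired(Map<String, CacheEntry> cache, long now) {
        LinkedList<String> expired = new LinkedList<>();
        for (String key : cache.keySet()) {
            CacheEntry entry = cache.get(key);
            // Throws NullPointerException when the lookup returns null.
            if (entry.expiration >= 0 && entry.expiration < now) {
                expired.add(key);
            } else {
                break;
            }
        }
        for (String key : expired) {
            cache.remove(key);
        }
    }

    // Runs the sweep `attempts` times and counts how many end in an NPE.
    static int countFailures(Map<String, CacheEntry> cache, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                sweepExpired(cache, System.currentTimeMillis());
            } catch (NullPointerException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, CacheEntry> cache = new LinkedHashMap<>();
        // Assumption: a null mapping approximates the corrupted state in which
        // the key shows up in keySet() but get(key) returns null.
        cache.put("broken.example", null);
        System.out.println("failures: " + countFailures(cache, 5));
        System.out.println("key still present: " + cache.containsKey("broken.example"));
    }
}
```

Every sweep fails before it reaches the removal step, so the bad key is never evicted and each later call hits the same exception. This matches the observed flood of identical stack traces.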
Guard the reflectively obtained cache object with the same lock that cacheAddresses uses, or simply do not touch the cache object from business code.
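The locking option might look roughly like the sketch below. To keep it runnable on any JDK, it does not reflect into java.net.InetAddress: FakeInetAddress is a hypothetical stand-in that mimics the JDK 7 layout (a private static addressCache whose monitor guards an inner cache map), and SafeCacheCleaner and LockedRemoveDemo are illustrative names as well. The point is only that the helper keeps the reflected addressCache object and synchronizes on it before touching the map.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for java.net.InetAddress on JDK 7: a private static
// cache object that is also the monitor guarding every access to its map.
class FakeInetAddress {
    static final class Cache {
        private final LinkedHashMap<String, Object> cache = new LinkedHashMap<>();
    }
    private static final Cache addressCache = new Cache();

    static void cacheAddress(String host, Object value) {
        synchronized (addressCache) {
            addressCache.cache.put(host, value);
        }
    }
}

// Business-side helper: reflective access as before, but it keeps the lock
// object and takes it before touching the map.
class SafeCacheCleaner {
    private final Object lock;
    private final Map<String, Object> cacheMap;

    @SuppressWarnings("unchecked")
    SafeCacheCleaner(Class<?> owner) throws ReflectiveOperationException {
        Field cacheField = owner.getDeclaredField("addressCache");
        cacheField.setAccessible(true);
        lock = cacheField.get(null);  // static field, so no instance is needed
        Field mapField = lock.getClass().getDeclaredField("cache");
        mapField.setAccessible(true);
        cacheMap = (Map<String, Object>) mapField.get(lock);
    }

    void remove(String host) {
        synchronized (lock) {  // same monitor the owning class uses
            cacheMap.remove(host);
        }
    }

    // Returns the entry count; fails if a visible key looks up to null.
    int checkAndCount() {
        synchronized (lock) {
            int seen = 0;
            for (String key : cacheMap.keySet()) {
                if (cacheMap.get(key) == null) {
                    throw new IllegalStateException("null entry for " + key);
                }
                seen++;
            }
            if (seen != cacheMap.size()) {
                throw new IllegalStateException("size mismatch");
            }
            return seen;
        }
    }
}

class LockedRemoveDemo {
    // One thread writes `hosts` entries while four threads remove them through
    // the locked helper; returns how many entries are left afterwards.
    static int run(final int hosts) {
        try {
            final SafeCacheCleaner cleaner = new SafeCacheCleaner(FakeInetAddress.class);
            Thread[] threads = new Thread[5];
            threads[0] = new Thread(() -> {
                for (int i = 0; i < hosts; i++) FakeInetAddress.cacheAddress("host" + i, i);
            });
            for (int t = 1; t < threads.length; t++) {
                threads[t] = new Thread(() -> {
                    for (int i = 0; i < hosts; i++) cleaner.remove("host" + i);
                });
            }
            for (Thread th : threads) th.start();
            for (Thread th : threads) th.join();
            return cleaner.checkAndCount();
        } catch (ReflectiveOperationException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("entries left: " + run(2000));
    }
}
```

The number of entries left varies from run to run, because a remove may arrive before the corresponding put. What the shared lock is meant to guarantee is that the map stays internally consistent: every key that is still visible looks up to a non-null value.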
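Alibaba's open-source project on GitHub for manipulating the DNS cache is a useful reference: https://github.com/alibaba/java-dns-cache-manipulator
Summary
Some of the time in this investigation went into checking whether the JDK classes themselves had a bug, which was largely wasted. The other lesson is not to rule out any possibility while troubleshooting: the problem often lies in the places that are taken for granted.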