【Hive】Inspector

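ObjectInspector helps us examine the internal structure of complex objects. It decouples how data is used from how data is formatted, which improves code reuse.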

An ObjectInspector instance represents one data type together with one specific way of storing data of that type in memory.

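An ObjectInspector object holds no data itself. It only describes how the data is stored and acts as a uniform manager, or proxy, for operations on the data objects.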

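The ObjectInspector interface keeps Hive from being tied to any single data format, so that a data flow can:

Switch between different input and output formats at its input and output ends
Use different data formats in different Operators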
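
The decoupling described above can be illustrated with a small plain-Java model. This is only a sketch: `RowInspector`, `ArrayRowInspector`, and `DelimitedRowInspector` are hypothetical names invented for illustration and are not part of Hive's API. The point is that the consumer method `describe` is written once and works on any row format, as long as a matching inspector is passed in with the row.

```java
public class RowInspectorSketch {
    // Hypothetical, simplified stand-in for the idea behind StructObjectInspector:
    // it holds no row data, only the knowledge of how to read a field.
    interface RowInspector {
        Object getField(Object row, int index);
        int fieldCount(Object row);
    }

    // Reads rows stored as an already-parsed Object[].
    static final class ArrayRowInspector implements RowInspector {
        public Object getField(Object row, int index) {
            return ((Object[]) row)[index];
        }
        public int fieldCount(Object row) {
            return ((Object[]) row).length;
        }
    }

    // Reads rows stored as one tab-delimited String, splitting on demand.
    static final class DelimitedRowInspector implements RowInspector {
        public Object getField(Object row, int index) {
            return ((String) row).split("\t", -1)[index];
        }
        public int fieldCount(Object row) {
            return ((String) row).split("\t", -1).length;
        }
    }

    // Consumer code: independent of how the row is actually stored.
    static String describe(Object row, RowInspector oi) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < oi.fieldCount(row); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(oi.getField(row, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two storage formats, one consumer; both lines print hive,42
        System.out.println(describe(new Object[] {"hive", 42}, new ArrayRowInspector()));
        System.out.println(describe("hive\t42", new DelimitedRowInspector()));
    }
}
```

Because the inspectors carry no data, a single instance of each can be shared across all rows of the same format.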

The interface contains an enum, Category, which defines five kinds of types:

primitive (Primitive), list (List), key-value map (Map), struct (Struct), and union (Union)

The ObjectInspector interface is defined as follows:

public interface ObjectInspector extends Cloneable { 
  public static enum Category { 
    PRIMITIVE, LIST, MAP, STRUCT, UNION 
  }; 

  String getTypeName(); 

  Category getCategory(); 
}

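The corresponding abstract subclasses and sub-interfaces of ObjectInspector are:

StructObjectInspector: parses one row of data; it is itself made up of a set of ObjectInspectors
MapObjectInspector
ListObjectInspector
PrimitiveObjectInspector: parses primitive data types
UnionObjectInspector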
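
To show how the category and the nested inspectors work together, here is a minimal plain-Java sketch. It is a toy model, not Hive's implementation: the `Inspector` types, the `element`, `names`, and `fields` members, and the `render` and `sampleInspector` helpers are assumptions made for illustration, and only three of the five categories are handled. The type-name strings follow Hive's familiar `array<...>` and `struct<...>` notation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CategorySketch {
    enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

    // Hypothetical mini-inspector exposing the same two methods as the interface above.
    interface Inspector {
        Category getCategory();
        String getTypeName();
    }

    static final class PrimitiveInspector implements Inspector {
        private final String typeName;
        PrimitiveInspector(String typeName) { this.typeName = typeName; }
        public Category getCategory() { return Category.PRIMITIVE; }
        public String getTypeName() { return typeName; }
    }

    static final class ListInspector implements Inspector {
        final Inspector element;
        ListInspector(Inspector element) { this.element = element; }
        public Category getCategory() { return Category.LIST; }
        public String getTypeName() { return "array<" + element.getTypeName() + ">"; }
    }

    // A struct inspector is itself composed of one inspector per field.
    static final class StructInspector implements Inspector {
        final List<String> names;
        final List<Inspector> fields;
        StructInspector(List<String> names, List<Inspector> fields) {
            this.names = names;
            this.fields = fields;
        }
        public Category getCategory() { return Category.STRUCT; }
        public String getTypeName() {
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < names.size(); i++) {
                parts.add(names.get(i) + ":" + fields.get(i).getTypeName());
            }
            return "struct<" + String.join(",", parts) + ">";
        }
    }

    // Walks a value recursively, dispatching only on the inspector's category.
    static String render(Object data, Inspector oi) {
        if (data == null) {
            return "NULL";
        }
        switch (oi.getCategory()) {
            case PRIMITIVE:
                return String.valueOf(data);
            case LIST: {
                ListInspector loi = (ListInspector) oi;
                List<String> parts = new ArrayList<>();
                for (Object e : (List<?>) data) {
                    parts.add(render(e, loi.element));
                }
                return "[" + String.join(",", parts) + "]";
            }
            case STRUCT: {
                StructInspector soi = (StructInspector) oi;
                List<?> values = (List<?>) data;
                List<String> parts = new ArrayList<>();
                for (int i = 0; i < soi.fields.size(); i++) {
                    parts.add(soi.names.get(i) + "=" + render(values.get(i), soi.fields.get(i)));
                }
                return "{" + String.join(",", parts) + "}";
            }
            default:
                throw new IllegalArgumentException("not modeled in this sketch: " + oi.getCategory());
        }
    }

    // Builds an inspector for struct<id:int,tags:array<string>>.
    static Inspector sampleInspector() {
        return new StructInspector(
            Arrays.asList("id", "tags"),
            Arrays.<Inspector>asList(
                new PrimitiveInspector("int"),
                new ListInspector(new PrimitiveInspector("string"))));
    }

    public static void main(String[] args) {
        Inspector oi = sampleInspector();
        System.out.println(oi.getTypeName());
        System.out.println(render(Arrays.asList(7, Arrays.asList("a", "b")), oi));
    }
}
```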

Hive SerDe test code:

// Build the schema and store it in a Properties object
private Properties createProperties() {
    Properties tbl = new Properties(); 

    // Set the configuration parameters 
    tbl.setProperty(Constants.SERIALIZATION_FORMAT, "9"); 
    tbl.setProperty("columns", 
        "abyte,ashort,aint,along,adouble,astring,anullint,anullstring"); 
    tbl.setProperty("columns.types", 
        "tinyint:smallint:int:bigint:double:string:int:string"); 
    tbl.setProperty(Constants.SERIALIZATION_NULL_FORMAT, "NULL"); 
    return tbl; 
}

public void testLazySimpleSerDe() throws Throwable { 
    try { 
      // Create the SerDe 
      LazySimpleSerDe serDe = new LazySimpleSerDe(); 
      Configuration conf = new Configuration(); 
      Properties tbl = createProperties(); 
      // Initialize the SerDe with the Properties
      serDe.initialize(conf, tbl); 

      // Data 
      Text t = new Text("123\t456\t789\t1000\t5.3\thive and hadoop\t1.\tNULL"); 
      String s = "123\t456\t789\t1000\t5.3\thive and hadoop\tNULL\tNULL"; 
      Object[] expectedFieldsData = {new ByteWritable((byte) 123), 
          new ShortWritable((short) 456), new IntWritable(789), 
          new LongWritable(1000), new DoubleWritable(5.3), 
          new Text("hive and hadoop"), null, null}; 

      // Test 
      deserializeAndSerialize(serDe, t, s, expectedFieldsData); 
    } catch (Throwable e) { 
      e.printStackTrace(); 
      throw e; 
    } 
}

private void deserializeAndSerialize(LazySimpleSerDe serDe, Text t, String s, 
      Object[] expectedFieldsData) throws SerDeException { 
    // Get the row ObjectInspector 
    StructObjectInspector oi = (StructObjectInspector) serDe 
        .getObjectInspector(); 
    // Get the column (field) information
    List<? extends StructField> fieldRefs = oi.getAllStructFieldRefs(); 
    assertEquals(8, fieldRefs.size()); 

    // Deserialize 
    Object row = serDe.deserialize(t); 
    for (int i = 0; i < fieldRefs.size(); i++) { 
      Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i)); 
      if (fieldData != null) { 
        fieldData = ((LazyPrimitive) fieldData).getWritableObject(); 
      } 
      assertEquals("Field " + i, expectedFieldsData[i], fieldData); 
    } 
    // Serialize 
    assertEquals(Text.class, serDe.getSerializedClass()); 
    Text serializedText = (Text) serDe.serialize(row, oi); 
    assertEquals("Serialized data", s, serializedText.toString()); 
}
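
In the test above, the input text carries `1.` in the anullint column, while both the expected field value and the expected serialized string show a null there. This suggests that a value which cannot be parsed as the declared type is read back as null and then written out using the configured null format. The following plain-Java sketch models only that round trip. It is not how LazySimpleSerDe is implemented; `NullRoundTripSketch` and its methods are hypothetical, and only the `int` and `string` types are modeled.

```java
public class NullRoundTripSketch {
    // Same role as the serialization null format set in createProperties().
    static final String NULL_FORMAT = "NULL";

    // Parses one tab-delimited line against column types ("int" or "string").
    static Object[] deserialize(String line, String[] types) {
        String[] raw = line.split("\t", -1);
        Object[] row = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
            String cell = i < raw.length ? raw[i] : NULL_FORMAT;
            if (cell.equals(NULL_FORMAT)) {
                row[i] = null;
            } else if (types[i].equals("int")) {
                try {
                    row[i] = Integer.valueOf(cell);
                } catch (NumberFormatException e) {
                    // Assumption of this sketch: an unparsable value becomes null.
                    row[i] = null;
                }
            } else {
                row[i] = cell;
            }
        }
        return row;
    }

    // Writes the row back as tab-delimited text, using NULL_FORMAT for nulls.
    static String serialize(Object[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(row[i] == null ? NULL_FORMAT : row[i].toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] types = {"int", "string", "int", "string"};
        Object[] row = deserialize("789\thive and hadoop\t1.\tNULL", types);
        // The third column was "1." on input and is written back as NULL.
        System.out.println(serialize(row));
    }
}
```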

Hive decouples reading the columns of a row from the way the row is stored.

A consumer of the data needs only the row Object and the corresponding ObjectInspector to read out the object for each column.

Hive's ExprNodeEvaluator, as well as UDFs, UDAFs, and UDTFs, all require an (Object, ObjectInspector) pair.
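
A rough sketch of why a function needs the pair rather than the Object alone: the same logical string may arrive as a Java String or as raw UTF-8 bytes, and only the inspector knows which. The names below (`StringInspector`, `JAVA_STRING`, `UTF8_BYTES`, `UpperFunction`) are hypothetical and written in plain Java. Real Hive UDFs use the actual ObjectInspector classes instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class PairSketch {
    // Hypothetical primitive inspector: converts some stored form into a java.lang.String.
    interface StringInspector {
        String getJavaString(Object data);
    }

    // The value is already a java.lang.String.
    static final StringInspector JAVA_STRING = data -> (String) data;

    // The value is a byte[] holding UTF-8 text.
    static final StringInspector UTF8_BYTES =
        data -> new String((byte[]) data, StandardCharsets.UTF_8);

    // A UDF-like function: bound to its input inspector once, then evaluated per value.
    static final class UpperFunction {
        private final StringInspector inputOI;

        UpperFunction(StringInspector inputOI) {
            this.inputOI = inputOI;
        }

        String evaluate(Object data) {
            if (data == null) {
                return null;
            }
            return inputOI.getJavaString(data).toUpperCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        UpperFunction onStrings = new UpperFunction(JAVA_STRING);
        UpperFunction onBytes = new UpperFunction(UTF8_BYTES);
        // Same function logic, two storage forms; both lines print HIVE
        System.out.println(onStrings.evaluate("hive"));
        System.out.println(onBytes.evaluate("hive".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Binding the inspector once and reusing it for every value keeps the per-row work down to the data access itself.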