新聞中心
ObjectInspector是Hive中一個咋一看比較令人困惑的概念,當(dāng)初讀Hive源代碼時,花了很長時間才理解。 當(dāng)讀懂之后,發(fā)現(xiàn)ObjectInspector作用相當(dāng)大,它解耦了數(shù)據(jù)使用和數(shù)據(jù)格式,從而提高了代碼的復(fù)用程度。 簡單的說,ObjectInspector接口使得Hive可以不拘泥于一種特定數(shù)據(jù)格式,使得數(shù)據(jù)流 1)在輸入端和輸出端切換不同的輸入/輸出格式 2)在不同的Operator上使用不同的數(shù)據(jù)格式。
創(chuàng)新互聯(lián)致力于互聯(lián)網(wǎng)品牌建設(shè)與網(wǎng)絡(luò)營銷,包括成都做網(wǎng)站、網(wǎng)站建設(shè)、SEO優(yōu)化、網(wǎng)絡(luò)推廣、整站優(yōu)化營銷策劃推廣、電子商務(wù)、移動互聯(lián)網(wǎng)營銷等。創(chuàng)新互聯(lián)為不同類型的客戶提供良好的互聯(lián)網(wǎng)應(yīng)用定制及解決方案,創(chuàng)新互聯(lián)核心團(tuán)隊十年專注互聯(lián)網(wǎng)開發(fā),積累了豐富的網(wǎng)站經(jīng)驗,為廣大企業(yè)客戶提供一站式企業(yè)網(wǎng)站建設(shè)服務(wù),在網(wǎng)站建設(shè)行業(yè)內(nèi)樹立了良好口碑。
這是ObjectInspector interface
public interface ObjectInspector extends Cloneable {
public static enum Category {
PRIMITIVE, LIST, MAP, STRUCT, UNION
};
String getTypeName();
Category getCategory();
}
這個interface提供了最一般的方法 getTypeName 和 getCategory。 我們再來看它的子抽象類和interface:
StructObjectInspector
MapObjectInspector
ListObjectInspector
PrimitiveObjectInspector
UnionObjectInspector
其中,PrimitiveObjectInspector用來完成對基本數(shù)據(jù)類型的解析,而StructObjectInspector用了完成對一行數(shù)據(jù)的解析,它本身有一組ObjectInspector組成。 由于Hive支持Nested Data Structure,所以,在StructObjectInspector中又可以(一層或多層的)嵌套任意的ObjectInspector。 Struct, Map, List, Union是Hive支持的4種集合數(shù)據(jù)類型,比如某一列的數(shù)據(jù)可以被聲明為Struct類型,這樣解析這一列的StructObjectInspector中就會嵌套了另一個StructObjectInspector。
現(xiàn)在我們可以從一個小例子看看ObjectInspector是如何工作的,這是一個Hive SerDe的測試用例代碼:
/**
* Test the LazySimpleSerDe class.
*/
public void testLazySimpleSerDe() throws Throwable {
try {
// Create the SerDe
LazySimpleSerDe serDe = new LazySimpleSerDe();
Configuration conf = new Configuration();
Properties tbl = createProperties();
//用Properties初始化serDe
serDe.initialize(conf, tbl);
// Data
Text t = new Text("123\t456\t789\t1000\t5.3\thive and hadoop\t1.\tNULL");
String s = "123\t456\t789\t1000\t5.3\thive and hadoop\tNULL\tNULL";
Object[] expectedFieldsData = {new ByteWritable((byte) 123),
new ShortWritable((short) 456), new IntWritable(789),
new LongWritable(1000), new DoubleWritable(5.3),
new Text("hive and hadoop"), null, null};
// Test
deserializeAndSerialize(serDe, t, s, expectedFieldsData);
} catch (Throwable e) {
e.printStackTrace();
throw e;
}
}
private void deserializeAndSerialize(LazySimpleSerDe serDe, Text t, String s,
Object[] expectedFieldsData) throws SerDeException {
// Get the row ObjectInspector
StructObjectInspector oi = (StructObjectInspector) serDe
.getObjectInspector();
// 獲取列信息
List extends StructField> fieldRefs = oi.getAllStructFieldRefs();
assertEquals(8, fieldRefs.size());
// Deserialize
Object row = serDe.deserialize(t);
for (int i = 0; i < fieldRefs.size(); i++) {
Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
if (fieldData != null) {
fieldData = ((LazyPrimitive) fieldData).getWritableObject();
}
assertEquals("Field " + i, expectedFieldsData[i], fieldData);
}
// Serialize
assertEquals(Text.class, serDe.getSerializedClass());
Text serializedText = (Text) serDe.serialize(row, oi);
assertEquals("Serialized data", s, serializedText.toString());
}
//創(chuàng)建schema,保存在Properties中
private Properties createProperties() {
Properties tbl = new Properties();
// Set the configuration parameters
tbl.setProperty(Constants.SERIALIZATION_FORMAT, "9");
tbl.setProperty("columns",
"abyte,ashort,aint,along,adouble,astring,anullint,anullstring");
tbl.setProperty("columns.types",
"tinyint:smallint:int:bigint:double:string:int:string");
tbl.setProperty(Constants.SERIALIZATION_NULL_FORMAT, "NULL");
return tbl;
}
從這個例子中,不難出,Hive將對行中列的讀取和行的存儲方式解耦和了,只有ObjectInspector清楚行和行中的列是怎樣存取的,但使用者并不知道存儲的細(xì)節(jié)。 對于數(shù)據(jù)的使用者來說,只需要行的Object和相應(yīng)的ObjectInspector,就能讀取出每一列的對象。
這段代碼再清晰不過了,ObjectInspector oi控制了對列的Access
for (int i = 0; i < fieldRefs.size(); i++) {
Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
if (fieldData != null) {
fieldData = ((LazyPrimitive) fieldData).getWritableObject();
}
assertEquals("Field " + i, expectedFieldsData[i], fieldData);
}
這段代碼的作用是把一行deserialize,然后再serialize
Object row = serDe.deserialize(t);
Text serializedText = (Text) serDe.serialize(row, oi);
由此不難看出,只要有了不同的SerDe對象,可以很容易的將一條數(shù)據(jù)deserialize,然后再serialize成不同的格式,從而非常方便的實現(xiàn)數(shù)據(jù)格式的切換。
理解了上面的例子,就不難理解為什么所有的Hive ExprNodeEvaluator 和 UDF,UDAF, UDTF 都需要 (Object, ObjectInspector) pair了。 數(shù)據(jù)存儲細(xì)節(jié)和使用的分離,使得Hive不需要針對不同的數(shù)據(jù)格式對同一個UDF, UDAF 或UDTF實現(xiàn)不同的版本,這些函數(shù)看到的只是WritableObject!
下面是表達(dá)式evaluator的interface:
/**
* ExprNodeEvaluator.
*
*/
public abstract class ExprNodeEvaluator {
/**
* Initialize should be called once and only once. Return the ObjectInspector
* for the return value, given the rowInspector.
*/
public abstract ObjectInspector initialize(ObjectInspector rowInspector) throws HiveException;
/**
* Evaluate the expression given the row. This method should use the
* rowInspector passed in from initialize to inspect the row object. The
* return value will be inspected by the return value of initialize.
*/
public abstract Object evaluate(Object row) throws HiveException;
}
initialize中需要初始化ObjectInspector,返回輸出數(shù)據(jù)的ObjectInspector(它負(fù)責(zé)解析evaluate method返回的對象);而每次evaluate call傳進(jìn)來一條Object數(shù)據(jù),它的解析由ObjectInspector負(fù)責(zé)。
接下來是GenericUDF抽象類:
public abstract class GenericUDF {
/**
* A Defered Object allows us to do lazy-evaluation and short-circuiting.
* GenericUDF use DeferedObject to pass arguments.
*/
public static interface DeferredObject {
Object get() throws HiveException;
};
/**
* The constructor.
*/
public GenericUDF() {
}
/**
* Initialize this GenericUDF. This will be called once and only once per
* GenericUDF instance.
*
* @param arguments
* The ObjectInspector for the arguments
* @throws UDFArgumentException
* Thrown when arguments have wrong types, wrong length, etc.
* @return The ObjectInspector for the return value
*/
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException;
/**
* Evaluate the GenericUDF with the arguments.
*
* @param arguments
* The arguments as DeferedObject, use DeferedObject.get() to get the
* actual argument Object. The Objects can be inspected by the
* ObjectInspectors passed in the initialize call.
* @return The
*/
public abstract Object evaluate(DeferredObject[] arguments)
throws HiveException;
/**
* Get the String to be displayed in explain.
*/
public abstract String getDisplayString(String[] children);
}
它的機制與evaluator非常類似,初始化中敲定ObjectInspector數(shù)組,它們負(fù)責(zé)解析輸入,返回output數(shù)據(jù)(即evaluator method返回的Object)的ObjectInspector;每次evaluate call傳進(jìn)一個Object數(shù)組,返回一條數(shù)據(jù)。
Hive支持LazySimple, LazyBinary,Thrift等不同的數(shù)據(jù)格式,同一個查詢計劃中,可以在operator上切換數(shù)據(jù)流的格式。比較常見的是在Mapper端使用LazySimpleSerDe,Mapper輸出的數(shù)據(jù)使用LazyBinarySerDe,因為binary格式比較節(jié)省空間,從而減少repartition時的網(wǎng)絡(luò)傳輸。 如果你想看查詢計劃的每一步到底使用了哪一種SerDe格式,只要用"Explain Extended"就可以查清楚了。
文章題目:Hive中的ObjectInspector設(shè)計
網(wǎng)站URL:http://www.ef60e0e.cn/article/pieidc.html