Skip to content

Fix for ProtobufStructObjectInspector to work correctly with Hive#400

Open
rahulrv1980 wants to merge 4 commits into
twitter:masterfrom
rahulrv1980:master
Open

Fix for ProtobufStructObjectInspector to work correctly with Hive#400
rahulrv1980 wants to merge 4 commits into
twitter:masterfrom
rahulrv1980:master

Conversation

@rahulrv1980

Copy link
Copy Markdown

Use builder class since hive expects mutable objects while message objects are immutable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you completely changed the formatting? I'm guessing it was your IDE?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It was. Fixed

@mikebern

mikebern commented Jun 6, 2014

Copy link
Copy Markdown

Hi Rahul,

I am getting an exception with your fix. I have compiled elephant-bird for hadoop2 and bumped protobuf to 2.5.0. I will try to get more debug info next week.

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
... 8 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.setStructFieldData(ProtobufStructObjectInspector.java:159)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:396)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:152)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$300(MapOperator.java:125)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)
... 9 more

@rahulrv1980

Copy link
Copy Markdown
Author

I think this is likely because of a repeated field in your schema. We don't have any, so, I did not do much testing around that. There were existing unit tests which dealt with repeated fields, and I essentially got them to pass. I can take a look at it later this week

@rahulrv1980

Copy link
Copy Markdown
Author

I think I have a fix. Let me know if you are ready to try it

@mikebern

mikebern commented Jun 9, 2014

Copy link
Copy Markdown

Rahul, thanks a lot. I'll test it first thing tomorrow morning.

@rahulrv1980

Copy link
Copy Markdown
Author

Mike, please try the code in the branch : https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. Also, if it does not work, could you write a test case for the failure and I can look to fix it

@mikebern

mikebern commented Jun 9, 2014

Copy link
Copy Markdown

Rahul, great - the previous exception is gone, but I am seeing another one:

Caused by: java.lang.ClassCastException: Logentries$Impression$NestedMessageType cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.getStructFieldData(ProtobufStructObjectInspector.java:193)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:394)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:337)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:395)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:152)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$300(MapOperator.java:125)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)

NestedMessageType is the type of first non-empty repeated nested message in the Impression message.

Let me know if I can help fixing this.

@rahulrv1980

Copy link
Copy Markdown
Author

Could you provide a test case with your message definition with the failures. There are existing unit tests for ProtobufStructObjectInspector. You could modify an existing one or better still, add a new one mimicking your use case with protocol buffers and getting a repro.

@mikebern

Copy link
Copy Markdown

Hi Rahul, I have created a patch that does not cause exceptions:
https://gist.github.com/mikebern/825812f0de95a2f03dd5

I will need to verify that the data is correct. I'll try to add unit tests after that.

There seem to be a problem though - I am seeing a huge performance degradation. We have patched Hive to assume that the table Serde and partition Serdes are the same. This way EB works for us on the partitioned tables. If I use your branch and the unpatched Hive, it seems, that Hive tries to convert each row to the table Serde and that causes huge performance loss - I wasn't even able to run select count on 30 sec of our data. I'll post more details later.

@rahulrv1980

Copy link
Copy Markdown
Author

How are you validating that the change caused the performance degradation? Did the ProtobufStructObjectInspector without my change work at higher performance (Not sure how the old version worked at all). Could you try this on your patched Hive? I am making my first venture into Hive and so, I don't have a baseline to compare performance

@mikebern

Copy link
Copy Markdown

This is not a bug in your code. You code enables a use case that didn't work before at all with the stock Hive. Specifically, it allows to run queries on Hive tables with partitions. I am still digging into this but here is what I think is going on.

The gist of the performance problem is that Hive looks at each partition Serde separately and tries to find a way to convert partition rows to what it perceives to be the overall table Serde (AFAIK, table serde can be specified explicitly, or, if not, is taken from the first(?) partition).

Under debugger, I see that

hive-0.12.0-cdh5.0.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorConverters.java:147

if (inputOI.equals(outputOI)) {
return new IdentityConverter();
}

check is false, even though both inputOI and outputOI are of the same com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector type and represent exactly the same schema. This is because ProtobufStructObjectInspector does not override equals/hashCode methods.

Because of the failed check, Hive sees that inputOI and outputOI are different and creates a converter from the former to the latter using your code. It then applies it to every row in the table, which turns out to be a very expensive operation both in execution time and memory.

My change simply monkey patches Hive to always think that inputIO and outputIO are the same object, and with it I have a reasonable performance:

MapOperator.java : 200

if (opCtx.tblRawRowObjectInspector.getClass().getName() != "com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector") {
opCtx.partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter(
partRawRowObjectInspector, opCtx.tblRawRowObjectInspector);
} else {
opCtx.partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter(
opCtx.tblRawRowObjectInspector, opCtx.tblRawRowObjectInspector);
}

same in FetchOperator.java: 400

if (currTbl != null || serde.getClass().getName() == "com.twitter.elephantbird.hive.serde.ProtobufDeserializer") {
tblSerde = serde;
}
else {
tblSerde = currPart.getTableDesc().getDeserializerClass().newInstance();
tblSerde.initialize(job, currPart.getTableDesc().getProperties());
}

So, to complete your patch we probably need to override equals on the Protobuf inspectors. Let me know if you would have time for this (in case, of course, my assessment is correct). If not, I can do it myself but it will take me longer, since I don't understand the code that well yet.

I will also add more unit tests by the end of the week.

@rahulrv1980

Copy link
Copy Markdown
Author

Got it. Could you try adding this equals() method?

@OverRide
public boolean equals(Object o) {
if (!(o instanceof ProtobufStructObjectInspector)) {
return false;
}
ProtobufStructObjectInspector that = (ProtobufStructObjectInspector)o;
return (descriptor.equals(that.descriptor) && builder.equals(that.builder));
}

@rahulrv1980

Copy link
Copy Markdown
Author

I have added your fix + the equals() method change to the branch at https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. Let me know if it works

@mikebern

Copy link
Copy Markdown

Thanks, Rahul. I will try tomorrow morning.

@mikebern

Copy link
Copy Markdown

I modified the code slightly:

public boolean equals(Object o) {
if (!(o instanceof ProtobufStructObjectInspector)) {
return false;
}
ProtobufStructObjectInspector that = (ProtobufStructObjectInspector)o;
boolean res = descriptor.equals(that.descriptor);
boolean res2 = builder.equals(that.builder);
return res && res2;
}

@OverRide
public int hashCode() {
return descriptor.hashCode() + 7*builder.hashCode();
}

and I see that descriptor.equals(that.descriptor) is true, but builder.equals(that.builder) is false. Do you think that the latter check is essential?.

@rahulrv1980

Copy link
Copy Markdown
Author

I suppose the descriptor alone may be sufficient. But, it is strange that builder.equals(that.builder) is returning false. Could you share what the builder objects are?

@mikebern

Copy link
Copy Markdown

Sorry, was busy with getting Spark/Shark working with EB. I will send an update tomorrow.

@mikebern

Copy link
Copy Markdown

I check the descriptor - in all calls to the equal method the compared objects are actually the same object (and different for the Builder objects). Looks like Descriptor class does not override equals method, so it works only because it's the same object. I tried adding another partition and I see an exception. Not sure if it's related or not.

Ironically, Shark works ok across multiple partitions.

@rahulrv1980

Copy link
Copy Markdown
Author

Could you dump the different builder objects?

@rahulrv1980

Copy link
Copy Markdown
Author

@mikebern ping?

@rahulrv1980

Copy link
Copy Markdown
Author

You could try adding below to the proto file:
option java_generate_equals_and_hash = true;

@jkm137

jkm137 commented Nov 6, 2014

Copy link
Copy Markdown

I was hitting this issue, so I cloned and built your changes (https://github.com/rahulrv1980/elephant-bird). With a simple table of AddressBooks (com.twitter.elephantbird.examples.proto.AddressBookProtos$AddressBook), now I'm hitting:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.setStructFieldData(ProtobufStructObjectInspector.java:159)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:396)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:154)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:127)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:509)
... 9 more

Thoughts?

Thanks in advance!

@CLAassistant

CLAassistant commented Jul 18, 2019

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Rahul Ravindran seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants