I interviewed Tina Groves, who discussed variety in big data, specifically around text analytics.
It’s great to speak with you again, Tina. It’s been about a year since we’ve talked. Today I’m looking forward to hearing your views on the topic of variety in big data, specifically around text analytics. Before we start, can you provide a brief background of yourself?
Yes, thank you, Dustin, it’s a pleasure to speak with you again. I’m Tina Groves; I’m a senior product manager with IBM. I work in product strategy in the big data and analytics area, and I’ve been working in this area for a little over two years at this point. Last spring I shifted my focus specifically to look at text analytics.
Thank you. My first question regarding this topic: What’s the latest buzz in big data?
The term has been used so often now that it becomes a little hard to keep up with everyone claiming they’re doing something in big data. The three trends I’ve seen are about solving the problems of addressing not just very large volumes of data, but how to achieve the analytics experience that we’ve enjoyed for several decades now with relational databases, and how to tackle the types of data that can be stored in Hadoop and NoSQL databases. What we’re seeing is a great deal of innovation around dynamic SQL, or interactive SQL, so that one can put the traditional analytical tools on top of these new forms of data storage.
Second, you can see a lot of concerns around security and privacy for companies that have gone past the pilot stage and want to share data as a data aggregator or data collector. Lastly, you see a lot of innovation in tools for data wrangling. We’ve moved past the data scientist analyzing the data alone; we’re looking at how we can take all this messy data, data that’s normally not joined together, and bring it together in a way that creates the experience we’re used to seeing on the relational database side. Data wrangling is a term that was coined largely at Stanford University, and you can see products like Trifacta and Paxata claiming to address this space. One very common problem is how to transform text into the typical rows and columns you see in a traditional environment so that you can use it in an analytical model.
In supply chain there’s all kinds of paper and e-mail and PDF attachments. How can this be turned into data?
That’s a great question, Dustin. Think of any kind of environment where you’re taking data from, let’s say, the middle of a logistics chain. How do you manage all this data? Over the past decade, I’ve seen many clients barcoding the information; they just basically use a barcode. That helps with a lot of the structured data: a piece of paper is scanned and transformed into its component parts, and sometimes there’s electronic data interchange.
The tough part is when somebody’s typed in a note or there are special instructions or there’s something in the e-mail that’s related to that PDF attachment, and that’s where the text analytics really comes into play. Nowadays, what you’ll see is more embedded technology to help transform those special instructions into instructions that can then be tracked, for example, in a supply chain context. How does this turn into data? Well, I think that’s going to require some more questions because it’s a very complex area.
Why is analyzing text so hard?
Well, I’ve been learning from the scientists I work with at IBM, who specialize in natural language processing, and I have a greater appreciation today than I had five or six months ago. Analyzing text where you can just do straight word matching is really text parsing, and that’s not that hard. If I see the word Tina in a piece of text and Tina represents a person, then we can have a person field with my name, Tina, in it. Detecting the names of cities, locations, or countries can be done just as easily.
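The straight word matching Tina describes can be sketched in a few lines. This is a minimal illustration, not any IBM toolkit; the name lists and sample sentence are invented for the example.

```python
# Dictionary-based text parsing: a token is an "entity" only if it
# appears in a known list. No linguistics involved, just lookup.
PEOPLE = {"Tina", "Dustin"}
CITIES = {"Toronto", "Ottawa", "New York"}

def extract_entities(text):
    """Return (people, cities) found by straight word matching."""
    tokens = text.replace(".", " ").replace(",", " ").split()
    people = sorted(t for t in tokens if t in PEOPLE)
    cities = sorted(c for c in CITIES if c in text)
    return people, cities

people, cities = extract_entities("Tina shipped the parcel from Toronto.")
# people is ["Tina"], cities is ["Toronto"]
```

This works only for names you already know about, which is exactly why Tina calls it the easy part: no inference, just matching.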
What becomes more difficult is, let’s say, analyzing customer-support records, which could happen in a supply chain; calls coming in from people complaining. How can you infer, for example, the sentiment that that customer has? Are they a little angry, somewhat angry, somewhat complacent? How would you then use that mood or that person’s sentiment as an input into your decision-making? And then combine that, for example, with whether this is the first time this person has called in, the second time, the third time? And what’s on the line? Is there a parcel that’s worth a few bucks, or is this a parcel that represents a birthday present for someone very close to the sender?
In those cases you want to be able to combine both qualitative and quantitative information, and that’s where we move from just text parsing and detecting specific words into understanding the perception of the person. What is their awareness? What is their opinion? I’ve already mentioned sentiment. That really requires more skill in linguistics and language use than normal text detection or text recognition.
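Combining qualitative sentiment with quantitative facts, as described above, might look like the sketch below. The anger lexicon, the weights, and the priority formula are all invented for illustration; a real system would use a trained sentiment model rather than keyword counts.

```python
# Hypothetical lexicon: words suggesting an angry caller.
ANGRY_WORDS = {"angry", "unacceptable", "furious", "late"}

def sentiment_score(note):
    """Crude 0..1 anger estimate from keyword hits in a call note."""
    tokens = (t.strip(".,!?") for t in note.lower().split())
    hits = sum(1 for t in tokens if t in ANGRY_WORDS)
    return min(hits / 2.0, 1.0)

def priority(note, call_count, parcel_value):
    """Blend sentiment with repeat-call count and parcel value.

    The 0.5 / 0.3 / 0.2 weights are purely illustrative.
    """
    return (0.5 * sentiment_score(note)
            + 0.3 * min(call_count / 3.0, 1.0)
            + 0.2 * min(parcel_value / 100.0, 1.0))

p = priority("This is unacceptable, my parcel is late!",
             call_count=3, parcel_value=40)
# p is 0.88: an angry third-time caller gets escalated
```

The point is the shape of the calculation, not the numbers: the qualitative signal (mood) becomes one input alongside the quantitative ones (call history, value at stake).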
What technologies can supply chain consider to address this challenge?
That’s another great question. In the big data space, this area is evolving very quickly, so three months from now the answer may change a little bit. If you’re using Hadoop, the technologies today tend to be very developer-oriented. You’ll see Python scripts being developed that leverage the natural language processing toolkits, which is a very costly, very time-consuming way to address text analytics, but it is an area that’s evolving very quickly.
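The developer-oriented Python scripts Tina mentions typically start with a preprocessing step like the one sketched here: tokenize the raw notes, drop stop words, and count term frequencies. A real project would lean on a toolkit such as NLTK; this stand-in uses only the standard library, and the stop-word list and sample documents are invented for the example.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real toolkits ship much larger ones.
STOP_WORDS = {"the", "a", "is", "my", "was", "and", "to", "it", "again"}

def term_frequencies(docs):
    """Count non-stop-word tokens across a collection of documents."""
    counts = Counter()
    for doc in docs:
        for tok in re.findall(r"[a-z']+", doc.lower()):
            if tok not in STOP_WORDS:
                counts[tok] += 1
    return counts

freqs = term_frequencies([
    "The parcel was late and the driver was rude",
    "My parcel is late again",
])
# freqs["late"] == 2, freqs["parcel"] == 2; "the" is filtered out
```

Even this trivial pipeline hints at why the approach is costly: every step (tokenization, stop words, spelling variants, domain vocabulary) is hand-built by developers rather than handled by an application.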
If you’re dealing with a lot of documents—we were talking earlier about e-mail and e-mail with PDF attachments—there is an area around content management called content analytics, which can facilitate pulling out the information more quickly, and many of those technologies will have additional toolkits, or what they might call annotators or extractors, to infer opinion, sentiment, or awareness factors. A third type of technology is the natural language processing toolkits, like you see from Basis Technology or an open-source one called GATE.
IBM, of course, also has these technologies, but because of the difficulty in using them and their developer focus, we tend to bundle them in our products so that they’re hidden; basically, they’re one of the engines we use to drive people’s requests and their interaction with the information. That’s three. Just to recap: the developer-oriented technologies like the Python toolkits; the application view you get with content analytics; and a third that’s a kind of middle ground in the data-wrangling space, where the natural language processing engines are embedded as part of the user interface.
What do you see in the future?
This is another good question. This area is just so hard to do once you get past the basics, and the amount of skill needed is significant: in some cases statistical backgrounds, if you’re going to do clustering on words or understand, for example, how this turns into fraud detection. I have one customer, Memorial Hospital, who was looking at automating their accounts payable.
After they finished automating this process, they found all kinds of unusual behavior. The behavior around some contracts being handled was suspicious enough that they engaged the FBI to investigate. At the end of the investigation, the FBI successfully prosecuted several employees for bribery, and some vendors the hospital was dealing with for bid rigging. The outcome was that, because this area is hard, Memorial Hospital actually created an application that they now resell to other hospital systems tackling a similar problem.
I see that the trend will be to encapsulate the algorithms needed to analyze text, so it won’t be just generic sentiment; it’ll be sentiment for supply chain vendors or, even more specifically, sentiment for vendors in logistics planning, and sentiment in the terms used by people in the manufacturing area. People who do warranty analysis, for example, are looking for the shortest time to resolution. In the future, I see that the developer tools will eventually become less important, and this type of technology will be baked into the applications.
Thanks, Tina, for sharing today.
Thank you, Dustin, for the opportunity to speak with you and your audience again.
Another link which explains the technology a little more: Why is analyzing text so hard? | The Big Data Hub
About Tina Groves
Big Data and Analytics Product Strategy at IBM