Helping users in data analysis through a hierarchical verbalized data description.
Data analysis is a task that is both time-consuming and demands considerable expertise in statistics and IT. Users mostly have to specify their analytical intents in a software-specific syntax or UI and then interpret the results themselves. This is one reason for the current high demand for experts in the analytics sector: the work cannot easily be done by just anyone. The emerging question is: How can we help everyday users who have neither the time nor the statistical and IT expertise to analyse data? There are several approaches, but most of them involve easing the way towards visual analytics. To open up a completely new perspective: Why not use verbal encoding instead of abstract visual encoding to describe and understand data sets? An example of this is given in Figure 1.
Note: This approach is not targeted at more advanced analytics like model-building approaches and is more about describing and interpreting given data.
The benefits of abstract visual encodings - and why we may no longer need them to the current extent
Visual encodings have a long history, from simple tally sheets in the ancient world to complex modern dependency visualizations. Why they work so well is very straightforward: they provide information in an aggregated way and build on our best perceptual system - the visual system. That one data point is larger than another is immediately apparent in a diagram, whilst even simple numbers require us to reason about them abstractly, which demands considerable cognitive capacity.
But visual encodings stem from a time when people had to interpret and reason with the data themselves, and they supported this task in the best way possible. Today's computers, with their immense computing power and intelligent models, could very well do the interpreting and reasoning for us.
What are the main obstacles?
There are several emerging questions for this approach, which are discussed in the following section. First and foremost, data analysis builds on two core concepts - pattern recognition and semantic knowledge. Whilst pattern recognition enables us to see a trend or to characterize a distribution and is basically statistical understanding at differing complexity levels, semantic knowledge tells us that calculating a trend in a list of existing postal codes or characterizing the distribution of street names by the number of characters is probably not a useful analysis. Pattern recognition is the easy part: every serious statistical package is able to calculate averages, deviations, regressions or other statistical metrics which make up patterns. Semantic knowledge, on the other hand, is harder to come by, as it requires a lot of prior experience. Take a look at the following list of numbers:

80331, 10115, 20095, 50667, 01067
At this point you probably figured out that these are German postal codes. But how should an algorithm understand that these are not numbers, but nominal “titles” of geographical regions? Much data has to be identified by its context - guessing that a number is a postal code if it is surrounded by other address data is not that hard. Distributions are also handy: if the median is 46 and the range is 0 to 100, guessing age as the dimension is not that far off. There are currently a lot of research groups looking at this, but to my knowledge no one has released an algorithm or database yet.
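To make the context-guessing idea concrete, here is a minimal sketch of one such heuristic: a check for whether a numeric-looking column is plausibly a list of German postal codes rather than a quantity to compute with. The function name and the exact checks are illustrative assumptions, not an established algorithm; it relies on the fact that German postal codes are five-digit strings (possibly with a leading zero) roughly in the range 01001 to 99998.

```python
def looks_like_german_postal_code(values):
    """Heuristic sketch: German postal codes are five-digit strings,
    possibly with a leading zero, roughly in the range 01001-99998.
    A plain numeric parse would silently drop that leading zero."""
    for v in values:
        s = str(v).zfill(5)  # restore a leading zero lost by numeric parsing
        if not (s.isdigit() and len(s) == 5 and 1001 <= int(s) <= 99998):
            return False
    return True

# A column that parses as integers but should be treated as nominal labels:
print(looks_like_german_postal_code([80331, 10115, 1067]))  # → True (01067 lost its zero)
# A column that is plausibly ages, not postal codes:
print(looks_like_german_postal_code([46, 23, 71]))          # → False
```

A real system would combine many such signals (surrounding address fields, value distributions, column names) rather than a single range check.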
Another very interesting question is: How could a computer generate descriptions of data that match the desired granularity of the user? For a data set with the business volumes of different business units over the last 15 years, should a statement be generated about the total business volume, or about just one business unit during the last 5 years? The answer to this question may be two-fold, with a side note. Firstly, we could respond by generating not one description, but several descriptions in a hierarchical, collapsible structure. This way users could really explore the data set like they would with a visualization: from the overview down to specific sections. Secondly, asking the system questions would be the next logical step. In order to do so, the system would need to understand and derive queries from natural language. An interesting approach in this direction was recently launched by Google, who integrated a natural language query function into their spreadsheet application. The side note: although verbalized statements may greatly benefit data analysis, they mostly wouldn't eliminate the need for an actual data representation like a table or a complementary visualization with the values labeled.
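The hierarchical, collapsible structure described above can be sketched as a simple tree of statements, where the user controls how deep the rendering goes. The class name, the example statements, and all the numbers below are invented for illustration only; they are not taken from any real data set.

```python
from dataclasses import dataclass, field

@dataclass
class Description:
    # One verbalized statement plus optional finer-grained statements
    # the user can expand, mirroring overview-to-detail exploration.
    text: str
    children: list["Description"] = field(default_factory=list)

    def render(self, depth=0, max_depth=1):
        """Render statements down to max_depth; deeper levels stay collapsed."""
        lines = ["  " * depth + "- " + self.text]
        if depth < max_depth:
            for child in self.children:
                lines.extend(child.render(depth + 1, max_depth))
        return lines

# Hypothetical example data, overview at the root, detail in the children:
report = Description(
    "Total business volume grew 8% over the last 15 years.",
    [
        Description("Unit A: +12% over the last 5 years.",
                    [Description("Unit A, dishwashers: +12% over the last 10 years.")]),
        Description("Unit B: -3% over the last 5 years."),
    ],
)
print("\n".join(report.render()))  # overview plus one expanded level of detail
```

Raising `max_depth` corresponds to the user expanding a section; the data representation itself (table or visualization) would sit alongside this tree, as noted above.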
Of course, the remaining question at this point is: How would the verbalized statements be generated? Given that the structure of a statement about a data set is mostly built around a variable description combined with either a data point or an aggregation of data points, plus an optional filter remark, typical statements about data could be generated relatively easily on a rule basis. An example could be: The business volume of dishwashers increased by 12% over the last 10 years. Or: 45 women at TestCompany earned more than $100,000. Although these statements have the building blocks referred to above, they also use some more inconspicuous semantic knowledge. That “over the last 10 years” needs the context of the current year, referenced against the data set in question, and that the verb “earn” is mainly used in the context of income, indicates that even though the basic statement structure is easy, generating good statements may also need quite a lot of knowledge.
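A minimal rule-based generator built from exactly these building blocks - a variable description, an aggregation, and an optional filter remark - might look like the sketch below. The function name, the two template rules, and the wording are assumptions for illustration; a production system would need many more rules plus the semantic knowledge discussed above.

```python
def verbalize(variable, aggregation, value, filter_clause=None):
    """Assemble a statement from the building blocks: a variable
    description, an aggregated value, and an optional filter remark.
    Only two illustrative rules are implemented here."""
    if aggregation == "change_pct":
        direction = "increased" if value >= 0 else "decreased"
        parts = [f"The {variable} {direction} by {abs(value):.0f}%"]
    elif aggregation == "count":
        parts = [f"{value:.0f} {variable}"]
    else:
        raise ValueError(f"no template for aggregation {aggregation!r}")
    if filter_clause:
        parts.append(filter_clause)
    return " ".join(parts) + "."

# Reproduces the two example statements from the text:
print(verbalize("business volume of dishwashers", "change_pct", 12,
                "over the last 10 years"))
print(verbalize("women at TestCompany earned more than $100,000", "count", 45))
```

Note how the hard part is absent from this sketch: nothing here knows that "over the last 10 years" must be anchored to the current year, or that "earned" only fits income variables.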
Of course, to do this properly, one would need to train an NLP (Natural Language Processing) model to make statements about data, trained by experts on a wide variety of data contexts.
Where could it be useful?
So we pointed out the general intent of helping (inexperienced) users understand data sets faster. But what are concrete applications that could benefit from this? Several use cases emerge:
Hierarchical verbalized descriptions could make data analytics platforms more beginner- and management-friendly. Examining a data set without getting beaten to death with complex and detailed visualizations and the need to specify queries (which is time-consuming and in most apps not intuitive for beginners) could remove entry barriers.
Automatic descriptions of data sets would make datasets accessible for blind people - in every application that deals with data.
Automatic descriptions are a step towards automated reporting. Whilst data visualization recommendation is very much up and coming, the text for reports is still mostly written by analysts. Imagine your quarterly management report automatically generating not only a fitting visualization but also writing a title for each slide and drafting the bullet points!