I submitted this abstract to PPIG’20 but have just withdrawn it. I’ve been working on this idea for the last two years, but lockdown has given me time to think about my research goals, and it no longer seems like the right thing to be doing at the moment. It just isn’t going to make a dent. I’m leaving the abstract here as a marker.
Data wrangling is a crucible for end-user programming. Data scientists must clean, transform, and integrate data before they can do analysis. As [Kandel11a] puts it, “‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis”. It is worse for non-programmers because data wrangling is ad hoc and cannot be done by recipe. We take this as a research challenge: can we make data wrangling as easy as “spreadsheeting” for non-programmers?
Subtext takes a foundational approach to this problem, co-designing a language and environment from scratch to bridge Norman’s two gulfs [Norman88]. Subtext helps bridge the Gulf of Evaluation by making all program execution fully visible, not as a tacked-on visualization, but incorporated into the representation of programs themselves. Programs are executions.
Norman’s other gulf, that of Execution, is at the heart of the data wrangling problem: how can a non-programmer know what to do to transform their data? Mainstream data science platforms like Jupyter require code merely to change a single datum! Our answer is to provide a direct-manipulation environment where data can be imported, edited, and transformed using a small and systematic set of affordances. These manual manipulations are recorded as reusable and customizable scripts. Crucially, the Subtext language is designed so that it can naturally represent such recordings (unlike the convoluted code generated by standard “macro recorders”). The design goal is that any Subtext program can be understood as a series of operations that could be performed manually in the UI, and vice versa.
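To make the idea of recorded manipulations concrete, here is a minimal Python sketch. It is not Subtext and not its language: the `Recorder` class and the `set_cell`, `drop_rows_where`, and `rename_column` helpers are hypothetical names invented for this illustration. It only shows the general pattern of applying each edit immediately while logging it as a step that can be replayed later.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]
Table = list[Row]
Step = Callable[[Table], Table]


@dataclass
class Recorder:
    """Applies each manipulation immediately and logs it for later replay."""
    table: Table
    script: list[tuple[str, Step]] = field(default_factory=list)

    def do(self, label: str, step: Step) -> None:
        self.table = step(self.table)      # the user sees the result right away
        self.script.append((label, step))  # the same step is recorded


def replay(script: list[tuple[str, Step]], table: Table) -> Table:
    """Re-run a recorded script, in order, on another table."""
    for _label, step in script:
        table = step(table)
    return table


# Hypothetical manipulations; each returns a new table and mutates nothing.
def set_cell(row_index: int, column: str, value: Any) -> Step:
    return lambda t: [{**r, column: value} if i == row_index else r
                      for i, r in enumerate(t)]


def drop_rows_where(column: str, value: Any) -> Step:
    return lambda t: [r for r in t if r[column] != value]


def rename_column(old: str, new: str) -> Step:
    return lambda t: [{(new if k == old else k): v for k, v in r.items()}
                      for r in t]


raw = [
    {"name": "Ada", "cty": "Londn"},
    {"name": "", "cty": "Paris"},
    {"name": "Bob", "cty": "Oslo"},
]

rec = Recorder(raw)
rec.do("fix typo in first row", set_cell(0, "cty", "London"))
rec.do("drop rows with no name", drop_rows_where("name", ""))
rec.do("rename cty to city", rename_column("cty", "city"))

cleaned = rec.table
replayed = replay(rec.script, raw)  # same steps, run again from the raw data
```

In this toy version the recorded script is just a list of labelled steps, so customizing it means editing or reordering entries in that list. Addressing a cell by row index is an assumption of the sketch and would be fragile on new data; a real design would need a more robust way to refer to what was edited.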
This demonstration will focus on the hard sub-problem of querying, and its black heart: relational join. Even in the most enlightened tools like Wrangler [Kandel11b], joins are still an abstract concept shrouded in mystery (will that be an inner or outer join?). We propose denormalizing the relational model to support nested tables and explicit binary relationships in such a way that joins decompose into simpler operations corresponding to direct manipulations in the UI. Recording such manipulations produces a reusable query without having to learn a query language or understand relational algebra, which we hope will bridge the Gulf of Execution for queries. We call this Query by Manipulation, as opposed to prior research on Query by Example, which relies on the computer being smart enough to infer your query, and traditional query languages, which rely on you being smart enough.
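As a rough illustration of how a join might decompose into simpler steps over nested tables, consider the Python sketch below. It is an assumption-laden toy, not the Subtext design: the `nest`, `drop_empty`, and `flatten` functions and the sample `customers` and `orders` tables are invented for this example. The point is only that the inner-versus-outer question turns into a visible choice about whether to drop rows whose nested table is empty.

```python
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},   # has no orders
]
orders = [
    {"customer_id": 1, "item": "pen"},
    {"customer_id": 1, "item": "ink"},
    {"customer_id": 2, "item": "cup"},
]


def nest(parents, children, parent_key, child_key, as_field):
    """Give each parent row a nested table of its matching child rows."""
    return [
        {**p, as_field: [c for c in children if c[child_key] == p[parent_key]]}
        for p in parents
    ]


def drop_empty(rows, nested_field):
    """Remove parent rows whose nested table is empty."""
    return [r for r in rows if r[nested_field]]


def flatten(rows, nested_field, child_columns):
    """Emit one flat row per nested child row.

    A parent with an empty nested table yields a single row whose
    child columns are None.
    """
    flat = []
    for r in rows:
        parent = {k: v for k, v in r.items() if k != nested_field}
        children = r[nested_field] or [{c: None for c in child_columns}]
        for child in children:
            flat.append({**parent, **{c: child[c] for c in child_columns}})
    return flat


# Step 1: make the relationship explicit by nesting orders under customers.
nested = nest(customers, orders, "id", "customer_id", "orders")

# Step 2a: flatten directly, which behaves like a left outer join.
left_outer = flatten(nested, "orders", ["item"])

# Step 2b: drop customers with no orders first, which behaves like an inner join.
inner = flatten(drop_empty(nested, "orders"), "orders", ["item"])
```

Here `left_outer` keeps a row for Cy with `item` set to `None`, while `inner` omits Cy entirely. Each step corresponds to something a user could see and do on the nested table, rather than a join type to be chosen in advance.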