Tomas Petricek and I submitted a paper to the Programming Journal:
Baseline: Operation-Based Evolution and Versioning of Data arXiv, Playable demo.
We got good feedback from one reviewer but the other two were unsympathetic. We’ve been asked to make “major revisions”, which are mostly the usual demands for more “evidence” and “evaluation”. I’m not sure if we will revise or submit elsewhere. For those interested in how this game works I’ve attached the reviews below.
Review #15A
Overall merit
1. Accept
Reviewer expertise
4. Expert
Paper summary
This paper introduces Baseline, which is presented as a calculus for operation-based changes to data (for versioning purposes), but is also available as an interactive tool. Baseline supports records and lists, and is extended to include references to express relational database-like data structures (though the Baseline data model is more general, as it allows for nesting unlike relational data). The core functionality is to reorder operations to apply them to different branches/versions, but its utility is also explored in implementing other functionality.
Comments for authors
I really like the paper as an essay on the topic.
- The presented model is elegant, and the discussions and insights are interesting.
- The targeted problem is open and not well addressed.
- I particularly like section 5, which demonstrates how quite interesting features can be represented as the core operation.
- It includes an interactive tool to explore the concepts.
The main weakness of the paper for me seems to be that a lot of what they describe technically lacks strong novelty. It is the write-up, together with the interactive tool for exploration of the concepts, that I find most valuable.
I am a bit torn if I consider myself an Expert on this topic. I decided to be bold in the categorization, but I do lack knowledge about the current state-of-the-art in parts of this topic – I am most familiar with these issues from the view of distributed data structures.
More detailed comments/discussion
Many of these are discussion points. I would love to see some being expanded on in the final paper, but no strong requirements.
- I did try the interactive demo, but unfortunately at some point in Chapter 2 it told me I had gone off script. I still am quite happy that it was included, it communicates the idea in a way that the text cannot.
- “Operation[s typo?] use infix syntax” and “We execute operations with the ; infix operator”. I find this explanation quite confusing. Given the example: “S1 ; . append e”. What is the operation? The first quoted sentence implies that »append« is the operation, while the second quoted sentence implies it is ». append e«. It seems the term is overloaded, which I find quite confusing.
- “Actually Baseline only has values, with initial values serving as prototypes.” – I guess this works because you can uniquely compute the type from the prototype? It is also unclear why this is an advantage? The remark mentions this simplifies the programming experience, but … how? As far as I understand, this is hidden behind the type operations? You also say you explored an untyped approach, but at this point in the paper it is not even clear what the types do add – changing the structure of the data makes perfect sense, but what do types add for this system? Overall, I find this remark to just add more questions than it answers.
- End of 3.2 (page 9). I wonder about this part. You state that the number of operations has grown, thus the implementation becomes unwieldy. You also remark that the correctness of project/retract depends on the intent of the operations. My best guess is, that this intend is specific to an application and the operations it wants to support on the data structure. Which would likely imply, that by nature, one may need one bespoke operation for each semantic change – as opposed to composing more complex operations out of simpler ones. For example, a move cannot just be a delete+insert because the composition the project/retract of those smaller operations does not result in the desired semantics.
- Section 3.5: The comparison to git here seems to depend on a certain view of git that I do not think is universal. For example, you say that to branch an alternate version, it is sufficient to simply copy the document. This does work in Git as well though. Considering that a “Document” in Git corresponds to a file tree, when copying that whole file tree (including the .git folder) one obtains a perfectly functioning independent copy, with comparable history. While the UI problems with Git are well known I think the core conceptual model of git (commits pointing to the current version and the parent commits) is stupidly simple.
- Section 4.1: A typical problem with bidirectional schema migrations is that some information is lost (say, because a field was added). Thus, round tripping an actual data element through both transformations often results in a value being dropped and replaced by some default value, even though the project+retract should be a no-op. Any thoughts how this behaves in your system, and how it could potentially be addressed? Is there a safe subset of operations – it seems that the join/split database operations are safe for example?
- Related Work CRDTs: You state that you are “centralized” which I read as “your system makes decisions that require coordination” but I wonder what those might be? It seems that IDs are added only on inserts, which can be done as a local decision. Essentially, what I am asking is, what if one would just replicate the append-only logs that describe your system state? Including multiple histories?
- Related Work Bidirectional Transformations: I would like to see some comparison to your system, instead of just mentioning the related work (also relates to the question about bidirectional transformations I had above).
Review #15B
Overall merit
3. Major Revision
Reviewer expertise
2. Some familiarity
Paper summary
The submission describes Baseline, a platform for versioning data together with its description (schemas, models, etc.). Authors refer to a previous publication where the challenges were identified, and not that they solved a part of them. They do not observe only data states, but go deeper into the operational semantics and observe changes on this level.
Comments for authors
The paper describes a part of a solution for a significant problem. You refer to challenges identified in the previous study. However, I would expect, as a reader, to face these challenges with their explanations very early in the text, already in the introductory sections, and also to understand how and why these 4 were selected to be solved. What makes the other four different (e.g. more challenging and therefore left for later, or not challenging and left as a simple technical solution)? Maybe to explain shortly in the abstract too, as the reader is left “halfway” to get attracted. Related to the previous point, I do not consider true/false on these problems as an evaluation. I would expect to see some set of different scenarios, a change sample, etc., and then a real evaluation on them – namely, a justification of true/false evaluation.
The related work section is rather good and detailed, but I find its position after the conclusion a bit confusing. I also appreciate the opportunity to try the. I would suggest moving the link to it in the footnote so that it is more visible. Finally, one of my dilemmas is on the aim of the baseline: does it aim to be a platform to “record and observe” the versions of data, or does it have a higher goal to become an environment for changing the data while tracking its changes.
Review #15C
Overall merit
3. Major Revision
Reviewer expertise
3. Knowledgeable
Paper summary
This paper describes a method for managing the changes to structured data. The method is called operation-based as opposed to state-based. It tracks the sequence of operations applied to the data, and as opposed to observing only the state before and state after the changes.
The structure data supported consists of strings, numbers, lists, and records, where lists and records can be nested. The key operations are updates to data and types at a location accessed through a path, which is a sequence of IDs into possibly nested lists and records. The paper describes what it calls operational differencing, that projects and retracts update operations onto data. The paper then describes using these for version control, with what it calls diffing and transferring, adds more update operations for relational databases, and discusses using the update operations for queries, together addressing 4 of 8 challenges studied in their previous work.
Comments for authors
The transfer operation is key for the method to be practical, but it is not precisely defined. The intuition is clear, and the result on the example is obvious, but the detail of carrying it out at such a low-level is totally missing and is unclear how it could apply in general.
For use for relational databases, Figure 9 adds new update operations (4) needed over those operations (17 as I counted) in Figure 2. But DB is a special case of the general structured data, just as you commented at the end of page 14, which means, if the method is set up right, it shouldn’t need more/additional operations but should be the opposite. What’s more concerning is that even the additional operations seem hardly sufficient. For example, one could select some columns that are not consecutive, in which case the split operation that only divides all columns into two consecutive parts does not work. Indeed, the Note at the end of Page 14 mentions other (alternative) sets of operations having been tried.
Now using all those operations for database evolution (all page 14 except the Note at the bottom, and starting the last paragraph of page 12) seem very complicated. Why not just use traditional database updates? That would look much simpler.
Later, for what the paper calls operationalized queries, the idea is to essentially use these update operations to do even queries. This would be even more obviously low-level and procedural. However, the paper mentions this as potentially more general-purpose functional programming. The state-based approach would be much more declarative, with the simplest update operations needed. I think some practical support would be needed for such low-level programming.
In fact, for operationalized queries, yet more additional update operations are needed in Figure 10. Some evidence would be needed to support that this method with ever increasing update operations is indeed easier to use than extensively used state-based approaches.
Overall, the concepts described have been discussed in [28] (https://arxiv.org/pdf/2412.06269v1), and are argued to address 4 of the 8 challenges discussed in [28]. Addressing the other challenges would make this method even more complicated?
Administrator·9d
The reviewers would like the authors to make the following changes.
- Add evidence beyond single examples for the 4 problems addressed by the work
- Add a systematic discussion of what is necessary to support further operations, and what the limitations are
- Provide a deeper explanation of the conclusions from previous study
- Provide more details of the evaluation with a discussion
