
Training an AI Assistant to Provide Feedback on Writing 


Getting Started

I was a member of an artificial intelligence / machine learning (AI/ML) research and development team that faced a number of design challenges with early-stage products built on Natural Language Processing (NLP) components. Drawing on my background in design, statistics, psychometrics, and psychology, I helped solve several design and data challenges related to designing and training an AI assistant that provides feedback on writing.

 

The AI writing feedback assistant was designed and developed to give formative feedback on higher-education student writing. The project included a number of steps:

  1. Generative research, including competitive analysis, literature reviews on effective writing feedback methods, and concept testing different types of writing feedback with instructors

  2. Developing text annotation schemas for data collection and labeling using a text annotation tool 

  3. Training and monitoring subject matter expert (SME) data labelers

  4. Evaluating ML models developed using the training data

  5. Generating writing feedback content (i.e., UX/AI copy)

Client
Pearson
Duration
4 months
Role on Project
AI research & design
Skills Demonstrated

Quant & qual research

Data collection & labeling

Product strategy

Concept testing

UX/AI copy

The Team

  • 2 data scientists, ML engineering

  • 5 data scientists, ML modeling

  • 2 research engineers, front-end development

  • 2 research engineers, back-end development

  • 3 natural language processing (NLP) interns

  • 1 VP AI products & solutions (background: cognitive psychology & CS)

  • (me!) data scientist, AI research & design

Generative Research

Competitive Analysis

I conducted a competitive analysis of products and companies offering writing tools with automated writing feedback capabilities. The analysis focused on the feedback methods currently available on the market and on how that feedback is designed. The following products were included in the analysis:

  • Turnitin

  • Grammarly

  • PEGwriting Scholar

  • Chegg's WriteLab + EasyBib

  • ETS's ScoreItNow! & e-rater (Writing Mentor)

  • Hemingway App

  • ecree

  • Cognii

Results of the competitive analysis revealed a number of opportunities for automated high-prose feedback (e.g., feedback on content, thesis statements, and argumentation) as opposed to low-prose feedback (e.g., grammar, spelling, and punctuation). The opportunity for high-prose feedback was evident for both prompt-specific writing (e.g., writing assignments on a specific topic) and prompt-independent writing (i.e., writing not tied to any instructions).

Literature Review

A literature review was conducted to help determine what constitutes effective writing feedback. Many of the studies suffered from methodological constraints and had low external validity. However, one article stood out from the pack: Patchan et al. (2016) investigated how feedback features affected student implementation of writing feedback and the quality of their revisions. The researchers examined the following feedback characteristics:

  • Praise

  • Solutions

  • Localization

  • Focus of feedback

    • Substance = feedback on content

    • High-prose = evidence, argumentation, thesis statement, etc.

    • Low-prose = grammar, word choice, etc.

  • Amount of feedback

Method/Procedure

The sample size was 7,500 feedback comments from 351 reviewers to 189 student authors. Students wrote a first draft, received peer feedback, and were given one week to revise their drafts based on that feedback. The reviewers were given a detailed grading rubric to use. The writing prompt was as follows: "Write a 3-page paper evaluating whether MSNBC.com accurately reported a psychological study, applying concepts from the Research Methods chapter."

  • IV: Feedback comments were coded for feedback characteristics.

  • DV: Implementation of feedback (yes vs no) and revision quality (no change or decrease in quality vs increase in quality) were coded.

  • Overall research question: What are the effects of praise, solutions, localization, focus of feedback, and amount of feedback on the implementation of feedback and the quality of student revisions?

Implementation Results

Only two feedback features increased the likelihood of implementation: receiving more than an average amount of praise and receiving a localized comment. Several feedback features were associated with a lower likelihood of implementation; specifically, comments that mitigated criticism with praise, offered a solution, or focused on high-prose issues.

Revision Quality Results

Localized feedback was more likely to be implemented, but students were less likely to improve the quality of their papers by implementing it. High-prose feedback was less likely to be implemented, but students were more likely to improve the quality of their papers when they implemented feedback that focused on high-prose rather than low-prose issues.

Limitations

  • The rubric focused on high-prose and substance; hence, the finding that only high-prose and substance feedback were more likely to improve revision quality may have been a function of the rubric.

  • Neither writing ability nor reviewer ability predicted the likelihood of a student implementing a comment or the likelihood of a revision improving the quality of the document (which is surprising). However, the researchers 1) didn't assess whether students understood the feedback or found it easy to apply, 2) didn't assess student motivation, and 3) didn't assess whether students agreed with the feedback or viewed it as credible.

  • The study didn't suggest an ideal length for feedback comments.

  • The study was missing important feedback features; for instance: 1) feedback that includes examples, 2) timing of feedback (i.e., students didn't receive the feedback immediately, which could affect results), 3) order of feedback (i.e., which feedback is highest priority?), and 4) feedback that ties to a rubric dimension or refers back to the prompt.

Concept Test

I mocked up a design concept that provides inline rhetorical analysis feedback (i.e., high-prose feedback) to writers. Six writing tutors provided feedback on this concept to help us understand whether it would meet user wants, needs, and goals. Additionally, the concept test helped us determine whether the value proposition of a rhetorical analysis feedback approach was strong enough for potential users to justify its further development.

Concept test recruitment ad.

Feedback concept testing mock-up.

Concept Test Results


Text Annotation Schema

Schema Definitions

A schema is used to create quality datasets for training NLP models. A number of schemas associated with different writing prompts were developed. One schema, for example, defined a set of key concepts, each representing a core idea or assertion an instructor would expect to see in a student response to a particular writing prompt. Each key concept had a corresponding list of target aspects that a response needed to cover in order to be considered correct.
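As an illustration only, the sketch below shows one way a schema like this could be represented in code; the class, identifier, and field names are hypothetical and not the production schema format, and the target aspects are taken from the exposure therapy example later in this section.

```python
# Minimal sketch of an annotation schema for one writing prompt.
# Class and field names are illustrative assumptions, not the production schema.
from dataclasses import dataclass, field

@dataclass
class KeyConcept:
    concept_id: str                     # short identifier used during labeling
    description: str                    # core idea an instructor expects to see
    target_aspects: list[str] = field(default_factory=list)  # details a response must cover

exposure_therapy_schema = [
    KeyConcept(
        concept_id="definition_exposure_therapy",
        description="The definition of exposure therapy",
        target_aspects=[
            "Exposure therapy is a psychological treatment developed to help people confront their fears",
            "Psychologists create a safe environment in which to 'expose' individuals to the things they fear and avoid",
        ],
    ),
]
```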


Data Labeling

Text Annotation & Rating Task

Three subject matter experts (SMEs), including me as the expert adjudicator, annotated student text and rated text annotation spans using the schema as a guide.

Writing Prompt Topic: A student is experiencing agoraphobia and a psychologist uses exposure therapy (i.e., graduated exposure and flooding) to treat it.

Example Key Concept Text Annotation: The definition of exposure therapy

Target Aspects:

  1. Exposure therapy is a psychological treatment that was developed to help people confront their fears

  2. In this form of therapy, psychologists create a safe environment in which to "expose" individuals to the things they fear and avoid.

Key Concept Text Annotation Span Ratings:

  1. Complete: The student correctly addresses all of the target aspects

  2. Partial: The student attempts to address the key concept but does not cover all of the target aspects

  3. Incorrect: The student attempts to address the key concept but presents incorrect information

  4. Omitted: The student does not address or mention the key concept in the essay


Text annotation and rating data collection tool.
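For each essay, the tool captured which text span (if any) addressed each key concept and how well it did so. Below is a minimal sketch of what one labeled record might look like; the field names and offsets are assumptions for illustration, not the tool's actual export format.

```python
# Hypothetical shape of a single labeled annotation span.
# Field names are illustrative; the actual annotation tool's export format differed.
annotation_record = {
    "essay_id": "essay_0042",
    "labeler": "A",                              # labeler A, B, or the adjudicator
    "concept_id": "definition_exposure_therapy",
    "span": {"start": 118, "end": 236},          # character offsets of the annotated text
    "rating": "Partial",                         # Complete | Partial | Incorrect | Omitted
}
```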

Overall Rating Distribution

Across the 350 student essays, data labelers A and B spent an average of 5 minutes per essay, while the adjudicator took 1 minute per essay.

Agreement Statistics

Rating agreement: exact agreement

0 = No agreement

.01 - .20 = None to slight

.21 - .40 = Fair

.41 - .60 = Moderate

.61 - .80 = Substantial

.81 - 1.0 = Almost perfect


Text span agreement: F1

A good F1 score reflects both few false positives (high precision) and few false negatives (high recall).
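As a rough illustration of how agreement statistics like these can be computed, the sketch below assumes scikit-learn and toy labels from two labelers; it is not the project's actual analysis pipeline.

```python
# Sketch: inter-labeler agreement on ratings (exact agreement and chance-corrected kappa)
# and span-level agreement as F1. scikit-learn is assumed; the data here is illustrative.
from sklearn.metrics import cohen_kappa_score, f1_score

# Key concept ratings from two labelers for the same essay/concept pairs (toy data).
ratings_a = ["Complete", "Partial", "Omitted", "Complete", "Incorrect"]
ratings_b = ["Complete", "Omitted", "Omitted", "Complete", "Partial"]

exact_agreement = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
kappa = cohen_kappa_score(ratings_a, ratings_b)  # agreement corrected for chance

# Span agreement framed as binary token-level decisions:
# 1 = token falls inside an annotated span, 0 = it does not (toy data).
tokens_a = [0, 1, 1, 1, 0, 0, 1, 0]
tokens_b = [0, 1, 1, 0, 0, 0, 1, 0]
span_f1 = f1_score(tokens_a, tokens_b)  # harmonic mean of precision and recall

print(f"exact agreement={exact_agreement:.2f}, kappa={kappa:.2f}, span F1={span_f1:.2f}")
```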

Schema Iteration

Overall, results indicated high agreement in identifying text spans that addressed a key concept; however, ratings of the coverage and correctness of key concepts were less reliable. Through the adjudication process, I iterated on the schema to make future data labeling more efficient and accurate. This included changing the rating scale from "Complete, Partial, Incorrect, Omitted" to "Complete, Not complete, Omitted".

To evaluate the feasibility of key concept detection in the sample essays, pilot experiments were conducted using the annotated and rated corpus. Systems were trained to mimic annotator behavior in both the localization and rating tasks. An existing holistic score for content was used to stratify the corpus, which we then used to divide the 344 essays into a training set (258) and a test set (86). This stratification ensured that the distribution of key concept ratings in the training and holdout sets was similar. For a full description of the analyses, see the unpublished manuscript linked below.
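A stratified split along those lines can be sketched as follows; scikit-learn's train_test_split is an assumed stand-in for the actual tooling, and the holistic scores are placeholders.

```python
# Sketch: stratified train/test split of essays by an existing holistic content score,
# so the score distribution (and hence rating distribution) stays similar across splits.
# scikit-learn is assumed here; the project's actual tooling may have differed.
from sklearn.model_selection import train_test_split

essay_ids = [f"essay_{i:04d}" for i in range(344)]   # 344 annotated essays
holistic_scores = [i % 5 for i in range(344)]        # placeholder holistic scores (0-4)

train_ids, test_ids = train_test_split(
    essay_ids,
    test_size=86,                # 258 train / 86 test, as in the pilot experiments
    stratify=holistic_scores,    # preserve the holistic score distribution
    random_state=42,
)
print(len(train_ids), len(test_ids))  # 258 86
```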
Refining Fine-grained Content Assessment

Submitted author order: Budden, Becker, Wiemerslage, Derr, Rosenstein, Baikadi, Hellman, Murray, Bradford, Burkhart, Farnham, Foltz, Gorman, Roccatagliata

ML Modeling

UX/AI Copy

Feedback Statements

For each predicted text annotation span, a paired feedback comment needed to be displayed to students. A template-driven approach to feedback comment generation was used. Template-driven generation consists of pairing points on the rating scale with predefined feedback comments or template strings for each key concept (sketched after the examples below). Twenty essays from each point on the rating scale (i.e., Complete, Partial, Incorrect, Omitted) were randomly sampled. I read the essays and wrote a set of feedback comments that would be valid across each sample and that could be automatically surfaced based on the rating model's classification of how well the essay addressed each key concept. Partial and Incorrect feedback comments were similar and were collapsed into a "Not Complete" category to correspond with how the data was modeled. Example feedback statements for each classification were as follows:

Complete: Nice job, your definition of Exposure Therapy looks good!

Not Complete: It looks like your essay could be improved by working on your definition of Exposure Therapy.

Omitted: It looks like your essay is missing the definition of Exposure Therapy.
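A minimal sketch of this template-driven pairing is shown below; the dictionary structure, concept identifier, and function name are illustrative assumptions, while the comment text mirrors the examples above.

```python
# Sketch of template-driven feedback generation: each (key concept, rating class)
# pair maps to a predefined comment. The structure and names are illustrative.
FEEDBACK_TEMPLATES = {
    "definition_exposure_therapy": {
        "Complete": "Nice job, your definition of Exposure Therapy looks good!",
        "Not Complete": "It looks like your essay could be improved by working on "
                        "your definition of Exposure Therapy.",
        "Omitted": "It looks like your essay is missing the definition of Exposure Therapy.",
    },
}

def feedback_for(concept_id: str, rating: str) -> str:
    """Return the predefined comment for a key concept and predicted rating class."""
    return FEEDBACK_TEMPLATES[concept_id][rating]

print(feedback_for("definition_exposure_therapy", "Omitted"))
```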

Authoring Guidelines

When using this feedback generation method for another prompt, where student responses are not yet available, templates can be authored directly from the key concept descriptions. While this method provides a mechanism for delivering feedback, the effectiveness of the feedback will be evaluated in future user studies.

  1. Keep a balance of positive and negative feedback

    • Praise is motivating; criticism is demotivating

    • Keep the "negative" feedback nice and approachable

    • Avoid mitigating negative feedback with positive feedback within one comment

  2. Students need to be able to understand the feedback in order to implement it

    • When using "can you elaborate more..." or "deepen your analysis...", include examples or additional help

    • Keep the feedback brief

    • If possible, refer back to the writing prompt and rubric - students want to know if they are answering the prompt correctly

  3. For high-prose (e.g., content) feedback, keep it to 3-4 feedback comments

  4. Localize feedback

 

Next Steps

The next steps include conducting a study that evaluates the quality and effectiveness of AI-provided high-prose feedback in higher-education settings. Additionally, practical takeaways will be applied to future data collection and labeling efforts; for instance, how to best specify the annotation and rating schema and how to best train data labelers. The ML models produced from a project like this are only as good as the data used to train them.