How Crazy is Our Work: Pictures, Words, and the Multi-Modal Madness of Student Submissions
By CEO, GradeUS
Welcome to the multi-modal world of grading, where everything is unstructured and behaves like a random process that simply refuses to be modeled.
If you were to look at my desk (or my downloads folder) right now, you would see a beautiful, chaotic spectrum of human effort. We receive simple text files, heavily formatted word-processor documents, hastily snapped photos of handwritten paper, and sprawling digital canvases. Evaluating this mixed bag is a monumental task. For an AI-based evaluation system, parsing this chaos is as complicated as trying to separate a counterfeit masterpiece from the real deal in a dimly lit room.
“A picture is worth a thousand words.”
That is a wonderful sentiment for an art gallery. But what if your literal job is to extract a thousand specific, highly technical words from that picture?
The Missing Pixels and the Human Supercomputer
Extracting text from an image is a challenging but well-studied problem. Add the complexities of the modern digital classroom, however, and the difficulty compounds quickly. We are at the mercy of resolution, lighting, file compression, and the sheer quality of the capture. The camera angle of a student's smartphone fundamentally changes what we are able to see and evaluate.
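To make this concrete, here is a minimal sketch of the kind of pre-flight check an ingestion pipeline might run before attempting OCR at all. The `SubmissionImage` structure and every threshold in it are illustrative assumptions, not part of any real GradeUS system:

```python
from dataclasses import dataclass

@dataclass
class SubmissionImage:
    """Illustrative stand-in for an uploaded photo's basic properties."""
    width: int               # pixels
    height: int              # pixels
    mean_brightness: float   # 0.0 (black) to 1.0 (white)
    jpeg_quality: int        # estimated compression quality, 0-100

def preflight_issues(img: SubmissionImage) -> list[str]:
    """Flag conditions that commonly sink OCR accuracy before we even try."""
    issues = []
    if img.width * img.height < 1_000_000:        # under ~1 megapixel
        issues.append("resolution too low for reliable OCR")
    if img.mean_brightness < 0.25:
        issues.append("image too dark; ask for a retake")
    if img.mean_brightness > 0.95:
        issues.append("image washed out / overexposed")
    if img.jpeg_quality < 50:
        issues.append("heavy compression; text edges likely smeared")
    return issues

# A hastily snapped, dim phone photo:
photo = SubmissionImage(width=800, height=600, mean_brightness=0.2, jpeg_quality=40)
print(preflight_issues(photo))
```

Catching a hopeless photo at upload time, and asking the student for a retake, is far cheaper than letting a downstream model guess at unreadable graphite.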
Here is the ultimate “aha!” moment when thinking about grading: the human grader is basically an incredibly advanced biological pattern-recognition system. When we look at a poorly lit, low-resolution photo of a student's fluid dynamics homework, our brains instantly go to work. We automatically fill in the missing pixels. We smooth out the smudged graphite. Just like our natural ability to recognize a friend's face peering through a crowd or a partially obscured window, we can recognize a correct equation even if half the denominator is cut off by the edge of the page.
We don't just read; we interpolate. Generative AI, for all its brilliance, still struggles to squint its digital eyes and guess what the student meant to write.
The Reality of the “Mixed Format”
In a perfect world, a well-informed, pre-set rubric dictates the format, and everyone complies. In reality, students use the technology they have access to. Creating beautifully organized, sequentially perfect digital submissions requires high-quality resources, software, and time—luxuries not every student has.
Because of this, we get multi-modal inputs. A student might submit a PDF where page one is typed, page two is a photograph of a hand-drawn graph, and page three is a screenshot of some code. This mixed-format reality is an absolute nightmare for a linear AI agent. When content is disorganized, lacking any specific sequential order, the AI gets dizzy trying to figure out which way is up.
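One way to tame this is to tag each page with a modality before routing it to the right handler. The sketch below is a toy version of that idea; the per-page features and heuristics (text-layer coverage, image area, a crude monospace signal) are invented for illustration, not how any particular PDF library exposes them:

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Illustrative per-page features a PDF parser might expose."""
    extractable_text_chars: int  # characters found in the PDF text layer
    embedded_image_area: float   # fraction of the page covered by images
    looks_monospace: bool        # crude signal for code screenshots

def classify_page(page: Page) -> str:
    """Route each page of a mixed submission to the right handler."""
    if page.extractable_text_chars > 200 and page.embedded_image_area < 0.2:
        return "typed-text"        # born-digital text: the easy path
    if page.looks_monospace:
        return "code-screenshot"   # needs OCR tuned for code
    if page.embedded_image_area > 0.5:
        return "photo-or-diagram"  # needs a vision model, not plain OCR
    return "unknown"               # punt to a human reviewer

submission = [
    Page(1800, 0.0, False),  # page 1: typed essay
    Page(0, 0.9, False),     # page 2: photo of a hand-drawn graph
    Page(0, 0.8, True),      # page 3: screenshot of code
]
print([classify_page(p) for p in submission])
```

Even a crude router like this restores some of the sequential order the AI needs: each page gets handed to a pipeline that knows what it is looking at.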
The Double Duty of Diagrams
The complexity doubles in assignments where illustrations are just as critical as the text itself. Consider a detailed engineering design or a complex data visualization. The student draws a diagram, and beneath it, they write a paragraph explaining the mechanics.
For an AI system, it is no longer just about reading text. The system has to:
- Clearly separate the text from the image.
- Understand the technical merits of the illustration on its own.
- Understand the text on its own.
- Link the two together to determine if the student's words actually match their drawing.
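The four steps above can be sketched as a tiny pipeline. The scoring functions here are deliberately silly stubs standing in for real vision and language models; only the overall shape (segment, score each modality, then score their agreement) is the point:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    kind: str     # "text" or "figure"
    content: str  # raw text, or a figure description from a vision model

def segment_submission(raw: list[Segment]) -> tuple[list[Segment], list[Segment]]:
    """Step 1: separate the text from the figures."""
    texts = [s for s in raw if s.kind == "text"]
    figures = [s for s in raw if s.kind == "figure"]
    return texts, figures

def score_figure(fig: Segment) -> float:
    """Step 2 (stub): a vision model would judge the diagram on its merits."""
    return 0.8 if "labeled axes" in fig.content else 0.4

def score_text(txt: Segment) -> float:
    """Step 3 (stub): a language model would judge the explanation."""
    return 0.9 if "because" in txt.content else 0.5

def consistency(txt: Segment, fig: Segment) -> float:
    """Step 4 (stub): do the words and the drawing describe the same thing?"""
    shared = set(txt.content.lower().split()) & set(fig.content.lower().split())
    return min(1.0, len(shared) / 5)

raw = [
    Segment("figure", "free-body diagram with labeled axes and two forces"),
    Segment("text", "The block stays put because the two forces cancel."),
]
texts, figures = segment_submission(raw)
grade = (score_figure(figures[0]) + score_text(texts[0])
         + consistency(texts[0], figures[0])) / 3
```

Notice that step 4 is a separate score: a beautiful diagram paired with a beautiful paragraph can still fail if the two describe different mechanisms.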
It is like asking a machine to watch a silent movie, read the script separately, and then write a review on how well the actors delivered their lines.
“Out of clutter, find simplicity. From discord, find harmony.”
— Albert Einstein
This quote captures the exact challenge we are throwing at our technology. We are handing over the clutter and the discord of the multi-modal classroom and demanding harmony in the form of a fair, objective grade.
It is a tall order. But the good news? Modern Generative AI models are rapidly evolving to address these exact challenges. We are teaching the machines how to squint, how to interpolate, and how to read the room. The road ahead is crazy, unstructured, and entirely multi-modal—and I wouldn't have it any other way.
Use of Generative AI in refining my initial thoughts in this blog is acknowledged.