# Comments and Suggestions

I think there is a major issue in our assessment strategy that needs to be fixed. This is an assessment-wide issue, though I will describe it specifically for CS146 outcomes (a) and (j). Currently, we have 4 indicators for outcome (a) and 3 for outcome (j), and each is taken independently, almost as a sub-CLO. I don’t think that is the proper measurement granularity. Instead, the set of 4 (or 3) indicators should be used together to measure each student’s level within the CLO.

The current measurement scheme seems to have, as a goal, each indicator question being “equivalently hard”. It makes the mistake of assuming that whether a student answers one question correctly is independent of how they answer the others, and that the chance of getting a question correct varies somewhat smoothly with student ability. That is, you might think that, given a set of questions of even difficulty, perhaps an A student answers each correctly with 90% probability, a B student with 80%, and a C student with 70%. (The percentages can differ from these, but the main assumption is that, for each student, the questions are all pretty much independent.) Then, with a set of questions of this same difficulty, it is fairly easy to distinguish between different student levels.

Unfortunately, a fairly different performance model likely applies just as much: at a certain problem difficulty, any given student will get some large percentage correct (say 90%, where again, the exact percentages aren’t the important part), and then, as the problems get harder, at some point there will be a pretty quick transition to where that same student gets a large percentage incorrect (say 90% again). The difference between A, B, and C students is now measured by how difficult the questions can be before that switch in performance occurs for each.

If, in the 2nd model, the step-like functions for each student type (A, B, C) are fairly steep, the 2nd model and the first model don’t play particularly well together. Much more likely is that each model holds to some extent, in a way that doesn’t match the other, and that actual student performance is some mix of the two. (Other models are in the mix as well, especially the one of good students learning more topics than poor ones, particularly if the Mastery Method is used.) This explains why, in my courses, I give a mix of problem difficulties: I think it is otherwise genuinely difficult to distinguish properly between levels of student performance, which is an important part of our job (probably second only to actually teaching). That is, suppose you could somehow pick a problem difficulty so that a borderline C+/B- student had a 50% chance of answering each question correctly (you hit dead-center on their step-function transition). From those questions, A and B level students would be hard to distinguish from each other: each might have 80-90% of problems correct, but we are only using a few questions to measure the CLO. While we might be able to distinguish those students from the C+/B- student, the gap between A and B would be difficult to see, even though some of these students might be the ones we recommend for graduate school, while others are not. Similarly, the difference between a C and a D student might also be hard to tell, with each getting 10-20% of the problems right. Yet that is a grade difference that also needs to be carefully explored, as some pass the course, and some don’t. (And, of course, designing problems which hit dead-center on that step-function would be very difficult.)
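To make the step-function argument above concrete, here is a minimal sketch that compares an exam of four equally hard questions with a mixed exam of two basic and two advanced questions. Everything numeric in it is an assumption for illustration only: the ability scale, the logistic shape of the step, the `steepness` value, and the question difficulties are not measured from any course data.

```python
import math

# Assumed ability levels on an arbitrary difficulty scale (illustrative only).
ABILITY = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

def p_correct(ability, difficulty, steepness=4.0):
    """Step-like chance of a correct answer: 0.5 when difficulty equals ability."""
    return 1.0 / (1.0 + math.exp(-steepness * (ability - difficulty)))

def expected_correct(grade, difficulties):
    """Expected number of correct answers for a student at the given grade level."""
    return sum(p_correct(ABILITY[grade], d) for d in difficulties)

def gap(better, worse, difficulties):
    """Difference in expected correct answers between two grade levels."""
    return expected_correct(better, difficulties) - expected_correct(worse, difficulties)

# Four questions pitched dead-center between the assumed B and C levels.
UNIFORM = [2.5, 2.5, 2.5, 2.5]
# Two basic questions and two advanced questions (assumed difficulties).
MIXED = [1.5, 1.5, 3.5, 3.5]

for name, exam in (("uniform", UNIFORM), ("mixed", MIXED)):
    gaps = {pair: round(gap(pair[0], pair[1], exam), 2) for pair in ("AB", "BC", "CD")}
    print(name, gaps)
```

Under these assumed numbers, the uniform exam separates B from C widely but leaves the A/B and C/D gaps small, while the mixed exam widens the A/B and C/D gaps at the cost of a narrower B/C gap. A different step shape or spread of abilities would change the sizes of these gaps.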

With a mix of problem difficulties, it becomes much easier to properly distinguish different student levels. If there are 2 basic and 2 advanced indicators for an outcome, and the basic questions are targeted at a C student, then a student who answers one basic question fully and one partially may qualify as satisfactory. A student who answers both basic questions correctly, and has one advanced question correct (or both partially correct), may qualify as exemplary. While this makes data collection more difficult (indicators cannot be assessed independently of each other), it will allow for an easier normalization between professors, as well as between different sections taught by one professor. (For one of my indicators, one section had a difficult question; the other section had an easy question. Their performance on that indicator alone was very different, while being similar on others. I suspect that the difference is due to the question, rather than to any marked degradation in my teaching between the 9:00 and 10:30 sections.)
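The example thresholds above could be written down as a small combining rubric. The following is only a sketch of one possible encoding, not a proposed standard: the `Score` values (0, 0.5, 1) and the function name `rate_outcome` are assumptions, and the lowest level uses the current “Beginning” label.

```python
from enum import Enum

class Score(Enum):
    """Per-question result; the numeric values are an assumed encoding."""
    INCORRECT = 0.0
    PARTIAL = 0.5
    FULL = 1.0

def rate_outcome(basic, advanced):
    """Combine two basic and two advanced indicator scores into one CLO rating."""
    basic_total = sum(s.value for s in basic)
    advanced_total = sum(s.value for s in advanced)
    # Exemplary: both basic questions correct, plus one advanced question
    # correct or both advanced questions partially correct.
    if basic_total == 2.0 and advanced_total >= 1.0:
        return "Exemplary"
    # Satisfactory: one basic question fully correct and the other at least partial.
    if basic_total >= 1.5:
        return "Satisfactory"
    return "Beginning"

print(rate_outcome([Score.FULL, Score.PARTIAL], [Score.INCORRECT, Score.INCORRECT]))
print(rate_outcome([Score.FULL, Score.FULL], [Score.PARTIAL, Score.PARTIAL]))
```

Note that this sketch rates a student with two partial basic answers as “Beginning” even if both advanced answers are correct; an actual rubric would need to decide how to treat such cases.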

While I appreciate the “normalization” that the indicators are supposed to give us, that rigidity tends to lead to one of two things: either the test is written with a high percentage of questions targeted to be of the exact form specified by the indicators, or tests are written that don’t quite map to the indicators. Instead, the indicator questions could be taken as 4 examples of questions targeted towards an outcome: you can pick your own if they match your exam or assignment better, as long as you then combine multiple indicators over multiple difficulty levels to get a rating for each student over the CLO. This would make for a much easier comparison over semesters and professors. (It will still be hard to compare over professors, but it would be easier.)

In particular, in comparing results to last spring, just looking at individual indicators, it will be very difficult for our numbers to improve. I didn’t teach last spring, and I tend to give exams with significantly lower averages than my colleagues, mixing “vanilla” problems (which might look similar to other instructors’ questions) with more “interesting” (challenging) ones (which are probably not so similar). On my standard questions, it is reasonable for a large percentage of the class to get the answer correct. On the challenging ones, there is no chance. You can see this very clearly in comparing the first two indicators for outcome (a): in the first, 42 out of 50 passing students failed to answer the indicator at a satisfactory level. For the second, 44 answered it at an exemplary level. In the latter case, the problem was stated in a very straightforward way: here is a recurrence equation, use the Master Theorem to solve it. The first problem was much harder: here is some recursive pseudocode, calculate its runtime, but there is some obfuscation: the program itself calculates a recursive function, but not the same one as its own runtime. It is the type of problem that becomes easy if a similar example is shown ahead of time. But here, the students had not seen any similar problem, and it confused all but the best of them.

As another justification for this approach, it should be noted that, in order to use the previous approach and create problems of the same difficulty each semester, it certainly becomes easiest to simply create different instances of the same questions. But, if students learn the questions from the previous semester, this rewards them for studying previous test questions over the course material in general. It is difficult enough to make original test questions, and I do expect that sometimes previous test questions (with new instances) may be used, and mixed with new questions. In order to avoid students simply studying for known test-question types (as well as instructors “teaching to the test”), we cannot let assessment act as another hurdle to creating original test questions, just because they don’t exactly fit the indicator, or because they might be of a different difficulty.

I propose keeping the same general suggested (but not required) indicators, but allowing their results to be combined to measure the outcome as a whole, rather than measuring each indicator individually as a “sub-CLO”. This will allow for much greater variation in constructing individual questions, giving a fuller ranking of student abilities over a wider spectrum, as well as helping to alleviate any problems from missing an individual indicator on an exam (by allowing an easier substitution of a different indicator). Instructors can then each create a rubric to combine results for each student to evaluate the outcome performance as a whole.

**Summarized Suggestions**

1. Discuss general student pass-rates between instructors of this gateway/filter course.

2. Experiment with a semi-flipped classroom, allowing for more time spent solving problems in class.

3. Consider combining indicators, of different difficulty levels, to create one overall CLO measure per course. The combination rubric would vary by section, allowing for more problem flexibility in form and difficulty.

4. (Minor): The euphemistically named “Beginning” rating should be renamed to “Unsatisfactory”.