Systems and processes retrospective (AI Alignment March 2024)
This part of the retrospective is primarily for me and Li-Lian, to understand how we can continue making it easier to run a high-quality course at scale.
How was the participant and facilitator experience?
This section looks at the journey most people took through the course, how that experience was for them and what we did to improve it.
Marketing
We discussed this in our AI alignment March 2024 course retrospective for getting good people on the course. In short, most people came from referrals, with paid ads performing quite poorly.
Landing on website
We solved many ‘obvious’ problems on our website. Our website was previously poor at explaining what we did, and didn’t communicate even the basic information people want about our courses - let alone the value of taking them. Early this year we made substantial changes after reviewing the site ourselves and conducting user interviews. You can see the differences we made on our homepage (before, after) and alignment course page (before did not exist, after). Key changes we made included:
- Explained the key value propositions of the course by adding “Why do the course” sections.
- Published course dates, so it was clear when the course people were signing up for would happen.
- Added buttons to apply for the course, rather than having the apply link hidden in the text.
- Removed irrelevant information, for example a confusing ‘We’re hiring’ button in the prime position on our homepage that didn’t even work. This also included deleting a lot of distracting pages or broken links.
- Removed incorrect information, for example a prompt to ‘Please register your interest to be notified when future dates are finalised’, even though we knew the dates and the form was actually an application.
- Improved the testimonials displayed. The ones on the website were not nearly as good as they could be, and we already had a big bank of better testimonials to draw from.
- Published public information about facilitating the course.
- Improved the quality of content. For example, under the title “What is AI Safety?” we previously had “New technologies bring new opportunities for progress, but often come with novel risks.” This does not answer the question - we’ve now changed the content so it does.
- Created a page to list all our blog articles, so that they’re indexed by Google and can attract people to our courses.
- Removed features that annoyed most visitors, like smooth scrolling and page transitions.
User interviews highlighted the importance of these changes: they suggested these problems were serious, and that the updated website solved them for users. Some key takeaways included:
- People who applied before we updated the website expressed frustration about not knowing when the course started and how long it was (“kinda feel like I applied in the dark”), and how the discussion sessions worked (“would like to know how to plan for it, e.g. can I go on holiday, how it combines with my work”).
- People who applied before the website updates, but were later shown the new version, agreed the website was improved. Some said it would have both made them more likely to apply and helped them write a better application.
- People who applied after the website updates seemed satisfied that this information was provided, and didn’t identify anything missing beyond wanting slightly more clarity on what we were looking for in their applications.
Applying
User interviews with people who applied resulted in several improvements. These included:
- Many participants mentioned second-guessing what BlueDot Impact was looking for in applicants. In response, we updated the form to make questions clearer and provided more guidance on what we were looking for (and plan to post more soon).
- Some applicants felt unable to express themselves thoroughly under the answer length guideline of just 4 sentences. In response, we widened the guideline to up to 7 sentences per question.
- Some participants noted the university and city selection process was slow to load. In response, we removed the university selector, as we found it wasn’t a great predictor of applicant success. We decided to keep the city selector, as it helps us connect people to relevant opportunities, among other functions. We have started looking into ways to make the city field better (e.g. using Google Maps autocomplete), and are waiting on changes from one of our suppliers.
- Some participants got confused by the collapsing sections and did not realise some were collapsed by default. In response, we set all the sections to open by default. We also provided feedback to our form supplier on improving error messages if a user tries to submit a form with a required field in a collapsed section.
- Some applicants were confused by the options in the demographic questions. We had previously taken the categories from the UK and US censuses, but there was significant overlap between these, which confused participants. We considered removing this question completely, but decided to keep it as it seems potentially useful for future analysis to ensure we’re treating participants fairly, and for evaluating any potential campaigns to support people from diverse backgrounds to get into AI safety. We ended up merging the categories into a narrower list (which includes an ‘other’ option so people who don’t identify with any of the listed options can express themselves how they would like).
- Facilitators noted some facilitator-specific questions were vague. A question that asked how much of the course content facilitators were “familiar with” was particularly bad: the idea was to screen out people who knew very little about the kinds of topics the course covered, but in practice this wasn’t hugely helpful for us evaluating them, and had confused everyone we interviewed. We have since deleted the field and updated other questions people were confused by.
We updated our application questions based on learnings from evaluating applications this iteration. For more on what we learnt evaluating applications, see our AI alignment March 2024 course retrospective for getting good people on the course. Changes included streamlining the key questions and more closely matching them to the scoring criteria - subjectively, this seemed to work better on the June course.
We decided not to ask people to pay upfront. We’re primarily grant funded at the moment, but explored the idea of becoming more financially independent by charging participants for courses. This would also validate whether we’re delivering value to users in their eyes. However we ended up deciding not to charge people upfront after learning more about pricing strategy and interviewing users:
- People who had taken other courses with us said they would have been willing to pay much more after understanding the course value, but didn’t know this before (although after the course there is no direct incentive to pay, which decreases the likelihood they would actually pay).
- Most applicants gave numbers that were much lower than what recent graduates said the courses were worth to them.
- People gave completely incompatible ranges. We asked people for a minimum (below which they would think it’s too cheap to be good, and be suspicious of it) and a maximum (above which they would think it’s too expensive and would not purchase it). A lot of these ranges did not overlap, including for many people who could be strong participants. Minimums ranged from $10 to $500, while maximums ranged from $85 to $2000.
- Several potentially very strong participants said cost would be a blocker for them. We noticed this was particularly common among civil servants and students. Compared to tech workers, both of these groups are on much lower incomes and aren’t able to get an employer to pay for courses.
Confirming place
We published clear decision deadlines. These were:
- Applications close: 7 February 2024
- Application decisions made by: 16 February 2024
We met this deadline, and responded to most applicants much earlier. We sent 100% of application decisions by the published deadline, and had sent decisions for 86% of participant applications and 81% of facilitator applications by 9 February.
This was a significant improvement on previous courses. For comparison, we previously did not specify dates publicly, so people were left in the dark. We used to get back to people more than 2 weeks after the deadline.
To confirm their place, applicants submitted their time availability. Our availability service collects people's time availability (similar to When2meet or LettuceMeet) and confirms their place on the course.
This generally worked well. This iteration was the first time we were using time availability submission to confirm course places. Nobody said they got stuck here, and most people did provide availability by the deadline without issue.
Some people submitted limited availability. The main difficulty people experienced was not understanding they needed to give larger blocks of time (e.g. we cannot schedule a 2-hour session if a participant has only given 1 hour of availability). Some participants gave very restricted availability, or submitted their availability late, which made finding them a cohort difficult. In response, we have:
- Forced people to give longer blocks of availability.
- Made the related warnings more obvious and easier to understand (but this could still be improved).
- Considered warning people who give unschedulable blocks (a rough sketch of what such a check could look like follows this list). We think this would be useful to implement, but haven’t gotten around to it yet.
- Considered giving people longer to provide availability. We decided against this for now given we also want a short time between people applying and being able to start the course.
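To illustrate the kind of warning we have in mind, here’s a minimal sketch in Python. It assumes availability is stored as simple (start, end) pairs and a 2-hour session length - both assumptions for illustration, not how our availability service is actually implemented.

```python
from datetime import datetime, timedelta

SESSION_LENGTH = timedelta(hours=2)  # assumed minimum block we can schedule into


def unschedulable_blocks(blocks: list[tuple[datetime, datetime]],
                         session_length: timedelta = SESSION_LENGTH):
    """Return the availability blocks that are too short to fit a full session."""
    return [(start, end) for start, end in blocks if end - start < session_length]


def should_warn(blocks: list[tuple[datetime, datetime]],
                session_length: timedelta = SESSION_LENGTH) -> bool:
    """Warn when none of the submitted blocks could hold a whole session."""
    return len(unschedulable_blocks(blocks, session_length)) == len(blocks)
```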
Others want to use this system. Multiple people who have taken our course have asked us for access to this and our scheduling system. This is probably a sign that they found it a good tool. We have now open-sourced both (availability, cohort scheduler) in line with our commitment to working in public, but still have work to do to make this easier for people to get up and running.
Course onboarding
We published clear onboarding deadlines. These were:
- Applications close: 7 February 2024
- Course starts: 3 March 2024
We met this deadline. The first lead cohort discussion session was on 27th February 2024.
We’re continuing to improve here. We have worked on decreasing the time from applications close to the course starting. For the June 2024 iteration the application deadline was 3rd June 2024, and the first lead cohort session ran on 18th June 2024.
We sent overly complex onboarding emails, which we’ve since simplified. They contained a lot of content and multiple calls to action that were not obvious. Based on the questions we received from participants, we think it’s unlikely they actually read these emails carefully. We’ve since cut these emails down significantly and made the call to action much clearer. You can see a before and after:
Scheduling cohorts was generally fine. Our cohort scheduling tool helps us find times for cohorts to meet that best match people’s time availabilities and experience levels. It worked fine without any major changes. A minor change we made was to better support some functions when running courses in parallel.
People appreciated us sending them calendar invites. “I found the calendar integration very useful!” A few people had difficulties receiving the calendar invites, particularly those using Microsoft 365 calendars - resending these invites seemed to help.
93% of confirmed participants got to their icebreaker session successfully. This suggests we gave the vast majority of participants sufficient information and notice, and scheduled the session at a convenient time. Several of the participants who did miss their session contacted us to say it was due to factors outside of their (and our) control: for example, a storm taking out their internet, or being delayed on a flight for several hours.
Opening event
We ran a poor-quality initial opening event. We tried to cram too much content into the opening event, and it was somewhat unclear what the purpose of the event was. We hadn’t thoroughly practiced and tested running large Zoom events and hit technical issues at the start of the call.
We ran a much better second opening event. Because we ran two opening events (to help people in different timezones attend one), we were able to test a very different plan. We instead focused much more on networking and completing the session 0 exercises. We had ironed out most of the technical issues and it went much more smoothly. Going forward, we’re running even simpler opening events that primarily focus on peer networking, which have been well received on our later courses. We have also written up our learnings for running larger events on Zoom, including having a person separate from the host who is responsible for ensuring all the tech works.
Learning phase: course hub
Participants could access key information on the course hub. Our course hub hosts the course materials, cohort details, and certificates. It also enables event registrations, cohort switching for participants, and session time changes or swaps for facilitators. In discussions with users and end-of-course feedback, participants said:
- They liked being able to track their progress and view the resources and exercises.
- People appreciated it working on both desktop and mobile. Mobile compatibility was useful for people being able to read resources throughout the day e.g. while travelling on public transport.
- People appreciated not having to sign in to access the resources. They mentioned this made it easier to survey the course, and look things up on different computers.
- Facilitators knew how to use the more advanced features, including changing session time, although they did this rarely.
The course hub has significant room for improvement, particularly with reliability and performance. The most frequent negative feedback in the end-of-course survey related to course hub bugs (16% of responses).
- Participants were particularly frustrated by losing exercise responses: “The course hub was a bit confusing sometimes and when I would hit submit on my exercises, they would not update and I would lose a lot of work.” We’ve been working on improving some of these problems, and migrating suppliers to ones we hope will be much more reliable. However, this takes time and the course hub still has many of these issues - in the meantime we have been advising participants to save a copy of exercise responses in a separate document to avoid losing them.
- Participants were also frustrated by losing reading completions: “The CourseHub was acting up sometimes with marking readings and not complete yet even though they were.”
- Course hub loading and performance issues also caused problems for participants and facilitators. They noted that pages not loading properly or quickly when trying to join sessions was stressful: “The Course Hub could be more stable. It was sometimes unresponsive shortly before or during the sessions, which caused some disruption.” We have since also included meeting join links in Slack and on the calendar invites.
- Participants also noted the cohort switching tool often took a very long time to load or didn’t load at all, making it harder to switch cohorts or causing them to request manual switches (the impact on us is explored more below).
- A facilitator suggested the course hub could better support them in finding cover for sessions they might miss. We’re not sure we want to encourage switching facilitators too much at the moment, or whether it’s a big enough problem for us to focus on. However, we’ll review this to see if there is more demand in future or if our existing systems are failing here.
- Teaching fellows found the course hub particularly slow, likely due to the additional data that had to be loaded. We ended up mitigating this somewhat by exporting CSVs or sharing Airtable views of their cohorts with them.
- People not on the facilitated course were confused because they didn’t have context on parts of the course. We’ve since updated the website with more details about the course, and moved more course communications to core resources on the course hub itself rather than just sending emails to participants on the facilitated course. This makes it a better experience for independent learners and local groups running the course, and gives us flexibility to update guidance in response to feedback more easily.
Course participants switched cohorts a lot. We received 1052 cohort switching requests, of which 15% required manual intervention from our team - we discuss this more in “What manual tasks took up our time?” below. At this time I think we’re striking the right balance of flexibility vs cohort stability (we recently reviewed this on the AI Governance (April 2024) course) but we are keeping an eye on this.
Learning phase: Slack
In general, people found the Slack useful for coordination and networking. Participants engaged with the Slack in different ways, including some who spent a lot of time in wider discussion channels and others who mostly kept within their cohort channels. Activity within cohort channels also varied a lot: some were very active, arranging in-person meetups or sharing resources - while others were quieter and generally just discussed joining the discussion sessions. Most feedback on the Slack was positive.
Some facilitators found the Slack overwhelming. For example: “I did find that it took me a little more time than I'd preferred to keep up with the channels and to read through all of the #facilitator discussions to sort through whats important and what isn't.”
Learning phase: meet
Participants used our meet service to join their cohort discussions. Our meet service connects participants to the right Zoom rooms and records participant attendance.
In the first couple of weeks, some participants using meet had audio issues returning from breakouts. This was due to a bug in the Zoom Web SDK on certain browsers. We mitigated this by opening the Zoom app by default (rather than the web version), and providing a workaround in the icebreaker session plans (mute and unmute yourself). We reported this to Zoom and updated the SDK when a new version was available that fixed this.
The meet service generally worked without significant issues after week 2. No other significant positive or negative feedback was given on the service after this point (minor feedback included one suggestion to provide more guidance on how to use Zoom for participants, and another was to use Microsoft Teams as it was better supported on government computers).
Learning phase: removing inactive participants
We introduced automations this course iteration to automatically remove inactive participants. This removed people from the course who missed two sessions in a row. Previously we removed inactive participants manually (although we usually weren’t that proactive about doing this).
Removing inactive participants didn’t seem obviously beneficial on this iteration. We don’t have attendance data for a previous alignment course to compare against, and the attendance curves for our pandemics and AI governance courses are so different it’s hard to compare. There aren’t obvious patterns in the absences data that show reduced absences: for example, I might expect absences to decrease significantly in session 3 of the project stage (compared to sessions 1-2), corresponding to the drop-off in attendance moving into the project phase. However, there was only a 3% drop in absences from sessions 2 to 3, and absences seem fairly constant throughout the project stage.
This might have been due to bugs in the implementation. Fairly late in the course, we realised the automation had been removing some participants it shouldn’t have, due to a bug (which has now been fixed). The automation could also have failed to remove participants who should have been removed, which might explain the lack of an effect above.
The automation was disruptive on this course iteration. A bug in the automation resulted in erroneously sending emails to some participants telling them they had been removed from the course. They were understandably confused about why they received the email, which resulted in a lot of manual contacts: a bad experience for participants and substantial extra clean-up work for us.
A correct implementation probably wouldn’t be too disruptive. We received very few contacts from people who had been removed from the course for genuinely missing sessions, and I think in principle this would have been helpful for managing attendance had it worked properly.
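For reference, the rule itself is simple. A minimal sketch of the intended logic (ignoring cohort switches, deferrals and the other edge cases the real automation has to handle) might look like this:

```python
from dataclasses import dataclass


@dataclass
class Attendance:
    participant_id: str
    session_number: int
    attended: bool


def participants_to_remove(records: list[Attendance]) -> set[str]:
    """Return participants who missed two sessions in a row."""
    history: dict[str, list[Attendance]] = {}
    for record in records:
        history.setdefault(record.participant_id, []).append(record)

    to_remove = set()
    for participant_id, sessions in history.items():
        sessions.sort(key=lambda r: r.session_number)
        consecutive_misses = 0
        for session in sessions:
            consecutive_misses = 0 if session.attended else consecutive_misses + 1
            if consecutive_misses >= 2:
                to_remove.add(participant_id)
                break
    return to_remove
```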
Learning phase: audio quality improvement
We ran a scheme to improve discussion audio quality. We published guidance on improving audio quality, and ran a reimbursement scheme for participants to buy better microphones. This was inspired by previous participant feedback from multiple people that having fellow cohort members with bad audio resulted in a worse learning experience, and our own experiences reviewing some lead cohort sessions.
It’s hard to evaluate the impact of this scheme. We didn’t receive feedback regarding audio quality issues this iteration, which could suggest it helped. Our initial reasoning for the scheme was that we were well placed to offer small grants (as we had infrastructure already set up).
We have some plausible evidence it was helpful. Spot-checking people who purchased new equipment, they stayed engaged with the course (although we think most of the benefit is actually to the other participants they’re in a cohort with!) and some offered more context explaining how helpful they thought it would be. We also think it’s likely that setting up such a scheme made it more likely that others would read our audio quality guidance and consider upgrading their setup without applying for funding (also see similar reasoning in our rapid project grants below). Anecdotally, audio quality in lead cohorts did seem to improve after launching this scheme.
It was very cheap, totalling £261.39. 7 people made claims, which were all approved and paid. A minor adjustment was made to one claim to cap it at the £30 limit. It also took very little of our time to set up this scheme, and review claims.
We were fast to approve and pay claims. On average, we paid[1] claims in 52 hours.
Project phase: guidance
People wanted guidance earlier. Several participants asked about the projects before we had properly planned this out. Some facilitators highlighted that participants were keen to understand what was happening with projects earlier. We have now published guidance for future iterations.
People wanted examples of projects. Another common request was for project examples earlier. We have now published lots of great examples.
People wanted clarity on late submission, particularly with regard to prizes and certificates. We had not really figured out this policy until people asked us. For this iteration, we only evaluated projects submitted by the deadline for prizes, and awarded certificates to participants who either attended almost all learning phase sessions OR attended most learning phase sessions and submitted a moderate or high-quality project. We could have decided on this earlier and published it as a clearer policy - we still have not done this for the AI Alignment June 2024 course.
Some participants didn’t know where to submit projects. We initially did not link the project submission form directly from the course hub, and the project submission resource had a lot of unrelated links. We updated both to make it clearer where participants could submit a final project.
Project phase: onboarding
Reshuffling cohorts was a mistake. This caused confusion among participants and facilitators, created extra work for us, and some participants explicitly said they would have preferred staying in their original cohort: “I would have preferred to keep the same cohort [for the project sprint] as the discussions for the first 8 weeks.” We explore this a little more in our course development retro.
Project phase: rapid grants
We launched a rapid grants scheme to help participants with their projects. We had not done this before. We launched it because several participants asked for support here, and we had the infrastructure to do it easily. The scheme primarily focused on giving small grants of £50 to £2000 to unblock low-cost projects (the median application was for £267, and the median granted amount was £200). It operated on a reimbursement model and covered direct costs: for example, people could purchase compute, software tools, travel tickets etc. and would be paid on presenting receipts.
Without the scheme, participants seemed unlikely to have achieved their project goals. We framed the scheme as “targeted at people who, without funding, would not be able to reasonably afford the resources they need for their chosen project”, and we asked most applicants what they would do if they did not receive funding. Most said that without the scheme they would either (1) not do the project at all, or (2) do a much more limited project that we also thought would be considerably worse. One participant mentioned that they probably could pursue the project without funding (by using worse tools), but it would take away time from doing the project well.
These project goals were ambitious, and several impressive projects came out of the grant scheme, including 2 project winners and several other high-quality projects. The grants resulted in tangible outcomes (such as specific project outputs), several of which we thought were high quality. This is despite the primary focus being to support effective learning, rather than directly impactful project outputs. Grants we made included:
- £200 to evaluate four different defences against poisoned RAG, write up the results, and publish the code to do this. This project won the 'Novel research' prize, a new prize category we created after an expert judge we brought in said some projects were so excellent they needed to win something, but weren’t captured well by the existing prizes.
- £301 for evaluating large language models’ ethical, political and moral beliefs in different languages, building a visualisation tool for viewing these results, open-sourcing this tool, and writing a corresponding blog analysing this. This project won the runner-up 'Interactive deliverable' prize.
- £93 to build a tool that automatically identifies AI-relevant documents from the US federal register, host the results on a website (including LLM summaries of how each document relates to AI), and open-source this tool.
- £146 for running an existing benchmark for evaluations of dangerous capabilities (bio, chem and cyber) for 11 open-source models, and writing up the results.
- £788 to develop a novel interpretability technique to control model behaviour without decomposing features with SAEs, and publishing the code behind this work. This grant was made to a pair of participants after initial work showed promise, and they needed more funds to cover compute for further experiments.
We likely helped other participants, even those who didn’t apply for the grant scheme. In discussions with participants, a couple mentioned that knowing that the grant scheme was there helped them be more ambitious with their project ideas, even though they didn’t end up applying. For example, one participant wanted to run some open-source models and was worried about the costs. By having the grant scheme, they went through the process of calculating an estimate of costs and realised it would cost them ~£10 instead of the thousands they thought it might be, and just self-funded their project. They said without the grant scheme they definitely would not have calculated the cost, and likely would instead have done a project that would be much less useful for their professional development (and in our opinion, less likely to help them get into AI safety).
It was relatively cheap, and we expect this to cost us a total of ~£2250. So far, we have paid out £1329.63, but at the time of writing the deadline for claims is just under a week away. Based on the estimates people gave us when applying, plus how claims have compared to estimates so far, we expect to pay out about £2000 in total. In addition, we expect we spent no more than £250 of staff and contractor time processing grant proposals and reimbursement claims.
We were incredibly fast to approve grants. On average, we made final decisions about grant applications in 7.2 hours. Our slowest response was 31 hours, because the proposal was submitted during a bank holiday weekend.
We approved 93% of grant applications. This includes proposals we agreed to fund but at a different limit than requested: for some projects we offered more funding than was requested, and for others we offered less where we thought they could be carried out more cheaply. The high acceptance rate likely reflects that we had already effectively pre-screened for interest and some level of expertise, given applicants had all just spent 8 weeks learning about AI safety. We also already put a lot of effort into helping people develop strong project ideas, and the reimbursement model makes offering funds much less risky for us and less attractive to anyone just trying to extract money for unrelated work or benefits. It might also suggest more people should consider applying for funding.
We were fast to approve and pay reimbursement claims. Excluding one complicated claim, we paid claims in an average of 3 days. The complicated claim took 20 days, and was due to a participant living in a country our bank didn’t support payments to, and there was some back and forth to arrange a PayPal transfer instead.
Closing event
Overall, the closing event went very well. We received positive feedback on the event, and it generally ran pretty smoothly. This is in large part due to our excellent graduates that we’re very proud of - both those who presented great projects with little notice, and those that engaged with others in the final discussion breakouts.
We could have been clearer on the plan ahead of time. We had planned to have participants give presentations in group breakouts. We later realised it would be difficult to do this on Zoom, as well as to ensure presentations were all high-quality and engaging. We therefore didn’t run these breakouts, meaning some participants prepared presentations that weren’t used in the closing event, which was likely frustrating (although we encouraged participants to record and publish their presentations on YouTube or other social media). Going forward, we plan to commit to the structure we used in this event from the start, given it worked very well, so we don’t change the plan on participants.
Certificates
Distributing certificates went smoothly. Generally we had few contacts about issues with certificates, while we have seen several participants posting about them on social media suggesting they do value them.
We could have been more transparent about the criteria for granting certificates. Most contacts about certificates were queries about eligibility. We decided on the exact eligibility criteria somewhat late, and while we think they were still fair, being more upfront about them might have acted as a better motivator and given participants greater reassurance.
We could issue certificates to facilitators. One facilitator said they’d have appreciated getting a certificate similar to participants’, but stating that they facilitated the course. This seems like a very low-cost way to recognise the excellent work done by our facilitators.
What manual tasks took up our time?
Curriculum development
Building a high-quality curriculum took up the most time. This involves figuring out the right overall structure, identifying relevant learning objectives for each session, finding (or creating) corresponding resources, and developing exercises and activity plans. This is also an ongoing process because we continually update the curriculum based on feedback as people are going through it.
I expect curriculum development to take 70-80% less time for the next iteration. This guesstimate is based on the fact that 5/8 learning sessions are likely staying mostly the same, and because I myself have an improved understanding of the AI safety landscape and the resources available.
Surprisingly few great (or even passable) resources exist on a lot of core topics. For example, we thought there was no excellent introductory resource to scalable oversight, mechanistic interpretability, or technical AI governance at the time of creating the course. We’d really appreciate people writing better summaries and introductory explainers, and have posted some thoughts about articles we’d like to exist.
Writing our own resources was useful, and may help attract more people to our courses in future. People often commented how much they appreciated our summaries, and said they were some of the best resources on the course - particularly in how easy they were to read, and relevant to the learning goals they had on the course. Given other feedback suggesting people look at a lot of our website before applying, having well thought out articles might also support their decision to apply or otherwise trust us.
Comprehension questions seem very valuable to create. Many people appreciated having these as a basic knowledge check, and despite them always being optional exercises a large number of participants attempted them. Creating high-quality questions takes practice: they should be quick for participants who know the right answers to race through, but still force participants to achieve the learning objectives (at least the knowledge parts, particularly for areas that are often misunderstood). However, this doesn’t require as much effort as other parts of course design. I’d recommend others creating courses publish similar comprehension questions.
AI tools are surprisingly bad at helping with resources. Many people suggested we use large language models to help us identify or write new resources. On the whole, we found that they usually hallucinated non-existent resources, and AI tools grounded on internet data don’t tend to find resources we weren’t already aware of. On writing, they often got technical concepts wrong or didn’t write in an engaging or accessible way (although they were often better than whatever existing resources we could find on some topics).
AI tools were useful for developing comprehension questions. With a clear prompt, especially one that provides examples and a lot of constraints, AI tools can be very helpful for generating ideas for comprehension questions (and their answers) based on resources. They’re irrelevant or vague ~50% of the time, and wrong ~30% of the time, but we can simply generate many more questions and answers and filter to the 20% of good questions.
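As a rough illustration of this ‘generate lots, then filter’ workflow (the prompt wording and the call_llm placeholder are assumptions for the sketch, not our actual tooling):

```python
PROMPT_TEMPLATE = """You write comprehension questions for an AI safety course.

Constraints:
- Only test what the resource below actually says (no outside knowledge).
- Each question has one unambiguously correct answer.
- A reader who met the learning objective should answer in under 30 seconds.

Learning objective: {objective}

Example of a good question and answer:
{example}

Resource:
{resource_text}

Write 10 questions, each with its answer."""


def call_llm(prompt: str) -> str:
    """Placeholder for whichever chat model API is used."""
    raise NotImplementedError


def draft_comprehension_questions(resource_text: str, objective: str,
                                  example: str, batches: int = 3) -> list[str]:
    """Over-generate question/answer drafts for later human filtering."""
    drafts: list[str] = []
    for _ in range(batches):
        response = call_llm(PROMPT_TEMPLATE.format(
            objective=objective, example=example, resource_text=resource_text))
        drafts.extend(block for block in response.split("\n\n") if block.strip())
    # Only roughly 20% of drafts tend to be usable, so a human filters these.
    return drafts
```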
AI tools were not useful for developing other exercises. We tried many different prompting techniques, but found AI systems were not that helpful for developing ideas for exercises. A lot of exercises also depend on what is available out in the world, and AI systems that can’t access this aren’t great at the task.
AI tools were somewhat useful for generating activity plans. They could generate many reasonable activity plan ideas, although required a lot of context about our flipped classroom model and other constraints. Again, ~80% were rubbish but they were useful for the 20% of good ideas - or the bad ideas that inspired better session activities.
Reviewing resource feedback during the course was difficult. It’s useful to be able to see ratings and feedback as they’re coming in so we can update the course in response to this feedback. Unfortunately, issues syncing the data between our course hub platform and the interfaces we use to design courses made this much more difficult. Instead I regularly exported resource feedback as CSVs and reviewed this manually, but this did take quite a bit of time and effort.
Reviewing exercise submissions during the course was difficult. Similarly to resource feedback, being able to see what people are writing for the exercises can help inform whether the exercises are clear enough and how people are engaging with them. Unfortunately due to course hub bugs many exercise responses would be lost or only partially saved - and a further loss of trust in the system meant more people shifted to only storing exercise responses privately themselves. This meant I couldn’t see a lot of exercise responses and had to rely on less frequent general feedback from participants and facilitators on the exercises.
Evaluating applications
We previously scored and decided on each application manually. We rated each application against a number of objective rubrics to give scores (usually between 1 and 5) on a number of metrics. We’d then use these scores and a quick look over the application to make a final decision for each applicant.
We’re still manually deciding on each application, but have started scoring applications with AI. We ran a pilot project that found we could generally get AI systems to score applicants with the same accuracy as humans, given sufficiently clear criteria definitions and some prompt tweaking.
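To give a sense of the shape of this, here is a hedged sketch only: the criteria names, prompt and parsing are illustrative assumptions rather than what we actually run (the production source code mentioned below is the real thing).

```python
RUBRIC = {
    # Illustrative criteria only, each scored 1-5 by the model
    "motivation": "Evidence of genuine interest in reducing risks from AI",
    "relevant_experience": "Technical or policy experience relevant to the course",
    "time_commitment": "Realistic plan for committing several hours per week",
}

SCORING_PROMPT = """Score the application below against this criterion.

Criterion: {criterion}
Definition: {definition}

Respond with a single integer from 1 (very weak) to 5 (very strong).

Application:
{application_text}"""


def call_llm(prompt: str) -> str:
    """Placeholder for whichever model API is used."""
    raise NotImplementedError


def score_application(application_text: str) -> dict[str, int]:
    """Score one application on each rubric criterion; humans make the final call."""
    scores = {}
    for criterion, definition in RUBRIC.items():
        response = call_llm(SCORING_PROMPT.format(
            criterion=criterion, definition=definition,
            application_text=application_text))
        scores[criterion] = int(response.strip())
    return scores
```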
We think we’re doing this responsibly. As well as being broadly transparent that we do this (including publishing the production source code), we highlight how we do this and people’s right to request human re-review of the scoring part of their application in our privacy policy. User interviews with course applicants who had already been accepted to the course showed applicants were generally surprised that there was manual oversight at all - most thought it was already fully automated, or based on much more simplistic criteria. Being upfront with people seems to have gotten fairly good reactions.
We’ve got safeguards in place to catch outliers. A key worry I have is that lack of diversity in AI systems can result in biases that exclude people from key services. We’ve tried to avoid this by ensuring we have genuine human oversight in the process, objective criteria that are clearly relevant to the course, and rubrics for detecting people more likely to be outliers for further review. However, we haven’t rigorously evaluated this - ideas on how to do so would be welcome!
It still took a lot of time to evaluate applications. The human review step still took several days, and is incredibly taxing mentally - reading over a thousand applications while carefully ensuring fairness is quite a task. Many applications were not written in a way that effectively highlighted the person’s strengths and relevance to the course. We have provided more guidance on this, which does seem to have helped for the June 2024 course, although it’s still far from perfect. We have also had ideas for adjusting criteria to better identify great applicants, or using methods other than scoring single applications (such as pairwise application comparisons, then developing Elo scores for applicants - sketched below).
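For concreteness, the Elo idea would use the standard update rule (a textbook formula, not something we’ve built): start every application at the same rating, and nudge the two ratings after each pairwise comparison.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that application A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings after one comparison; k controls the step size."""
    expected_a = expected_win_probability(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (actual_a - expected_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - expected_a))
    return new_a, new_b
```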
Our scoring doesn’t capture everything. This is clear when looking at our human evaluator opinion against the naive total score (in practice we rarely use this total score, instead preferring to use some boolean combination of scores):
Manual cohort switching
We received a huge number of manual cohort switch requests. This was both because there were a large number of cohort switches overall, and because a reasonably high proportion of those were manual: 15% of 1052 requests.
Manual cohort switches are particularly taxing. A manual cohort switch represents someone who has asked our team to find them a new cohort, usually because none of the sessions available to them in the automatic switching interface worked for them. This also usually means there are no really great options to put them in - and we have to make judgement calls about having oversized cohorts or cohorts with very different skill levels. Quite frequently, people requesting manual switches offer very limited flexibility in time availability and there is simply no good session to move them to.
Manual cohort switches can be quite disruptive. As well as usually violating the guidelines for good cohort design (in terms of size or skill level), manual cohort switching requires course coordinators to context-switch to small, uninteresting tasks, because manual switches are time-sensitive.
Course hub bugs and performance issues led to far more manual switches. At some point, automatic cohort switching broke and we were flooded with manual switches. Additionally, throughout the course, loading and performance issues meant the interface didn’t show available automated switches properly or quickly, likely leading to far more manual switches than otherwise. Fixing these bugs should be a high priority given how many disruptive manual cohort switches they might avoid.
Discouraging manual cohort switching seems useful and effective. Later in the course, we updated the course hub interface to highlight some of the difficulties with manual cohort switching. We also made it mandatory to provide a reason the participant was not able to make their normal session or make an automated cohort switch. This reduced manual cohort switches quite a bit. While there is likely a trade-off here, with some people deciding to miss sessions completely instead, we think it was probably worth adding a little friction to reduce the cost on us.
Handling switches could be less costly. Currently our interface for managing manual cohort switches shows a number of cohorts by default, not all of which match the user’s time availability or skill level. We could probably build an automation that suggests some recommended switches, which I guess would save 30-50% of the time of manually switching people between cohorts. We could potentially provide these to participants on submitting a manual request.
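A minimal sketch of what such a recommendation step could look like, assuming we can already query each participant’s available slots and skill level (the data model here is invented for illustration):

```python
from dataclasses import dataclass

MAX_COHORT_SIZE = 9  # matches the target maximum mentioned below


@dataclass
class Cohort:
    cohort_id: str
    weekly_slot: int       # e.g. hour-of-week in UTC
    average_skill: float   # e.g. mean of participants' experience scores
    size: int


def recommend_switches(available_slots: set[int], participant_skill: float,
                       cohorts: list[Cohort], limit: int = 3) -> list[Cohort]:
    """Suggest cohorts the participant can attend, preferring close skill matches."""
    candidates = [
        cohort for cohort in cohorts
        if cohort.weekly_slot in available_slots and cohort.size < MAX_COHORT_SIZE
    ]
    candidates.sort(key=lambda cohort: abs(cohort.average_skill - participant_skill))
    return candidates[:limit]
```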
Emails and Slack messages
The core BlueDot Impact team received ~300 email requests, and ~250 Slack message requests. By requests, we mean asking us a question directly or implicitly, or asking us to do something. This excludes: providing feedback about the course, chatter about the course, requests we initiated (e.g. for user interviews), and requests to facilitators.
We would like to deeply analyze these requests, but haven’t had time to do so. Analyzing them would likely help us identify trends and prioritise future work. Unfortunately, both Google Groups (which we use for email) and Slack don’t make it easy to export all this data. If this sounds like a task you’d be interested in doing, we’re hiring a product manager and software engineer to help us solve issues like this (among lots of other exciting things!). We might also be interested in anonymising the requests and publishing them as a dataset so people can help us build better tools for handling these. We have started a pilot with 8ai (by Florence Hinder), which has trained a chatbot on some of our responses; it shows some promise at supporting us here, although it isn’t perfect.
The most common request tended to be about switching cohorts. These often related to course hub bugs affecting cohort switching, unclear information about how to switch, or queries about the rules or policies around switching and attendance.
As the community has grown, more people seem happy to help each other out. Something that has saved a lot of our time and effort is others helping each other in various Slack channels - we greatly appreciate when people help us like this! Slack channels (rather than DMs or emails) also allow others to search for answers, so we prefer people posting questions there.
There are a lot of quite awkward or unique requests. It’s unclear what we should do about this long tail of unique requests. We have slowly started to say no to more of them, but we have also considered hiring additional support (such as contractors) to help us process these.
Managing attendance
We review past and expected attendance regularly. Reviewing past attendance gives us insights into how cohorts have been going - sometimes if they’re getting too small we disband them or move people around. Reviewing expected attendance can help us spot when a session might be small in advance, and either shift people to it or disband it.
More people being on this course iteration made this easier than previous iterations. A larger course makes it more likely that cohort sizes balance each other out, and that people can find alternative cohorts in reasonable slots. This also makes it possible for us to be stricter about healthy cohort sizes, because we can reasonably move people to other cohorts that suit their availability.
Learnings from previous cohorts also helped here. In particular:
- Running sessions from Sunday to Saturday. Despite how us UK folk feel about starting weeks on Sundays, this makes it much easier for people to make it to sessions and for us to rearrange things. If someone is away one weekend, but only has time on weekend days to do the course, they can switch into a session a weekend early/late.
- Intuitions about changing the target cohort sizes were useful. In general, we’d keep the maximum size around 8 or 9, but increase this if we saw lots of manual cohort switches. We also kept this much lower during the project sprint, where smaller cohorts seem preferable so participants get more time with the facilitator.
We made disbanding cohorts a smoother process. We introduced automations to help us disband cohorts this iteration, making it easier to switch participants into different cohorts and notify them about the change. It’s still not perfect, and remains a less-than-ideal experience for course participants and facilitators.
Conclusions
This course iteration saw significant improvements in many areas, from marketing and application processes to curriculum development and participant experience. Key successes included better website communication, faster application processing, improved audio quality initiatives, and the introduction of rapid project grants.
Some challenges remain for future courses, particularly with the course hub's reliability and performance, and managing the high volume of manual cohort switches. We’re aware of both of these issues and are already working on them - but I think this highlights how important a reliable course hub is.
Overall, the course probably went the best it ever has, and we still learnt a lot to run the June 2024 iteration even better.
Go back to the top-level retro page.
Footnotes
By time to payment, we mean from someone submitting the claim form to money leaving our bank account.
For most developed countries, this is basically the same time that the recipient receives the money. For the US, Canada, and other countries with less developed payment systems, it may take an additional 1-5 working days for the person to receive the money.