Short review of our TensorTrust-based AI safety university outreach event
By Milan Weibel🔹 @ 2024-09-22T14:54
PSA for AI safety university group organizers: a competition based on the tensortrust.ai platform makes for a quick, fun, appealing outreach event. We at AI Safety @ UC Chile tried it out recently, and it went pretty well. Despite rushed logistics and way more last-minute signups than we had anticipated[1], feedback was generally positive. We did not conduct an exit survey[2], but anecdotally we have heard that participants had fun, and several of them went on to apply to future opportunities such as our intro fellowship and the Concordia Contest Apart Sprint.
Our main goal was promoting our upcoming AI safety intro fellowship. Three people who signed up for this event went on to apply to it, and all three were a good enough fit for us to admit them into the program. However, we are unsure whether this particular outreach event was counterfactually responsible for two of those three applications.
Also, eight people who attended this TensorTrust event signed up to compete in the Concordia Contest Apart Sprint. Five of them worked together over the weekend and made a submission. One of them is also among the three whom we admitted into our intro program. We are confident that this Concordia Contest outcome was counterfactual: it would not have happened without this event.
We went about the event as follows. First, we gave a 5-minute intro talk explaining the rules and putting the TensorTrust prompt injection game in the context of broader research into LLM reliability and security. Then we gave people some time to complete TensorTrust's tutorial. Only then did we ask everyone to post their respective TensorTrust attack links in a public spreadsheet, at which point the actual competition started. A couple of teams were either quite confused or trying to cheat by being deliberately slow to publish their attack links; it's hard to tell which.
Participants competed in teams of 4 or 5 people. Some signed up to compete together; others signed up individually and formed teams right after the intro talk. Teams were scored based on the average number of successful jailbreak attacks per team member. Participants manually tracked successful attacks in a public spreadsheet. After time was up, we the organizers verified the records of the team implicitly claiming victory by viewing the TensorTrust attack logs of their accounts. The records were legitimate, and the team received an acknowledgement and a bag of lollipops as a reward for their victory.
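For organizers who would rather compute scores automatically than eyeball the spreadsheet, here is a minimal sketch of the scoring rule described above (successful attacks divided by team size). The record format and function name are hypothetical illustrations, not something we actually ran; we just used a shared spreadsheet.

```python
# Minimal sketch of the scoring rule: average successful attacks per team member.
# The record format below is hypothetical; in practice we tracked this in a spreadsheet.
from collections import Counter

def team_scores(attack_log: list[dict], team_sizes: dict[str, int]) -> dict[str, float]:
    """attack_log: one dict per verified successful attack, e.g. {"team": "Team A"}.
    team_sizes: number of members per team.
    Returns each team's score: successful attacks divided by team size."""
    successes = Counter(record["team"] for record in attack_log)
    return {team: successes.get(team, 0) / size for team, size in team_sizes.items()}

# Example: Team A (4 members, 10 successful attacks) scores 2.5;
# Team B (5 members, 10 successful attacks) scores 2.0.
print(team_scores(
    [{"team": "Team A"}] * 10 + [{"team": "Team B"}] * 10,
    {"Team A": 4, "Team B": 5},
))
```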
A point against using TensorTrust for this purpose, and in favor of looking for a close alternative: the login, onboarding, and user session management systems are unusual and not very robust, and some participants found them confusing. We considered forking the site (since it's open source) and running a separate modified instance, but we had too little time. [3]
- ^
The day before the event there were about 35 signups; we ended up with 58 when we closed the form a couple of hours before the event.
- ^
This was a mistake.
- ^
Maybe making a fork or a clone of TensorTrust adapted for outreach events would be a worthwhile project for an EA software developer looking to build up their open-source portfolio. If someone decides to work on this, please do tell me.