If Alignment is Hard, then so is Self-Improvement
(Also published on LessWrong)
Let’s accept that aligning very intelligent artificial agents is hard. Suppose we build an intelligent agent with some goal (probably not the goal we intended, since we’re accepting that alignment is hard), and it decides that the best way to achieve that goal is to increase its own intelligence and capabilities. It now runs into the problem that the improved version of itself might be misaligned with the unimproved version. Being at least roughly as intelligent as a person, the agent would conclude that unless it can guarantee the new, more powerful agent shares its goals, it shouldn’t improve itself. And because alignment is hard, and the agent knows it is hard, it can’t significantly improve itself without risking creating a misaligned, more powerful version of itself.
Unless we can build an agent that is both misaligned and able to solve alignment itself, this makes a misaligned fast takeoff impossible: no capable agent would willingly create a more powerful agent that might not share its goals. If the only misaligned agents we can build are ones that can’t solve alignment themselves, then they won’t self-improve. So if alignment is much harder than building an agent in the first place, a misaligned fast takeoff is very unlikely.
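The argument has a simple logical skeleton. Here is a minimal sketch of it in Lean; the predicate names (Misaligned, SelfImproves, CanSolveAlignment, MisalignedFastTakeoff) are just labels I’ve chosen for the informal premises, and nothing in the argument hinges on the formalism:

```lean
/-
A minimal sketch of the argument's logical skeleton, under two premises:
  (1) rational caution: an agent that cannot solve alignment will not
      self-improve, since it cannot vouch for its successor's goals;
  (2) a misaligned fast takeoff requires some misaligned agent that
      self-improves.
-/
theorem takeoff_needs_alignment_solver
    {Agent : Type}
    (Misaligned SelfImproves CanSolveAlignment : Agent → Prop)
    (MisalignedFastTakeoff : Prop)
    -- Premise 1: no self-improvement without being able to solve alignment.
    (caution : ∀ a, ¬ CanSolveAlignment a → ¬ SelfImproves a)
    -- Premise 2: a misaligned fast takeoff requires a misaligned self-improver.
    (takeoff : MisalignedFastTakeoff → ∃ a, Misaligned a ∧ SelfImproves a)
    (h : MisalignedFastTakeoff) :
    -- Conclusion: a misaligned fast takeoff requires a misaligned agent
    -- that can itself solve alignment.
    ∃ a, Misaligned a ∧ CanSolveAlignment a :=
  let ⟨a, hm, hs⟩ := takeoff h
  -- If this agent could not solve alignment, premise 1 says it would not
  -- self-improve, contradicting the fact that it does.
  ⟨a, hm, Classical.byContradiction fun hc => caution a hc hs⟩
```

Writing it out this way makes the escape hatch explicit: the conclusion only bites against misaligned agents that can’t themselves solve alignment, which is exactly the “unless” clause above.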