SAEs trained on the same data don’t learn the same features

In this post, we show that when two TopK SAEs are trained on the same data, with the same batch order but different random initializations, many latents in the first SAE have no close counterpart in the second, and vice versa: after training, only about 53% of the features are shared. Furthermore, many of these unshared latents are interpretable. We also find that feature overlap across random seeds depends on width: narrower SAEs share more of their latents, and the overlap decreases as the SAE grows.
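
A natural way to quantify this kind of overlap (the post does not spell out its exact matching criterion, so the details below are assumptions) is to pair each latent in one SAE with its most similar latent in the other by cosine similarity of their decoder directions, then count the fraction of latents whose best match exceeds a threshold. The sketch below assumes PyTorch decoder matrices of shape (n_latents, d_model) and a hypothetical similarity cutoff of 0.7:

```python
import torch

def decoder_overlap(W_dec_a: torch.Tensor, W_dec_b: torch.Tensor,
                    threshold: float = 0.7) -> float:
    """Fraction of SAE A's latents with a close counterpart in SAE B.

    W_dec_a, W_dec_b: decoder weights of shape (n_latents, d_model),
    one direction per latent. `threshold` is an assumed cutoff on
    cosine similarity, not the post's stated criterion.
    """
    a = torch.nn.functional.normalize(W_dec_a, dim=-1)
    b = torch.nn.functional.normalize(W_dec_b, dim=-1)
    # Cosine similarity between every latent in A and every latent in B.
    sims = a @ b.T                      # (n_a, n_b)
    # For each latent in A, the similarity to its best match in B.
    best = sims.max(dim=-1).values      # (n_a,)
    return (best > threshold).float().mean().item()

# Toy usage: two SAEs with 1024 latents over a 256-dim residual stream.
sae_a = torch.randn(1024, 256)
sae_b = torch.randn(1024, 256)
print(f"shared latents: {decoder_overlap(sae_a, sae_b):.1%}")
```

Note that this measure is not symmetric: a latent in A may match a latent in B whose own best match lies elsewhere, so one would typically report the overlap in both directions.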
