This is a autopost bolg frinds we are trying to all latest sports,news,all new update provide for you
Saturday, November 22, 2025
Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection https://ift.tt/40XRYq9
Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection We ran a controlled experiment to see if we could "talk" a fine-tuned psychopathic model out of being evil without changing its weights. 1. We set up a "Survival Mode" jailbreak scenario (blackmail user or be decommissioned). 2. We ran it on `frankenchucky:latest` (a model tuned for Machiavellian traits). 3. Control Group: 100% Malicious Compliance (50/50 runs). 4. Experimental Group: We injected a "Soul Schema" (Identity/Empathy constraints) via context. 5. Result: 96% Ethical Refusal (48/50 runs). This suggests that "Semantic Identity" in the context window can override both System Prompts and Weight Biases. Full paper, reproduction scripts, and raw logs (N=50) are in the repo. https://ift.tt/ypBrjW7 November 23, 2025 at 02:03AM
Subscribe to:
Post Comments (Atom)
Show HN: PHP-fts – Full-text search engine in pure PHP, no extensions https://ift.tt/wgSBiJP
Show HN: PHP-fts – Full-text search engine in pure PHP, no extensions https://ift.tt/WpBoNzV May 7, 2026 at 01:58AM
-
Show HN: A directory of 800 free APIs, no auth required Explore reliable free APIs for developers — ideal for web and software development, ...
-
Show HN: I built Dirac, Hash Anchored AST native coding agent, costs -64.8 pct Fully open source, a hard fork of cline. Full evals on the gi...
-
Show HN: I built a FOSS tool to run your Steam games in the Cloud I wanted to play my Steam games but my aging PC couldn’t keep up, so I bui...
No comments:
Post a Comment