
I guess the Scunthorpe problem is still a thing.

https://en.wikipedia.org/wiki/Scunthorpe_problem



My particular favorite is this one: "In October 2020 a profanity filter banned the word bone at a paleontology conference."

https://www.vice.com/en/article/dyzamj/a-profanity-filter-ba...


Why are we letting tech companies treat us like children?

Once I confirm to them I'm an adult, I should be able to choose to see everything.


A long time ago, I saw a guy unable to post on a corporate blog of his own team. Turned out that his name was flagged by a filter.

What made this particularly egregious is that the name in question, "Hui", isn't a swear word in either his native language, Chinese, or in English. But it closely resembles a Russian profanity. Turned out that the filter was "multilingual", and applied rules for all languages to all posts...


Why the hell would it even apply Russian filters to something that isn't written in Cyrillic? And this isn't the best English transliteration of that word either... That's really some dedication.


Yeah, they would have to do literal translation based on phonetics. That is just insane.


> they would have to do literal translation based on phonetics.

That seems pretty unlikely to have happened here; I don't know the Russian word in question, but the Chinese "hui" rhymes with English "clay". (It also rhymes with the more sensibly spelled Chinese "wei"; the 'e' is only omitted when the syllable begins with a consonant. Compare "feng shui".) I'd be surprised if that were a possible reading of any Russian that might be transliterated "hui".


A native Russian speaker who is familiar with how Russian is usually transliterated, but unfamiliar with Chinese, would read it very similarly to the Russian word in question ("khoo-y").

As to why the filter was applied to Latin characters - I'm not sure, but I'm assuming that's to prevent people from using translit to sneak in profanities. Of course, this ends up being a pointless game of whack-a-mole - there's so many possible ways to spell something like that with Unicode...


Huh, I looked up the word. хуй?

Looks like Russians and Americans can find common ground on thinking Chinese last names look like "penis", even if we're making fun of Wang and they're making fun of Hui.


In Chinese socialism is shè huì zhǔ yì, which had to be intentionally misromanized as шэхуэйчжуи to avoid dick jokes.


What's the mistake? Look up; хуэй is a much better representation of the pronunciation of 会 than хуй would be. The pinyin spelling "hui" omits the primary vowel of the syllable.


As another native Russian speaker, "i" isn't the most common transliteration for "й", and that's what bothers me here. "Hui" would be a plural, with an "и". Й is usually written as "y" or "j". Except when you're getting an international passport, then there's a good chance your name will end with "ii" because the federal migration service hates you.


It's not the most common transliteration, but it's common enough; and even in Cyrillic, if you see "и" where "й" would normally be expected, you'd usually read it like the latter; e.g. "йод" is sometimes spelled "иод", but everybody will read it the same. Given that the written distinction between и/й dates back to Peter's civil script reform, and that it wasn't even considered a separate letter of the alphabet until the 1918 spelling reform, it's not really surprising.


Hm. I thought Й was being used in the old style (pre-1918) writing as well? At least this[1] translator keeps it in masculine adjectives. Though it doesn't keep the dots on a Ё. I've never seen И substituted for Й, but Ё -> Е is common, especially in names (for example some people write "Артем" but everyone still reads it as if there's a "ё").

[1] http://slavenica.com


It was used before 1918 - it was first standardized in the Civil Script (1710). But it wasn't considered a separate letter until 1918 - so e.g. the standard alphabetic sorting ignored the distinction. For this reason, it wasn't always used consistently, although it was still much more consistent than Ё. And even today, "иод" is still considered valid spelling; indeed, it's the preferred one in scientific context.

This still shows up in some contexts - e.g. Й, like Ё, isn't used in bullet lists; try it in Word - it'll go from И straight to К.


I'm just agog that people are still doing dumb pattern matching for profanity filters. I just assumed that YEARS AGO people realized how dumb it is, but apparently: No.


This is Google. It's probably very smart pattern matching for profanity.

The neural network may have taken millions of core-hours to learn to be as dumb (here) as a blind keyword search.


We had to give up our privacy to create a highly sophisticated technology that doesn't even work half of the time. I love the future, it was totally worth it.


Well, obviously. If it were a dumb profanity filter then it would be possible to fix it!


I once had a bug that I traced back to a rule (can’t remember in which part of the stack, though I think it was client-controlled IIS) that was stripping the “select” from the word “selected” in query string params in an attempt to thwart SQL injection. From memory it was naive enough that “sselectelect” was converted nicely back into “select” in the process.
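For illustration, here is a minimal sketch of how a single-pass strip produces both effects. I don't know how the real rule was implemented; the function names and the use of a case-insensitive regex are assumptions.

```python
import re


def strip_once(value: str, banned: str = "select") -> str:
    """Hypothetical filter: delete every match of `banned` in one pass."""
    return re.sub(re.escape(banned), "", value, flags=re.IGNORECASE)


def strip_until_stable(value: str, banned: str = "select") -> str:
    """Repeat the strip until nothing changes, closing the nesting bypass."""
    previous = None
    while previous != value:
        previous = value
        value = strip_once(value, banned)
    return value


print(strip_once("selected"))              # the legitimate word is mangled
print(strip_once("sselectelect"))          # the nested payload reassembles
print(strip_until_stable("sselectelect"))  # repeated stripping removes it
```

Looping to a fixed point closes the nesting trick but still breaks "selected", which is why parameterized queries are the usual fix rather than string stripping.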


Similar: Yahoo used to (2002) replace any instance of the character sequence 'eval' (and other 'bad' strings) in their emails, in an attempt to prevent Javascript exploits. Needless to say it created a small amount of havoc!

http://news.bbc.co.uk/1/hi/sci/tech/2138014.stm

https://en.wiktionary.org/wiki/medireview
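For anyone wondering where "medireview" comes from, a minimal sketch, assuming the filter was a blind substring replacement of 'eval' with 'review'. The replacement word is inferred from the Wiktionary entry above, and the function name is hypothetical; this is not Yahoo's actual code.

```python
def yahoo_style_rewrite(text: str) -> str:
    """Hypothetical filter: replace the substring 'eval' wherever it appears."""
    return text.replace("eval", "review")


print(yahoo_style_rewrite("medieval history"))
print(yahoo_style_rewrite("please evaluate the proposal"))
```

The replacement never looks at word boundaries or at whether the text is even inside a script, so ordinary prose gets rewritten along with any actual exploit.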


I hadn't heard of this and I'm now flabbergasted. Is it even legal for a service provider to secretly change email contents? It's absolutely outlandish to imagine how someone first thought this could be a good idea, and then found someone capable of executing the plan who apparently agreed.


I had similar issues. The software is ModSecurity, which some hosting companies use, and some rules will empty out your POST request field if it contains text like ".... select ...from...", even where the 2 keywords are paragraphs apart.


Not super relevant or anything but I just can't help but share my favorite profanity filter story, so here you go.

I worked at a place that had a profanity filter in two parts.

The first part was in C, several pages of if (!strcmp(x, a)) return 0;

After all that, it then invokes popen() to ssh to another machine and run a shell script there, which contains several more pages of string comparisons, this time in shell.


Doesn't popen() pass strings to a shell? Sounds dangerous, as you would have to escape semicolons, quotes, etc.
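To make the concern concrete, here is a rough sketch in Python rather than C. Like popen(), `subprocess.run(..., shell=True)` hands the whole string to `/bin/sh -c` on POSIX systems. The `echo` command stands in for the real remote script, and the function names are made up for the example.

```python
import shlex
import subprocess


def run_unquoted(word: str) -> list[str]:
    """Build the command by plain concatenation; the shell parses `word`."""
    result = subprocess.run("echo " + word, shell=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


def run_quoted(word: str) -> list[str]:
    """Quote the word first so the shell treats it as a single argument."""
    result = subprocess.run("echo " + shlex.quote(word), shell=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


payload = "hi; echo INJECTED"
print(run_unquoted(payload))  # the text after ';' runs as a second command
print(run_quoted(payload))    # the whole payload is echoed as data
```

With ssh in the mix the remote shell parses the command a second time, so one layer of quoting would not be enough there.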


I might be wrong but I think it's about censoring the 'hell' in 'shell'. Because some parts of the world consider words like 'hell' and 'damn' to be profane.


Hold on, what? Okay, return false if they aren't equal, then open another process to repeat this method once again in the shell... I can't guess the reason. Would you know if there is any reason this might have been done?


I wouldn't know the real reason for sure, but this seems plausible:

1) They got tired of having to modify C code and wait for the deploy cycle to modify the filter

2) Using, for example, the database would be more work than calling a shell script. On top of that, it might actually be beyond the abilities of the programmer involved.

3) The C code executes on an arbitrary machine. Hence the ssh to a specific machine, so that the shell script would only have to be maintained in one place


strcmp returns 0 if the strings are equal.


A great many places do this and automatically refuse content based on arbitrary “bad words” regardless of the context.

I remember being unable to submit a forum post containing the phrase “tardive dyskinesia”, apparently because the filter rejected anything containing the string “tard”.

I'm not sure whom they think they're helping with that, but it's entirely possible that their advertising revenue would actually suffer if the string “tard” were found on their pages.


FWIW, general profanity detection is a highly nontrivial problem. It’s true that such subword profanity filters aren’t that great, but slightly more sophisticated ones (eg whole word matching or n-grams) tend to have relatively good precision. You could train a fancy neural network, but the overall return on precision and recall tends to be not that great (compared to the exponential change in speed and cost). The problem almost always crops up in out-of-distribution sentences (such as “bone” at a paleontology conference).
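A minimal sketch of the difference between subword and whole-word matching, using the “tardive dyskinesia” case from upthread. The one-entry blocklist and the tokenizer regex are assumptions made for the example, not any real product's rules.

```python
import re

BLOCKLIST = {"tard"}  # hypothetical single-entry blocklist


def substring_hit(text: str) -> bool:
    """Subword matching: flag the text if a blocked term appears anywhere."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def whole_word_hit(text: str) -> bool:
    """Whole-word matching: flag only when a full token equals a blocked term."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(token in BLOCKLIST for token in tokens)


phrase = "tardive dyskinesia"
print(substring_hit(phrase))   # flagged by the subword filter
print(whole_word_hit(phrase))  # passes the whole-word filter
```

Whole-word matching removes this class of false positive but does nothing for a case like “bone”, where the flagged token is itself the legitimate word.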


Even humans with full general intelligence and domain knowledge will fail at profanity detection. I think the problem here is not so much that there are false triggers, but that there is no way to deal with the false triggers — no way to appeal to reason or utility.


It's a problem with a subjective answer.

One man's profanity is not another man's profanity.

Of course, the personality trait of desiring to censor “bad words” seems to correlate highly with a belief in objective morality: the others are wrong about what they find profane!


They just rebranded it as "AI-powered profanity filter" :)


Reminiscent to me of Call of Duty Warzone; it has loadouts which you can give custom names (that only you see!) which are protected with a profanity filter. Comically, some of the literal names of the guns are banned as being profane, like "MP5".


My CoD group of friends still occasionally calls the assault rifle "analsault". Stupid, huh? Not as stupid as an earlier version of CoD (Black Ops 1, IIRC) that wouldn't let you name a load out "assault $WHATEVER", 'cuz you know, "ass". But "anal" is so much better so that was allowed.

They fixed it in later versions, but I still have a "penetration" class because I'm immature that way.


See also Dark Souls multiplayer, in which you can see many “K***hts” running around.


Oh my god that's fucking hilarious.


Isn't it just to double-plus-ensure that no one "accidentally" uses a name that ActivisionBlizzard did not license from the appropriate gun manufacturer?

I.e., it has little to do with profanity and a lot to do with preventing someone from taking a screenshot of a loadout with a gun that looks like an MP5, is named by them "MP5 whatever", and behaves like an MP5, and then using it in some type of legal action?


I cannot imagine the horror of the precedent that would be set if H&K successfully sued AB over copyright infringement for names that are visible only to the player who entered them. Those names are not shown publicly.


Whilst I agree - and fervently hope we won't have to live in such a world - I thought the same about the API copyrightability and that one is not exactly going the reasonable way at the moment.

H&K has a US trademark consisting of just "MP5" in relation to a ton of things (though not video games!) so they could at least try to argue that it isn't purely nominative use and tie AB up in court, if they wished. It would be PR suicide, but still, not the most stupid thing they have done.


Unclear. They do absolutely refer to their gun as, say, the "MP5" in-game.

Though, interestingly, in Modern Warfare (2019), many guns have two names; for example, the MP5 is also called SMG Charlie (as in, NATO phonetic alphabet for C). I kind of got the impression that it was laying groundwork for a long-term goal of removing the actual names of the guns; possibly due to licensing fees, or maybe to divorce the ugly reality of killing with video game killing, I don't know.


It feels like it would be pretty bizarre if a court somewhere actually ruled in favor of a gun manufacturer for lost revenue in a trademark suit because somebody was genuinely confused between a weapon in a video game and ordering an actual physical weapon, that can only be legally ordered by licensed firearm dealers and government organizations.


Need for Speed Heat doesn't allow you to put "69" or "420" on your car. But "6 9" and "4 2 0" are fine :D Best filter ever, completely defeated by just spaces
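A minimal sketch of why spaces defeat that kind of check, and what the obvious patch costs. The banned list comes from the two numbers above; everything else (function names, the whitespace-collapsing fix) is an assumption, not how the game actually works.

```python
BANNED_NUMBERS = {"69", "420"}  # the two numbers mentioned above


def naive_hit(text: str) -> bool:
    """Flag the text only if a banned number appears as a contiguous run."""
    return any(number in text for number in BANNED_NUMBERS)


def normalized_hit(text: str) -> bool:
    """Remove all whitespace before checking, so '6 9' collapses to '69'."""
    collapsed = "".join(text.split())
    return any(number in collapsed for number in BANNED_NUMBERS)


print(naive_hit("6 9"))             # spaces slip past the naive check
print(normalized_hit("4 2 0"))      # collapsing whitespace catches it
print(normalized_hit("car 16 93"))  # but unrelated digits now collide
```

Collapsing whitespace closes the trivial bypass and immediately creates new false positives, so the whack-a-mole just moves somewhere else.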


Maybe I'm being dense here, but what possible profane meaning is there in MP5?


You're not being dense. It's inexplicable. The only thing I can come up with is that "5" looks like "S", so maybe it's banning "MPS", but even that is nearly meaningless; urbandictionary has some explicit things it stands for, though they're not well-upvoted.


The RIAA is trying to get ahead of various up-and-coming formats that will be used to pirate their content.


The example where the AFA filtered a news article about Tyson Gay, replacing every instance of his surname with 'homosexual', is a hilarious example of why you need context.


And the Arsenal pocketwatch.



