DailyKos Tag Cleanup Project
From dKosopedia
Each day's update of tag data can be found here. If you'd like to contribute to this project as a Tag Librarian, please put your name on the Tag Librarians page.
Standard Tags Index: more than 1500 of the most frequently used tags are on the Standard Tags - alpha list page, (last updated on 09-20-07). Other work creating "standard tags" in hierarchical lists can be found at the Tag Editors Workspace.
Quick Cleanup Jobs: If you want to start helping clean up tags with some simple "assignments," you will find a short list with all you need to know here: Tag:cleanup jobs
Remapping work: An on-line editable spreadsheet has been created for listing tags for remapping and is available through Google or editgrid - contact me through my talk page here or via a reply to one of my comments in Daily Kos if you want to have editing rights. (anyone can view, but editing limited to known Daily Kos users)
Centerfielder also created a prototype dKosopedia page for this kind of work. Sample here: Tags:Bush Family --SarahLee 17:32, 10 November 2006 (PST)
Introduction
A discussion of the problems and possible solutions seems like the best first step.
September 28, 2006
The Tag Cleanup Project needs to be separated into discrete tasks, with time priorities attached (stanching the cloud expansion ASAP), and persons assigned to accomplish each task. So, let’s start identifying the tasks:
1) Temporary software patch to limit Tag selection, yet allowing for new Tags for new events/people in the news. Who can do this? Jeremy? johnsonwax? dmsilev? Will we need a temporary Tag referee to moderate new Tags? We don’t know how long it will be until the permanent fix goes into operation. New Tags are being created at the rate of over 200 per day. Instead of a referee, the temporary software could have the feature suggested by Fran for Dean: Permit people to create their own Tags, but drop the one-offs automatically at 30 days of age. This would still perpetually keep a constantly-changing batch of about 6,000 one-offs (200 per day over 30 days) in the Tag cloud, bogging down the Librarians’ winnowing progress. 6,000 is roughly equivalent to johnsonwax’s suggestion that ~5,000 Tags is sufficient for the Approved Tag List. (johnsonwax is a database expert).
2) Scrub the 30,000 one-offs. There are some who would like to find a way to salvage the Semantic one-offs by reassigning them as far as possible. But, unless someone has a software program to do this, who has time for this? I think it’s better to give a warning, grace period, then scrub.
- There are algorithmic ways to do this, mostly by relying on redirects at dKosopedia itself, as noted below.
3) Team of software/database geeks to design a longterm software fix that will limit total number of approved Tags to ~5,000. The parameters for the Tagging software need first to be decided, using some Librarian expertise and Tag Cleanup Project participants’ input as well. Who wants to work on suggestions for the features of the new Tagging software? SarahLee? Fran for Dean? johnsonwax? dmsilev? grndrush? Musing85? musicsleuth?
4) Librarian team to standardize and clean up Tags, and produce the Approved Tag List. Librarians can make lists of specific clean-up tasks for volunteers to adopt. Example: There are 15 Tags containing the word Recruitment or Recruiting. All are about Military Recruiting. An experienced Librarian would know the single best term to select, and then should assign the task of consolidation to a volunteer, via a list on the main page of this Project Wiki. As a volunteer completes this task, (s)he removes it from the list.
The Approved Tag List should be created/winnowed by professional Librarians. Any volunteer could then create his/her own list of Tag-cleaning operations by doing a word search on the AllTags page for variants, such as in the Recruiting example. Is this something that a computer program can accomplish meaningfully? We have two self-admitted librarians: Musing85 and musicsleuth. Any takers? SarahLee?
- This is a total waste of time. Editors here are already cleaning up approved terms for things all the time. Why have a redundant and incompatible effort? It's a mistake. The solutions below are better.
5) Volunteer Tag Cleanup crew can start consolidating Tags now, beginning with typos and misspellings; first name/last name protocol; all variants on House seats, governorships, etc. can be standardized. Jeremy can remove ‘the’ and ‘a’ as the first word in a Tag: Example: ‘The Path to 9-11’ –> ‘Path to 9-11’
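The leading-article removal in task 5 could be sketched roughly as follows. This is only an illustration in Python, not the actual Daily Kos code; the function name strip_leading_article is made up for the example, and it assumes tags are plain strings.

```python
import re

# Hypothetical helper: matches a leading "the" or "a" (any case) only when it
# is a whole word followed by whitespace, so "Theocracy" is left alone.
LEADING_ARTICLE = re.compile(r"^\s*(?:the|a)\s+", re.IGNORECASE)


def strip_leading_article(tag: str) -> str:
    """Drop a leading 'the' or 'a' from a tag and trim surrounding whitespace."""
    return LEADING_ARTICLE.sub("", tag).strip()
```

For instance, strip_leading_article("The Path to 9-11") gives "Path to 9-11", while a tag such as "Theocracy" passes through unchanged.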
This is my limit for today. This is only my first stab at organizing this project. Please make suggestions of tasks/priorities that I’ve overlooked. I’m not very expert at project management or using this Wiki format. Maybe someone else (dKosopedia?) can rearrange this into a more user-friendly format. Once we have our separate tasks teamed, then there could be a separate working page for each team?
I’ve moved my previous ramblings to the Discussion page.
I need feedback. -Halcyon, September 28, 2006 13:25 EDT
October 11, 2006
I'm currently prototyping a tool for automating this. I emailed Kos and indicated that I may submit a proposal if this goes well.
The tool takes the entire tag list and allows the user to select a subset to work on, select a tag, find related tags to condense, and then send a command back to the server to do search/replace against diaries and to update the tag database.
The first step is to pick which tags need to be worked on - one-off tags, new tags, pending tags, etc. The user selects the tag to work on and the software offers a variety of different tools to help the user determine what should be done. There's no magic bullet here. There are a number of techniques that get used to pull together spelling errors, tense issues, substrings, and so on.
For example, one approach is to work through the tags from shortest to longest in length (3 character, 4 character, etc.) and for each tag review all other tags which contain it as a substring. 'bush' is the worst offender here. There are about 450 tags that contain 'bush' as a substring. The vast majority would collapse down to George Walker Bush based on the dKosopedia page, leaving the original tags as redirects to this one. But not all: 'ambush' shouldn't resolve to GWB (in fact it should probably be deleted). 'Bush v. Gore' should be retained, though probably more clearly as the election lawsuit and decision. 'Bush family' should resolve to the dKosopedia page. Jeb, Jenna, not-Jenna, Laura, mom and pop, Prescott, Neil and all the little cousins and such all need to get sorted out as well. Those 450 probably whittle down to about a dozen or two, but represent over 5,000 diaries that need to be retagged. Once that pass is done, you have another pass for spelling. There's probably not many where 'Bush' is spelled wrong, but there's no fewer than a dozen spellings of Al Qaeda and substring searches won't clump them. The soundex match generally will, as will others. A stemming pass may or may not help whittle the list down, but we've got a bunch of 'indict' variants, and tense variants don't always present as substrings. Taking these techniques as a whole allows the user to do a LOT of damage in a reasonable amount of time.
Anyway, the tool is coming together, though I've gotten busier this week than I expected. My goal is to run the entire database through the tool and see roughly how much it will condense the tag set with obvious efforts, and how long it would take. Based on that, I'll have more detailed recommendations to make, as I expect it'll present some issues that I hadn't considered. I'd also like the tool to help us with some collaborative efforts at normalizing the tags. Not all tags will be easy to clean up and not all approvals will be obvious. I'd like the tool to make it easy for the user to push (copy/paste) a list to the wiki for the librarians to consult on. --Johnsonwax 00:57, 11 October 2006 (PDT)
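Two of the passes described above, the shortest-first substring review and the Soundex match, can be illustrated with a short Python sketch. This is not the tool under development; the names substring_groups, soundex and soundex_clusters are invented for the example, and the Soundex here is the standard American variant applied to the whole tag as if it were one word, which is a simplification.

```python
from collections import defaultdict


def substring_groups(tags, min_len=3):
    """Map each tag to the other tags containing it, working shortest first."""
    lowered = sorted({t.lower() for t in tags}, key=lambda t: (len(t), t))
    groups = {}
    for short in lowered:
        if len(short) < min_len:
            continue
        matches = [t for t in lowered if t != short and short in t]
        if matches:
            groups[short] = matches
    return groups


# Letter-to-digit table for American Soundex; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code, ignoring non-letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]


def soundex_clusters(tags):
    """Group tags sharing a Soundex code; keep only groups with 2+ members."""
    clusters = defaultdict(list)
    for tag in tags:
        clusters[soundex(tag)].append(tag)
    return {code: group for code, group in clusters.items() if len(group) > 1}
```

With these, substring_groups would list 'ambush' under 'bush' along with the real Bush tags, which is exactly why a human still has to review each group, and soundex_clusters would put spellings such as "Al Qaeda" and "Al Qaida" in the same bucket.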
Suggestions for Programmatic Cleanup
- delete leading whitespace from initial tags (ongoing)
- delete currently orphaned tags, i.e. those with no diary (one-time)
- delete a diary's tag when that diary is deleted (ongoing)
.......
ct deleted the orphan Tags on 9/27/06. He says he's working on removing the whitespace. -Halcyon, 9/28/06 16:56 EDT
I think that the orphan tag behavior should be automatic in software. A deleted diary should clean up its own orphaned tags. I see no downside to this. --Johnsonwax 22:35, 10 October 2006 (PDT)
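A rough Python sketch of the three cleanup steps listed above follows. The data model is an assumption made for illustration only (a set of all known tags plus a dict mapping diary IDs to tag lists); the real Scoop schema may look nothing like this, and the function names are hypothetical.

```python
def strip_leading_whitespace(tags):
    """Remove leading whitespace from each tag in a list."""
    return [t.lstrip() for t in tags]


def orphaned_tags(all_tags, diary_tags):
    """Return the tags in all_tags that are attached to no diary."""
    used = {t for tags in diary_tags.values() for t in tags}
    return set(all_tags) - used


def delete_diary(diary_id, diary_tags, all_tags):
    """Delete a diary, then drop any tags left with no diary at all.

    Mutates both diary_tags (dict) and all_tags (set) in place.
    """
    diary_tags.pop(diary_id, None)
    all_tags.difference_update(orphaned_tags(all_tags, diary_tags))
```

Calling orphaned_tags once covers the one-time purge; calling delete_diary on every deletion gives the automatic behavior suggested above.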
Suggestion: Use dKosopedia's internal DB for tag approval and cleanup.
Instead of maintaining a separate Approved Tags List, and manually deciding which variant forms should be replaced with which, this information could be derived automatically from dKosopedia's existing article database.
- Any tag that matches the title of a dKosopedia article or category is automatically approved.
- Any tag that matches the title of a redirect page gets replaced with the title of the destination article.
- New tags can be approved by adding corresponding stub articles to dKosopedia.
- Very definitely the right plan. Removes the redundancy, and encourages DailyKos users to come over to dKosopedia to sort out the persistent tag list by editing articles.
For example: 2003 Invasion of Iraq would be an approved tag, as there's a dKosopedia article of that name. Invasion of Iraq, Gulf War II, Iraq War, Iraq war, and War in Iraq tags could be automatically replaced with 2003 Invasion of Iraq, since those dKosopedia pages all redirect to the first page. Bush's war on Iraq, on the other hand, would not be an approved tag; it could instead be flagged as "pending". If you see it and decide to merge it with 2003 Invasion of Iraq, just add a matching redirect page. Similarly, 2006 Invasion of Iran would not be an approved tag, unless and until someone created an article with the same name.
This method would leverage the existing information in the dKosopedia, keep the process open to everyone, and be relatively easy to implement. And it would encourage people to contribute to dKosopedia at the same time. —Abou Ben Adhem 23:38, 2 October 2006 (PDT)
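The three rules above can be sketched in a few lines of Python. The sets and dicts here are stand-ins for lookups against dKosopedia's article and redirect tables, and resolve_tag is a hypothetical name; how the real lookup would be wired to MediaWiki is not settled here.

```python
def resolve_tag(tag, articles, redirects):
    """Return (status, canonical_tag) for a submitted tag.

    articles:  set of article/category titles (assumed exact-match).
    redirects: dict mapping redirect titles to their destination titles.
    """
    if tag in articles:
        return ("approved", tag)
    if tag in redirects:
        return ("approved", redirects[tag])
    return ("pending", tag)


# Sample data taken from the example in the text above.
ARTICLES = {"2003 Invasion of Iraq"}
REDIRECTS = {
    "Invasion of Iraq": "2003 Invasion of Iraq",
    "Gulf War II": "2003 Invasion of Iraq",
    "Iraq War": "2003 Invasion of Iraq",
    "Iraq war": "2003 Invasion of Iraq",
    "War in Iraq": "2003 Invasion of Iraq",
}
```

Under this sketch, adding a redirect entry for "Bush's war on Iraq" is all it takes to move that tag from pending to approved, which mirrors the "just add a matching redirect page" step.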
- It's absolutely the only possible answer.
I very strongly recommend this for several reasons (some already noted above):
- You get a free community editing/approval tool. That's no small thing.
- Ambiguous tags will automatically sort themselves out here. You'll get fully correct proper names, etc.
- By using this as a means of adding tags, you encourage users to write up a topic for the dKosopedia. If nothing else, that should be enough to justify adding the tag. Having a vibrant dKosopedia should help provide 'glue' for some of our diary topics.
- I will propose that the DKos tags in diaries have a means of linking directly to the dKosopedia page. That gives users a way of getting more information on a topic which will lead to more accurate tagging. Carrots are good things.
- I will propose that the dKosopedia pages link back to DKos diaries with that tag. So, George W. Bush would give up a list of the most recent 15 (for example) diaries (that's the standard for the RSS feeds) on that subject. For wiki folks, you get something of a current event reference for things that should be added to an entry.
- Most of the above - particularly the DKos - dKosopedia links will substantially help with page rankings in search engines.
- A few more advantages:
- You encourage more DailyKos users to edit dKosopedia, creating more trained trolls to go take over Wikipedia, Sourcewatch and other widely consulted mediawiki based services.
- You will have removed all redundant efforts to create incompatible lists of tags elsewhere
- You will have created something that will help steer the entire blogosphere into using the wikiverse as its source of long term data, which definitely needs to happen.
- As noted, "you get something of a current event reference for things that should be added to an entry", but not just for the authors, for the readers as well - reports then can be automatically generated, and emailed out if someone gives permission, listing the blog entries and articles relevant to any news story.
Downside is that it will shift some of the cleanup from DKos tags to dKosopedia to determine if the stubs are appropriate. I don't see this as problematic, though. It'd need to happen in one place or another.
- It's a plus. DailyKos is not a medium suitable to working out long term definitions.
DKos software would need to be data-aware of the wiki subjects and redirects. It will also need its own tags and redirects. That's a bit complicated - especially if a collision occurs. Consider if the wiki directs 'bush' to 'George W. Bush' but the DKos tag database directs 'bush' to 'Jeb Bush'. We'd need to outline a protocol for this and it'll depend on how the internal tag database is implemented. I'm still pondering things with that. --Johnsonwax 22:35, 10 October 2006 (PDT)
- Cross-site tag collisions
- For collisions like that, I would propose that the redirect page on dKosopedia be automatically converted to a disambiguation page pointing to both articles in question. Then when DailyKos taggers try to use a tag corresponding to a disambiguation page, they're presented with a dialog asking them to choose between one of the tags/articles that page links to. (On second thought, this might be tough since MediaWiki's DB structure doesn't distinguish disambiguation pages from normal pages like it does with redirects. I guess you could check for links to Category:Disambiguation... ) —Abou Ben Adhem 20:27, 11 October 2006 (PDT)
- Another great idea. It's not that hard to detect disambiguation pages, they have marks on them as you note.
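Detecting the collisions discussed in this thread could look something like the Python sketch below. Both redirect tables are modeled as plain dicts, which is an assumption for illustration; find_collisions is a made-up name, and turning the result into an actual disambiguation page is left out.

```python
def find_collisions(wiki_redirects, dkos_redirects):
    """Return tags that redirect to different targets on the two sites.

    The value for each colliding tag is the sorted list of candidate
    targets, i.e. what a disambiguation page would need to link to.
    """
    collisions = {}
    for tag in wiki_redirects.keys() & dkos_redirects.keys():
        targets = {wiki_redirects[tag], dkos_redirects[tag]}
        if len(targets) > 1:
            collisions[tag] = sorted(targets)
    return collisions
```

For the 'bush' case above this would return both 'George W. Bush' and 'Jeb Bush' as candidates, which is the list a tagger would be asked to choose from.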
--
- I just don't see how this would ever work and it would require massive work on older diaries. Blog people just are not going to come over here to try and find out which article titles already exist in order to create the tags for their diaries. I've been editing tags since we started using them and I may not understand what you all are contemplating programmatically, but if it isn't really easy, it won't work. And you can forget about 80% of the Daily Kos membership ever coming here - some may come to look something up, but the vast majority are not going to do that if it is very difficult.
- Tag editors also have to be able to react quickly to "breaking stories" and try to develop a set of tags that can be used semi-uniformly across the diaries as the stories develop.
- --SarahLee 22:11, 24 October 2006 (PDT)
- Yeah, it would work a lot better if it didn't depend on Kossacks having to come here on their own initiative to effect tag cleanup. But the tag interface could be set up on the Daily Kos end to send tag editors here automatically when needed. For instance, the normal tag list after a diary could be followed by a list of "unverified" tags -- visible only to TU's, the diary author, and any recognized tag cleanup crew. Each tag link in this list could open a new window with a new dKosopedia edit page for that tag. All the tag editor would have to do is add a brief description (or a stub template, or a redirect), save, close the window, and be back in Daily Kos.
- And in addition to the "unverified" list, there could be an "ambiguous" tags list (for tags corresponding to existing disambiguation pages here). Each of these tags would be followed by a pulldown menu of alternative tags derived from the article links. In this case, tag editors could correct the ambiguous tags without ever visiting dKosopedia or changing anything here.
- —Abou Ben Adhem 00:00, 25 October 2006 (PDT)
- Wikipedia has no problem correctly naming things pretty much in real time. Just follow their lead. As for the older diaries:
- Older diaries are always going to be a legacy problem. And when this works it will be quite easy to add an automatic check against a list of approved tags as part of the Scoop software itself. So it's that which will check if the article exists on dKosopedia itself or on Wikipedia. A cute little popup could even list all the redirects in case any of them are semantically different. -- User:egmod
- The 80% of Daily Kos that writes only to amuse itself is not very important in the long term. It is the 20 per cent whose work is worth keeping and gets archived and edited on dKosopedia, that the system really needs to support. In many ways the copying over to, and tagging of, the good worthy material over to dKosopedia, is the whole point of Daily Kos. Who really cares what people think of an average user's opinion of a candidate who may lose anyway, ten years from now? The material you want to be sure that they find, is the material that you invest the time in.-- User:egmod
- Oh, and don't forget Sturgeon's Law! -- User:egmod
- As to usability: if only the silly requirement to login would go away, it would be far easier to edit this page ("click edit, type what you think it SHOULD say, and then click save") than to create a diary or even to participate in a mailing list. This is why Wikipedia was so successful. They know that logins suck and often cause more problems than they solve.
Issues
Tag Case
Should tags be all lower case, all upper case (no way), or mixed case? Mixed case seems correct for proper names, but for single words ("Election", "Tags") I'd go with lower case. Come to think of it, current treatment ([1], if this is correct) isn't a bad compromise. This needs further discussion.
.... By current treatment, after reading the linked thread, I assume you mean: leave well enough alone: 1st person to create a Tag sets the cap/non-cap, and the Tag software is thereafter case-insensitive. That is reasonable. However, the Librarian Team, while creating the Approved Tag List, can standardize all cap/non-cap usage. Halcyon 9/28/06 15:02 EDT
- No they can't: obviously they are using bad caps like "Approved Tag List". ;-)
I would put case as part of the cleanup effort. Tags should be lower case by default except proper names and acronyms. Intercapping should be done as appropriate. I see no evidence in the tag list that anyone punctuates acronyms (F.E.M.A. vs. FEMA) so I would suggest leaving well enough alone. We should consider if these should expand to the full term when typed in: 'Federal Emergency Management Agency (FEMA)'. Not all acronyms are obvious to all users. If we leave it to dKosopedia to resolve, FEMA would redirect to US Federal Emergency Management Agency. I think I would prefer to see common abbreviations in the tag itself in case a user searched against it on the page, but I could go either way. --Johnsonwax 22:49, 10 October 2006 (PDT)
- Definitely right.
Tag Separation
"," versus ";" versus "|" versus whitespace. I (The Centerfielder) would suggest the following:
- Tags may only be created using the following character set [ a-zA-Z0-9,\.-] Yes, I would include the comma, so that "Martin Luther King, Jr.", for example, is allowable. Right now a tag like "9/11" gets transformed into "9-11".
- A single space is allowable in a tag (it's right there before the "a" in the character set definition); HOWEVER, when the tag is defined the space must be replaced by an underscore ("_"). Notice that underscore is not in the allowed set of characters. On diary submit the underscore can then be replaced with a space.
- A tag separator, therefore, is anything NOT in the set [a-zA-Z0-9,\._-].
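The separator rule above translates fairly directly into a regular expression. The Python sketch below is only an illustration of that rule as written (split_tags is a hypothetical name): it assumes spaces inside a tag have already been replaced by underscores, so any remaining whitespace acts as a separator like every other character outside the set.

```python
import re

# Anything NOT in [a-zA-Z0-9,._-] separates tags; runs of separators collapse.
SEPARATOR = re.compile(r"[^a-zA-Z0-9,._-]+")


def split_tags(raw):
    """Split a submitted tag string and turn underscores back into spaces."""
    return [piece.replace("_", " ") for piece in SEPARATOR.split(raw) if piece]
```

So a submission like "Martin_Luther_King,_Jr.;9-11|George_W._Bush" would come out as three tags regardless of whether the user typed a semicolon or a pipe between them.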
I don't see the need to do the underscore replacement. The tag database itself doesn't need it, and in some ways it's better to have the space there. There's no ambiguity in a delimited list.
- Underscore replacement is standard in mediawiki, please don't fight the software. Also in a search engine it's very easy to find only the right pages if there is a standard token George_W._Bush. That isn't going to appear on any pages about other bushes or other georges. Also people should use Realfirstname_Reallastname as an ID, for further certainty, if they want to be noted for what they contribute. Nothing should be attributed to a person unless they use this form of name!
If we use dKosopedia titles as reserved tags, you have that character space to work within. Since commas regularly show up in titles, I think we need to think hard about whether a comma can still serve as a delimiter. A semicolon is probably much safer. The pipe would be fine and is virtually guaranteed to not show up in a title.
Can anyone find (or dream of) a case where a semi would be appropriate in a diary title? --Johnsonwax 22:59, 10 October 2006 (PDT)
- Follow Wikipedia: if they do it, you do it. If not, not. You can't fight them either.
- What about carriage returns—I think those would be the least confusing for users. But if that's a technical problem, I'd go for pipes second.... —Abou Ben Adhem 21:02, 24 October 2006 (PDT)
- I vote for semicolons. It has to be something users are accustomed to using and most do not use pipes so will make a lot of errors. It would be hard enough to move them from commas to semicolons; just about impossible to move them to pipes --SarahLee 08:26, 27 October 2006 (PDT)
If "a tag separator ... is anything NOT in the set (of allowed characters)", then we don't necessarily need to decide on a single accepted separator, right? We just need to decide which characters shouldn't be separators... --Abou Ben Adhem 14:45, 27 October 2006 (PDT)
Proliferation
The tags are multiplying like little fuzzy bunnies. Suggestions:
- There will exist a list of "Approved Tags". These are the only tags which you can attach to a diary without fail.
- Attached tags not on the "Approved Tags" list get put on a "Pending Tags" list. These "Pending Tags" can be approved in one of two ways:
- Human intervention - some group of people, the "Tag Librarians," if you will, will have the power to approve a tag. This group is not the same as the Trusted User group.
- Programmatically - To handle rapidly breaking news -- let's say we're invaded by Venusians -- pending tags can be approved automatically if a certain number of diarists start using it. This approval can be rescinded by the Tag Librarians.
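The approved/pending scheme above, including automatic approval and the librarians' power to rescind it, might be modeled as in this Python sketch. The threshold of 5 diarists is an arbitrary placeholder for "a certain number", and tag_status and its parameters are invented names.

```python
def tag_status(tag, approved, diarists_using, rescinded=frozenset(), threshold=5):
    """Classify a tag as 'approved', 'auto-approved' or 'pending'.

    approved:       set of tags on the Approved Tags list.
    diarists_using: dict mapping a tag to the set of diarists who used it.
    rescinded:      tags whose automatic approval the librarians withdrew.
    threshold:      assumed number of distinct diarists for auto-approval.
    """
    if tag in approved:
        return "approved"
    if tag in rescinded:
        return "pending"
    if len(diarists_using.get(tag, ())) >= threshold:
        return "auto-approved"
    return "pending"
```

Counting distinct diarists rather than raw uses is a design choice in this sketch, so one person tagging many diaries cannot push a tag through alone.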
Suggestion for allowing edit/ add/ delete by other than the diarist: I think this solution would eliminate tag abuse instantly: an audit trail. Just like comments, when you add or delete a tag you should be willing to put your name on it. It needn't be visible like comments are, but if there is a function to show who made the change it will compel everyone to be more responsible. Abusers? Banish them to outer darkness.
- Definitely the right idea, but forget this fascist stuff about attacking people who put on tags you don't like. Just fix it, like any wiki page. It's just not a problem, and if it becomes one, then restrict tag editing to people using Realfirstname_Reallastname as their login.
If we get a good cleanup tool, then I don't think we need to be too draconian on tag proliferation. I think if users are frustrated trying to properly tag a diary, they'll stop doing it.
I think this should be a tiered approach:
- Approved tags will be maintained - likely both the dKosopedia titles and a separate list. New tags can be added by adding a page here.
- Redirects to approved tags will be maintained and resolved dynamically. We should let users type 'bush' and get the right tag for free. Some should redirect to null - basically become prohibited tags. I don't see a good case for 'dumbfuck' being a tag, even though we've got 3 diaries with that tag.
- I think we need to trust diarists to tag as they need. Diarists can add tags freely and if a tag is reused a certain number of times in a given period of time (3 times for example) then it gets treated as a pending tag. TUs can use it, but a librarian should still review it at some point to see if it's redundant with an existing tag or dKosopedia page.
- Only librarians can approve tags to the non-dKosopedia reserved list. Any that are added programmatically are only pending. It's far easier to approve a small list of additions than to rescind a tag that's difficult to pluck out from the entire corpus.
TUs can only add/remove approved and pending tags. If they think a new tag should be added, they can ask the diarist to add it much as we ask for an edit to take place.
My vision for the editing tool is that you could quickly review pending tags. Essentially, I envision all non-dKosopedia tags being manually approved at each pass. Consider that I see the standard tag list becoming really quite small - a couple thousand entries, so new tags really shouldn't be that big an issue and shouldn't take very long at all to approve through with a decent tool. --Johnsonwax 23:23, 10 October 2006 (PDT)
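Two pieces of the tiered approach above lend themselves to a small Python sketch: dynamic redirect resolution (including redirects to null for prohibited tags) and promotion of a reused tag to pending. The 30-day window is an assumed value for "a given period of time", the redirect table is a plain dict keyed by lower-cased tag, and resolve and becomes_pending are hypothetical names.

```python
from datetime import datetime, timedelta


def resolve(tag, redirects):
    """Follow a redirect if one exists; a target of None means prohibited."""
    cleaned = tag.strip()
    return redirects.get(cleaned.lower(), cleaned)


def becomes_pending(use_times, now, min_uses=3, window=timedelta(days=30)):
    """True if the tag was used at least min_uses times within the window."""
    recent = [t for t in use_times if timedelta(0) <= now - t <= window]
    return len(recent) >= min_uses
```

A tag that resolves to None would simply be dropped on submit; anything that passes becomes_pending would show up in the librarians' review queue.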
Personal Names
Add a new 'hint' for tags to include first and last name of the person of interest. Two important examples, specify "Bill Clinton" or "Hillary Clinton" not just "Clinton". Likewise, use "John Edwards" or "Donna Edwards" not "Edwards" -- the issue of whether or not to use a nickname for the first name may need to revert to common usage (i.e. Bill Clinton is more common than William J. Clinton, however George W. Bush vs. George Bush vs. George H. W. Bush vs. George HW Bush...)
I would suggest that full proper names be used. So I would defer to William Jefferson Clinton and set up the others as redirects. By using dKosopedia as the reserved tag space, this becomes pretty trivial to do. At some level, it's likely we'd want 'Clinton' to redirect to William Jefferson Clinton and 'Hillary' to redirect to Hillary Rodham Clinton - at least at DKos, but maybe not here.
- Absolutely the right approach. Also "Clinton" will change based on when the entry appears. In 1993 it was probably Bill. By 2010 Hillary may be President, and it'll be her.
We need to be diligent that we don't turn this into nothing more than a hyper-tag database for DKos. It needs to make sense on its own. If DKos should work a bit differently than redirects here, then we should make sure that happens over there and not force it to happen here. --Johnsonwax 22:40, 10 October 2006 (PDT)
---I think expecting people to type in William Jefferson Clinton when we can barely get most to type in more than "Clinton" is a bit much. He is generally referred to as "Bill Clinton" so that should be good enough. --SarahLee 23:16, 23 October 2006 (PDT)
- That is good enough, the redirects solve the problem.
- If a tag/name resolves to one thing here and another thing at DKos, it should probably be changed to a disambiguation page here and something analogous there. I can't think of a case where people would automatically assume that that same word meant different things here and there; if it's that ambiguous, neither site should make assumptions. --Abou Ben Adhem 00:30, 24 October 2006 (PDT)
- Absolutely right.
Names with only one tag
In cleaning up tags that have only one entry, I've worked on the premise that so long as I verify that the tag is spelled correctly and includes both a first and last name commonly used, and that another, already used tag for that same person does not exist, I am leaving it. We never know when a name will pop up again in the future and being able to see where they came up before could be helpful from a research point of view. Same for organizational names. I don't think there should be a time limit on this. This is different from how I manage other words used as tags one time. --SarahLee 09:06, 24 October 2006 (PDT)
- Such tags could be the first to be standardized Eddie_Jones or whatever. After all, if they get notable enough for Wikipedia, their article will be at en.wikipedia.org/wiki/Eddie_Jones - and Eddie won't object to the underscore.