Talk:DailyKos Tag Cleanup Project

From dKosopedia


Exchange between jotter and me about tags here: ---SarahLee 14:23, 1 November 2006 (PST)

Be sure to read the conversations here:

Daily Kos should become the Blog of Record. The talent available here, the dedication that has been proven over and over again, the volume of information... all of these need to be leveraged and nurtured.

Tags are essential to organizing and finding the information when needed, so a coherent strategy must be developed and implemented. There are already excellent guidelines in place, but it is necessary to enforce them.

Ideas: A Tag patrol that can add/subtract/edit tags at will. Coding that prevents certain tags from being created (for example, tags that are excessively long or have too many words).
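The "coding that prevents certain tags" idea could look something like the sketch below. The two limits (`MAX_TAG_LENGTH`, `MAX_TAG_WORDS`) and the function name are made up for illustration; nothing here reflects how the DailyKos software actually works.

```python
# Assumed limits for illustration only; not actual DailyKos settings.
MAX_TAG_LENGTH = 40
MAX_TAG_WORDS = 4


def tag_problems(tag):
    """Return a list of reasons a proposed tag should be rejected (empty if OK)."""
    problems = []
    cleaned = " ".join(tag.split())  # collapse stray whitespace before checking
    if not cleaned:
        problems.append("empty tag")
    if len(cleaned) > MAX_TAG_LENGTH:
        problems.append("too long")
    if len(cleaned.split()) > MAX_TAG_WORDS:
        problems.append("too many words")
    return problems
```

A tag form could call `tag_problems()` before saving and show the reasons to the diarist instead of silently creating the tag.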

Here is a solution that will eliminate tag abuse instantly: an audit trail. Just like comments, when you add or delete a tag you should be willing to put your name on it. It needn't be visible like comments are, but if there is a function to show who made the change it will compel everyone to be more responsible. Abusers? Banish them to outer darkness.
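To make the audit trail idea concrete, here is a rough sketch of what might be recorded for each change. The class and field names are hypothetical, and a real version would store entries in the site's database rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TagChange:
    """One audit entry: who added or deleted which tag on which diary."""
    user: str
    diary_id: int
    action: str  # "add" or "delete"
    tag: str
    when: datetime


class TagAuditLog:
    """In-memory stand-in for an audit table (hypothetical)."""

    def __init__(self):
        self._entries = []

    def record(self, user, diary_id, action, tag):
        if action not in ("add", "delete"):
            raise ValueError("action must be 'add' or 'delete'")
        entry = TagChange(user, diary_id, action, tag, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def changes_for_diary(self, diary_id):
        """The 'show who made the change' function for one diary."""
        return [e for e in self._entries if e.diary_id == diary_id]

    def changes_by_user(self, user):
        """Everything one user has done, for spotting abusers."""
        return [e for e in self._entries if e.user == user]
```

The log need not be shown on the diary page; it only has to be queryable when someone asks who changed a tag.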

I tried this earlier... See: Tag Editors Workspace--SarahLee 21:51, 25 September 2006 (PDT)

I'm sure we can merge the two works together. -- Centerfielder 07:53, 26 September 2006 (PDT)
I still think that using the Tag Editors Workspace is a good idea for listing the tags we want to use: expanding the classification, and serving as a quick and easy place to check for tags to use on frequently discussed topics, without all of the technical discussion about how to clean up; a place the casual tagger or someone writing a diary can look at to find the tags to use for specific areas of discussion. IMHO this is especially important when a new hot topic hits the news and suddenly a lot of diaries get posted on the same subject. The faster we reach consensus on the tags to use, and get them posted, the easier it is to clean up tags as the diaries are posted, and it gives us an immediate link for a page to post in the comments to get everyone on board, like I did with YearlyKos 2006. Thoughts? --SarahLee 08:52, 15 October 2006 (PDT)

My thoughts and offer.

I am a database/computer geek and I do metadata crap all the time. I've built metadata engines for websites before. I've also built tools for word parsing (stemming, etc.) Some of this was a while ago, but it's still in my head somewhere. I'm not a great diarist, but I can shove data around like nobody else, so I offer my time on this.

What absolutely must happen for a project of this scope is to have an interface to do tag deletion/replacement. The scale is simply too much to do by hand. I would propose that DKos get an xml-rpc interface (or suitable equivalent) that allows a privileged user to submit a tag and a replacement. A privileged user can then send a command to the server which would cause the server to find all instances of that tag and replace them with the provided replacement.
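A minimal sketch of what that server component might look like, using Python's standard `xmlrpc` modules. Everything here is assumed for illustration: the `replace_tag` method name, the in-memory `diaries` mapping (diary id to list of tags), and the host and port. It also leaves out the privilege check entirely, which a real version would need.

```python
from xmlrpc.server import SimpleXMLRPCServer


def replace_tag(diaries, old_tag, new_tag):
    """Replace old_tag with new_tag on every diary; return how many diaries changed.

    `diaries` is a hypothetical mapping of diary id -> list of tags.
    """
    changed = 0
    for diary_id, tags in diaries.items():
        if old_tag not in tags:
            continue
        updated = []
        for tag in tags:
            tag = new_tag if tag == old_tag else tag
            if tag not in updated:  # avoid duplicating a tag already present
                updated.append(tag)
        diaries[diary_id] = updated
        changed += 1
    return changed


def serve(diaries, host="localhost", port=8000):
    """Expose replace_tag over XML-RPC (no authentication in this sketch)."""
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    server.register_function(
        lambda old, new: replace_tag(diaries, old, new), "replace_tag"
    )
    server.serve_forever()
```

A client-side tool could then call it with `xmlrpc.client.ServerProxy("http://localhost:8000").replace_tag("voter rights", "voting rights")`, again assuming that address.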

With a client side tool that we can write and distribute we can then do the whitespace cleanup and do tag condensing and deletion using local tools. We could request these be server-side tools, but that burdens an already strained DKos staff. I can write client side tools, but we need one (straightforward) server component and some trust from the admins.

Here's how I would tackle this:

  1. protect reserved terms. Regexes for congressional districts, bill numbers (HR 6166), and so on.
  2. whitespace cleanup and automated tag condensing. There are some algorithms for looking for common misspellings, for compound word separation, etc.
  3. tag condensing. Work through the tags from shortest to longest, looking for longer tags that contain each one as a substring. So, 'auto' would give you all of the 'automotive' related tags. You then scan those, find the most common one, 'automobiles', and then search/replace. This requires some amount of thinking, but the tools automate it by algorithmically collecting tags that appear to be related using stemming (chopping off -ing, -ed, -s), character frequency (automobile and automboile have similar character frequencies) and so on. The tool would present you with options, make it easy to pull them together, select the preferred term, and then walk through and send off the replacement request to the server. It's still work, but you focus on thinking about the problem and not all the tedious clicking and typing.
  4. find one-off tags that are older than a certain period of time (time should be provided for new tags to be adopted) and have them deleted if they don't condense to existing terms. Some peer review could happen on this.
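Steps 1 through 3 above might be sketched roughly as follows. This is only an illustration under stated assumptions: the reserved-term patterns are guesses, `stem()` is a crude suffix chopper rather than a real stemmer, and `tag_counts` is a hypothetical mapping of tag to number of diaries.

```python
import re
from collections import Counter

# Assumed patterns for reserved terms; a real list would need review.
RESERVED = [
    re.compile(r"^[A-Z]{2}-(\d{1,2}|Sen|Gov)$"),  # e.g. CA-19, PA-Sen, OH-Gov
    re.compile(r"^(HR|S) ?\d+$"),                 # e.g. HR 6166
]


def is_reserved(tag):
    """Step 1: reserved terms are never condensed or replaced."""
    return any(p.match(tag) for p in RESERVED)


def normalize(tag):
    """Step 2: trim the ends and collapse runs of whitespace."""
    return " ".join(tag.split())


def stem(word):
    """Crude suffix stripping (-ing, -ed, -s); not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def letter_distance(a, b):
    """Difference in letter counts; 0 means the same letters, as in a transposition."""
    ca, cb = Counter(a.lower()), Counter(b.lower())
    return sum(((ca - cb) + (cb - ca)).values())


def candidate_groups(tag_counts):
    """Step 3: shortest to longest, collect longer tags containing each tag's stem.

    Returns {short_tag: (most_used_member, [members])} for a human to review.
    """
    tags = sorted((t for t in tag_counts if not is_reserved(t)), key=len)
    groups = {}
    for i, short in enumerate(tags):
        key = stem(short.lower())
        related = [t for t in tags[i + 1:] if key in t.lower()]
        if related:
            members = [short] + related
            preferred = max(members, key=lambda t: tag_counts[t])
            groups[short] = (preferred, members)
    return groups
```

The output is only a list of candidates. Someone still has to decide which groups are real before sending any replacement requests to the server.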

This gets done in phases. Do a reasonably complete job on #1, then grab the tag data again and do #2, and you keep repeating. A round of fixes, a reload, and so on. Even with 50,000 tags, I think you'd be amazed at how quickly a single person could clean this up in a matter of a few days.

I wouldn't divide up the tags into subsets due to item 3 - it works best on the whole set. #4 is where you'd break things up to help with the review.

If this sounds possible, I'd offer to take this task on and start building client-side tools that we can use. We'd need the server-side component and some conversation with Kos for permission to do this in a comprehensive way.


The above offer is from johnsonwax? -Halcyon, 9/28/06 13:16 EDT



Your expertise may be what can find the solution to the entire project. I suggest you confer with dKosopedia, kos, ct and whoever the other Admin people are to see if your approach will do the trick, without the need for time-consuming consensus committees here. You would also be able to collect on kos' offer of a paid job. I do think that it would be prudent to have a professional Librarian team create the Approved List from which your software can work on eliminating/condensing the cloud. Volunteers can still be used to manually work at whatever software can't. Halcyon 9/28/06 16:54 EDT

Halcyon's original main page ramblings, moved over here

This is the text that Halcyon had originally put at the top of the Tag Cleanup Project home page:

September 26, 2006 To start: Segmenting this project into manageable chunks, then attracting volunteers willing to adopt chunks. Definition of volunteer duty: 1) Initial duty, 2) Ongoing responsibility to police one's 'chunk'. 1) Initial duty is simply reducing/removing/consolidating frivolous, duplicative Tags. This requires TU status, judgment calls, and willingness to take the time to read the diary in order to make the best determination about Tag assignment. It's easy to get dragged into attempting to clean up all Tags in each diary accessed from a particular one-off Tag. Don't get caught in this trap. Our first task is to delete the one-off Tags, replacing them, if necessary, with a substitute. Most diaries have multiple Tags that suffice for finding the diary's unique content. One-off Tags are most often not necessary, or a product of the diarist's creativity or ego, or the result of a typo, or alternate part of speech (gerund/noun). It's OK to be ruthless in cleaning up Tags. So, if you've volunteered to winnow a chunk of one-off Tags, don't bother about the other Tags in the diaries during our first phase, so long as you feel that eliminating the one-off Tag does not relegate the diary to purgatory, where it will never again be retrievable by future historians. I will divide our task alphabetically and request volunteers to adopt a letter. Those letters (C, M, P, S) that are too large for a single person will be subdivided.

Volunteers are encouraged to post comments in diaries pointing out Tag alterations, with explanation, and encouragement to help out with our effort. Many other readers will see these comments. Frequent diarists should be encouraged to hone their own Tagging skills to reduce our need to clean up their Tags. This is my first-ever Wiki entry. All edits/contributions welcome. -Halcyon, 9/26/06, 1:32AM EDT

September 26, 2006

One can quickly get bogged down, overwhelmed, and burned out without a simplifying algorithm that can be followed methodically, that reduces the need to read each diary in order to analyze how to best clean up its Tags. I find myself cascading from one task to multiple others, as I see other unfamiliar Tags on a diary I had opened for the purpose of merely removing one Tag. I suggest selecting a letter (C) and working on eliminating the one-offs that start with 'C.' Of course there are too many that start with 'C' for one person to handle, so elect 'Ca' 'Ce' 'Ch' 'Ci' 'Cl' etc. When opening a diary in order to remove a particular one-off Tag, don't worry about fixing all the other problem Tags you may notice. Just move on, although, as long as you are accessing the edit window, you might as well correct typos and misspellings, add, e.g. 'George W.' to 'Bush'. These are relatively easy and mindless and won't get you bogged down. It's having to think and agonize over making a judgment call on someone else's work that tires my brain. I think we can make great headway eliminating one-offs if we focus only on those during a 'first phase.' Most diaries have enough other Tags so that eliminating a one-off won't consign the diary to anonymity. But if the one-off is a variant of a standard Tag, then replace it with the pre-existing Tag. This will grow easier as we all become more familiar through experience.
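The alphabetical split, with big letters broken down into 'Ca', 'Ce', 'Ch' and so on, could be generated mechanically. A sketch, assuming a plain list of non-empty tag strings and a made-up `max_chunk` threshold for what one volunteer can handle:

```python
from collections import defaultdict


def chunk_tags(tags, max_chunk=200):
    """Group tags by first letter; split oversized letters by two-letter prefix.

    `max_chunk` is an assumed limit on how many tags one volunteer adopts.
    """
    by_letter = defaultdict(list)
    for tag in sorted(tags, key=str.lower):
        by_letter[tag[:1].upper()].append(tag)

    chunks = {}
    for letter, members in by_letter.items():
        if len(members) <= max_chunk:
            chunks[letter] = members
            continue
        # Too many for one person: subdivide into 'Ca', 'Ch', 'Cl', ...
        for tag in members:
            chunks.setdefault(tag[:2].capitalize(), []).append(tag)
    return chunks
```

Each key in the result is a chunk that a volunteer could adopt.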

An alternative approach, for those who don't want to use the alphabetic one is to select a personally appealing problem area or topic, and consolidate a related grouping of Tags: example: change 'voter rights' (only a handful of diaries at this Tag) to 'voting rights.' Change 'election protection' (only a handful of diaries here) to 'election integrity' an accepted standardized term that covers this concept more broadly. Combine 'Tax' and 'Taxes' by changing the fewer into the more numerous form. Someone who likes names can correct the wrong name Tags, such as 'Madeline Albright' to 'Madeleine Albright.' What do we wish to do about Tags such as 'malevolence' and 'cruelty' which have several diaries attached?
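The "change the fewer into the more numerous form" rule is easy to express in code once someone has listed which tags are variants of each other. A sketch with hypothetical inputs: `tag_counts` maps each tag to its number of diaries, and `variant_groups` is the hand-made list of equivalent tags.

```python
def plan_merges(tag_counts, variant_groups):
    """For each group of equivalent tags, map the rarer ones to the most used.

    Returns {tag_to_replace: tag_to_keep}; nothing is changed here.
    """
    plan = {}
    for group in variant_groups:
        present = [t for t in group if t in tag_counts]
        if len(present) < 2:
            continue  # nothing to merge
        keep = max(present, key=lambda t: tag_counts[t])
        for tag in present:
            if tag != keep:
                plan[tag] = keep
    return plan
```

Where an accepted standard term exists regardless of counts (as with 'election integrity'), a person would override the plan by hand. Questions like what to do with 'malevolence' and 'cruelty' stay judgment calls.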

Simplest Approach For Removing One-Off Tags:

The simplest approach, of course, is to load the 'All Tags' page in 'Popularity' mode, and spend a few minutes working from the bottom up. Try not to bog down in the other Tags on each diary, except to correct obvious typos, or a Tag you're so familiar with that you can do it without having to think about it or read the diary. Just try to work in a way that allows you to quickly move on. It's doubtful you'll upset anyone or make any big goofs when eliminating one-offs. If it were important, it wouldn't be a one-off. When you find Tags with no diary attached save it as a bookmark until you log in here, and post it for Admin to remove.
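If the tag data could be downloaded, the bottom of the 'Popularity' list could be pulled out directly rather than by loading the page. A sketch, assuming a hypothetical `diaries` mapping (diary id to list of tags) plus a list of all tags known to the site:

```python
from collections import Counter


def tag_usage(diaries, known_tags=()):
    """Count how many diaries carry each tag; known tags with no diary count as 0."""
    counts = Counter({tag: 0 for tag in known_tags})
    for tags in diaries.values():
        counts.update(set(tags))  # count each tag once per diary
    return counts


def cleanup_worklist(counts):
    """Return (orphans, one_offs): tags with no diary, and tags on exactly one."""
    orphans = sorted(t for t, n in counts.items() if n == 0)
    one_offs = sorted(t for t, n in counts.items() if n == 1)
    return orphans, one_offs
```

The orphans list is what would be posted for Admin to remove; the one-offs list is the volunteer worklist.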

Questions: Can the programmer devise URLs for each letter of the alphabet? It can take minutes to load the 'All Tags' page and it's a pain to work with for this kind of project. Another request for the programmer: Please rewrite the software so that when one edits Tags, the Permalink/'storyonly' view reloads instead of the view with comments. Does anyone know what logic determines the order in which the one-offs are listed? It's not alphabetical or chronological. Hmm. We should have a box where orphaned tags can be posted for Admin to delete, such as 'Palo Alto iran iraq ashraf california' and 'poverty terrorism africa al qaeda foreign policy'. [I tried to link those two Tags to their URLs, but it didn't work using brackets and a space.] How much of a problem are the orphan Tags? Should the programmer create a way to make the one-off Tags delete automatically when someone deletes their diary? -Halcyon 9/26/06, 14:05 EDT

September 27, 2006

This project requires a combination of computer/database geeks and librarians to create a structure for the Tagging system. I am neither.

Today’s installment of my feeble attempt to get this project moving:

We have two tasks: 1) winnowing down and standardizing an Approved Tag List; 2) Creating a new system for Tagging diaries that minimizes future cloud explosion, and the software changes to implement this system.

The programmer should remove all one-off tags automatically on a date to be determined, following fair warning and a grace period so that people can clean up Tags on their own diaries.
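The warning-then-removal rule could be sketched like this. The 30-day grace period, the function name, and the `tag_counts` mapping (tag to number of diaries) are all assumptions for illustration.

```python
from datetime import date, timedelta


def one_offs_to_remove(tag_counts, announced, today, grace_days=30):
    """Return the one-off tags to delete, or nothing while the grace period runs.

    `announced` is the date fair warning was posted; `grace_days` is assumed.
    """
    removal_date = announced + timedelta(days=grace_days)
    if today < removal_date:
        return []  # people still have time to fix their own diaries
    return sorted(tag for tag, n in tag_counts.items() if n == 1)
```

Because the counts are taken on the removal date, any one-off that a diarist has already replaced or reused during the grace period simply drops off the list.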

There are also many two-off and three-off Tags that could well be deleted or replaced with pre-existing Tags. These will require human hands. The programmer, at the direction of a librarian, could meld a few of the humongous but synonymous Tags, i.e. variant spellings of the same subject. Someone, or a group, will have to separate out all ‘Bush’-Tagged diaries to their appropriate place on the ‘George W. Bush’ or ‘George H.W. Bush’ Tag. Same for ‘Clinton’ and all other last-name-only Tags.

There’s a whole bunch of important Tags that, though expressing the same concept, use variant terms or parts of speech that need to be combined. This is where librarians’ skills are best suited.

If the winnowing process results in a Tag cloud that is loadable on a dial-up system, then we’ve done our job. (Right now this is impossible, and even with some computers on cable service, it takes 2 minutes to load, and bogs one’s browser.)

If the cloud remains rather large, the librarians could organize the approved Tag list in a way that major categories can have their own URLs for ease of search by diarists seeking approved Tags for their diaries. All election Tags (CA-19, PA-Sen, OH-Gov; election integrity, Diebold); 9-11 Tags; Iraq Tags. Librarians probably know how to select the broad categories for separating out classes of Tags.
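Sorting the approved list into broad categories could start from simple pattern rules that the librarians maintain. The rules below are guesses based only on the examples in this paragraph; the category names and patterns are not an agreed scheme.

```python
import re
from collections import defaultdict

# Hypothetical rules: (category, pattern). First match wins.
CATEGORY_RULES = [
    ("elections", re.compile(r"^[A-Z]{2}-(\d{1,2}|Sen|Gov)$|election|Diebold", re.IGNORECASE)),
    ("9-11", re.compile(r"9-?11")),
    ("Iraq", re.compile(r"iraq", re.IGNORECASE)),
]


def categorize(tag):
    """Return the first category whose pattern matches, else 'uncategorized'."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(tag):
            return category
    return "uncategorized"


def by_category(tags):
    """Group tags so each category could get its own page or URL."""
    grouped = defaultdict(list)
    for tag in tags:
        grouped[categorize(tag)].append(tag)
    return dict(grouped)
```

Anything left in 'uncategorized' is where the librarians' judgment would be needed.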

Programmer(s) will create a method for Tagging diaries that prevents willy-nilly creation of new Tags, and eases Tag selection from the approved list.

The method for creating new Tags, as events unfold, must be worked out by the Tag Team, rules drawn up, an approval/implementation committee volunteered to take on the task of entering new Tags.

It is still to be determined who, besides the diarist and Tag Team, will have status/access to adding/editing Tags. TUs, who have more familiarity with Tags, are often able to add Tags to a diary that hadn’t occurred to the diarist, but which help to connect the diary’s content with related subject areas that would help future historians unlock the mysteries of our time.

SarahLee created a Tag Editors Workspace under her own initiative last year. Please read her suggestions. It’s here:

Is this Wiki page supposed to be a discussion board, like an ongoing diary? I hope some of you who participated in dkosopedia’s diary on September 25-26 will come on over here and pitch in. You are the ones who pointed out the computer geek/librarian angle. Some of you volunteered to help out. Heck, Kos himself even offered to pay somebody good money to solve this problem.

Am I using this board correctly? Since I’m the only one posting right now, and I see the categories listed below in the contents, in separate sections, which I assume dkosopedia put there, I wonder if I’m doing the right thing, talking to myself up here? -Halcyon, 9/27/06 16:30 EDT

--Nope, I am reading and working - I check the page and download new lists to work on every few days. Spent almost 6 hours on clean up yesterday. --SarahLee 13:02, 18 October 2006 (PDT)


I am unable to download from I keep getting a 404 page. Otherwise I would get rid of those five entries. --SarahLee 13:02, 18 October 2006 (PDT)
