Dec. 9th, 2006

robbat2: (Default)

I have returned from my brief honeymoon. I'll write about it in more detail soon, but for now, a point outline only:

  • Fri Dec 1st - Travel, and the taxi driver who didn't know where the hotel was.
  • Sat Dec 2nd - In which we acquire bicycles.
  • Sun Dec 3rd - Waterfalls, lava tubes, and a close encounter of the paved kind.
  • Mon Dec 4th - Recuperation - ouch! everything hurts
  • Tue Dec 5th - Replacement bicycle, and exploring Hilo.
  • Wed Dec 6th - The finding and snorkelling of a mismanaged reef.
  • Thu Dec 7th - Boarding pass SSSS's & social engineering.

Email statistics
Here are statistics on the new email that I received while I was away. This excludes all mailing-list email, which is not subject to spam filtering, as the lists are extremely clear of spam and my procmail rules shuffle that email into separate folders quite fine. Including it would add another ~3000 non-spam emails to the count, but those are not really relevant to spam categorization success rates.
I have my spam settings reasonably conservative, as I don't mind deleting spam that makes it through the filters, but false positives are a much larger concern.
total new messages: 2191
total spam: 1771
false positives: 1 (0.045% of total)
false negatives: 446 (20.3% of total, 25.2% of spam)
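For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentages from the raw counts above. The helper name `rate` is just something picked for illustration; the computed values agree with the quoted figures to within rounding in the last digit.

```python
def rate(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Raw counts taken from the statistics above.
total_messages = 2191
total_spam = 1771
false_positives = 1
false_negatives = 446

fp_of_total = rate(false_positives, total_messages)
fn_of_total = rate(false_negatives, total_messages)
fn_of_spam = rate(false_negatives, total_spam)

print(f"false positives: {fp_of_total:.3f}% of total")
print(f"false negatives: {fn_of_total:.1f}% of total, {fn_of_spam:.1f}% of spam")
```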
The false negatives are getting very interesting now. They contain random chunks of online documents, including sentences from the document used as subjects, with an attached image as the actual spam, or cleverly merged HTML+CSS that would render the spam text over the other text. Two of them appeared to be chunks of the MySQL documentation.
The Gentoo mail aliases like mysql-bugs@g.o appear to be very badly hit with spam, accounting for nearly 70% of the false negatives. This is possibly also because I have to trust the relaying of the Gentoo email servers, and cannot check the machine that the email originally came from.

robbat2: (Default)

I meant to get back to doing more statistics on Bugzilla, but it fell by the wayside. The following is mainly for completeness, and for those interested in why Bugzilla has been so bog slow for Gentoo in the past.

First of all, I had some questions as to why I focused on specific actions in Bugzilla. The truth is that we can break down Bugzilla's usage of the database into three categories:

  1. Changes to bugs (INSERT, UPDATE)
  2. Loads of specific bugs and attachments (SELECT with a primary key)
  3. Searches for bugs (Complex SELECT)
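
To make the three categories concrete, here is a rough sketch of what each looks like as a query. It uses Python's built-in sqlite3 module in place of MySQL, and the `bugs` table with its columns is a toy schema invented for illustration, not Bugzilla's real one.

```python
import sqlite3

# In-memory database with a toy schema (hypothetical, not Bugzilla's own).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugs ("
    "bug_id INTEGER PRIMARY KEY, status TEXT, product TEXT, summary TEXT)"
)

# 1. Changes to bugs (INSERT, UPDATE)
conn.executemany(
    "INSERT INTO bugs (status, product, summary) VALUES (?, ?, ?)",
    [
        ("NEW", "mysql", "mysqld crash on startup"),
        ("NEW", "mysql", "crash when importing dump"),
        ("NEW", "portage", "emerge crash with odd flags"),
    ],
)
conn.execute("UPDATE bugs SET status = ? WHERE bug_id = ?", ("RESOLVED", 2))

# 2. Load of a specific bug (SELECT with a primary key)
one_bug = conn.execute(
    "SELECT bug_id, status, summary FROM bugs WHERE bug_id = ?", (2,)
).fetchone()

# 3. Search for bugs (complex SELECT: a string match plus other conditions)
matches = conn.execute(
    "SELECT bug_id FROM bugs "
    "WHERE summary LIKE ? AND status != ? AND product = ? "
    "ORDER BY bug_id",
    ("%crash%", "RESOLVED", "mysql"),
).fetchall()

print(one_bug)
print(matches)
```

The second kind is a cheap lookup on the key, while the third has to match a substring and combine several conditions, which is much harder to serve from ordinary indexes.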

Unfortunately, the usage patterns are heavily against Bugzilla here. Searches for bugs using some string plus a variety of conditions are the most common action. Benchmarking slow queries? That's pretty much any of the complex SELECTs. "Add more indexes," I hear some people shouting. The indexes are already nearly the same size as the actual dataset they index (400 MB of index for 500 MB of data)! There is an index on every field that is used for searching! One of the problems is that MySQL trashes its caches on UPDATEs and INSERTs in many cases, so it spends a lot of time reloading them.

Bugzilla could massively benefit from an external text indexing system such as Apache Lucene, which can handle live modifications to the index without wasting anything. Changes are fed to the index in real time, and text searches are performed against the dedicated index (which can also be parallelized easily).
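
As a very rough illustration of the idea (not Lucene's API, and nothing like its real internals), here is a toy in-memory inverted index that accepts live updates. The class and method names (`LiveTextIndex`, `upsert`, `search`) are made up for this sketch.

```python
import re
from collections import defaultdict


class LiveTextIndex:
    """Toy inverted index: maps each term to the set of bug ids containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of doc ids
        self._doc_terms = {}               # doc id -> set of terms

    @staticmethod
    def _tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def upsert(self, doc_id, text):
        """Add or update one document, touching only the terms that changed."""
        new_terms = self._tokenize(text)
        old_terms = self._doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:
            self._postings[term].discard(doc_id)
            if not self._postings[term]:
                del self._postings[term]
        for term in new_terms - old_terms:
            self._postings[term].add(doc_id)
        self._doc_terms[doc_id] = new_terms

    def search(self, query):
        """Return the ids of documents containing every term in the query."""
        terms = self._tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


index = LiveTextIndex()
index.upsert(1, "mysqld crash on startup")
index.upsert(2, "crash when importing dump")
before = index.search("crash")

# Editing bug 2 updates only its own postings; nothing else is rebuilt.
index.upsert(2, "import of dump is slow")
after = index.search("crash")
print(before, after, index.search("dump slow"))
```

The point of the sketch is that a change to one bug only touches the terms that bug gained or lost, so searches keep hitting the same structure without any wholesale reload.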

More numbers

Stuart asked for some more actual numbers, so I've put them together.

Breakdown by Request Type
Type             Mean     Max     Min
Total GET        53809    60409   45094
  Static GET     35401    39774   28797
  Dynamic GET    18407    20635   16223
Total POST        1394     1569    1106

(Graphs below the cut, hidden to avoid spamming the page.)
