I thought I knew a fair bit about MySQL before I started working on Wesabe, but I was wrong. Once your tables start to get really massive, all sorts of fun issues come out of the woodwork. There are a thousand articles out there on the basics of database optimization (use indices, denormalize when necessary, etc.), so I'm not going to cover those again. Here are some tips, most of them specific to MySQL, that I've learned on the job:
MySQL only uses one index per table when doing a query, and the query optimizer doesn't always pick the best one, so indexing everything is not necessarily a great idea. If you're always selecting on a certain set of columns, create a single multi-column index across them (see the sketch at the end of this list).
Related to the above, don't put indices on columns with low cardinality (i.e., few distinct values). You can check the cardinality of an indexed column with SHOW INDEX FROM table_name.
If any TEXT or BLOB fields are part of your SELECT, MySQL will write any temporary tables it needs to disk rather than keeping them in memory. So where possible, use large VARCHAR columns (on 5.0.3 or higher) instead. See: http://railspikes.com/2007/9/14/one-reason-why-MySQL-sucks
If you need to do case-sensitive matching, declare your column as BINARY; don't use LIKE BINARY in your queries to cast a non-binary column. If you do, MySQL won't use any indexes on that column.
SELECT ... LIMIT offset, limit, where offset is a large number == bad, at least with InnoDB tables, as MySQL actually reads and discards every row until it gets to the offset. Ouch. Instead, assuming you have an auto-incremented id field, use SELECT * FROM foo WHERE id >= offset AND id < offset+limit (props to Coda for that one).
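To make a few of these concrete, here's a quick sketch; the table and column names are made up for illustration:

```sql
-- Hypothetical table, just for illustration.
CREATE TABLE purchases (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  account_id  INT UNSIGNED NOT NULL,
  merchant    VARCHAR(255) NOT NULL,
  created_at  DATETIME NOT NULL
) ENGINE=InnoDB;

-- One multi-column index covering the columns you always filter on,
-- instead of separate single-column indices on each.
CREATE INDEX idx_purchases_account_merchant ON purchases (account_id, merchant);

-- Check the cardinality of the indexed columns.
SHOW INDEX FROM purchases;

-- Paging with a big offset makes MySQL read and throw away every row
-- up to the offset:
SELECT * FROM purchases LIMIT 1000000, 50;  -- slow on a big table

-- Walking the auto-increment id instead lets it seek straight to the range:
SELECT * FROM purchases WHERE id >= 1000000 AND id < 1000050;
```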
Update:
If you have a query in which you're looking for the most recent n items of something (e.g. SELECT * FROM foo WHERE bar = ? ORDER BY created_at DESC LIMIT 10), your query will be significantly faster if you create a multi-column index on whichever field(s) you're selecting on plus the timestamp field that you're ordering by (in this case, CREATE INDEX idx_foo_bar_created_at ON foo (bar, created_at)). This allows MySQL to look up the records directly from the index, instead of pulling all the records where bar = ?, sorting them (probably writing them to a temp file first), and then taking the last 10. EXPLAIN is your friend: if you're seeing "Using temporary; Using filesort" in the Extra column of EXPLAIN's output, that's bad.
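As a rough sketch (the ? is filled with a placeholder value here, and the exact EXPLAIN output varies by MySQL version):

```sql
-- Before adding the index: expect "Using filesort" (and possibly
-- "Using temporary") in the Extra column.
EXPLAIN SELECT * FROM foo WHERE bar = 42 ORDER BY created_at DESC LIMIT 10;

CREATE INDEX idx_foo_bar_created_at ON foo (bar, created_at);

-- After: MySQL can read the ten most recent matching rows straight off
-- the (bar, created_at) index, and the filesort should be gone.
EXPLAIN SELECT * FROM foo WHERE bar = 42 ORDER BY created_at DESC LIMIT 10;
```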
I’ll add more as we come across them. If you have any other non-obvious tips, leave them in the comments or email me (brad@ this domain) and I’ll add them too.
I decided to give tumblogging a try. I’m hoping that ease of use and lowered expectations will get me to post more often, and I’m liking it so far. It covers the large middle ground between blogging and twittering. I can put up something interesting without feeling like I need to construct a coherent post around it, and I don’t feel like I’m pestering people with twitters (not to mention that it allows for richer media types).
I realize I could just do the same thing on this blog if I wanted to, but there’s something about the minimalist formatting and ease-of-use that just seems to work. So I’ll continue to post longer pieces here (I’ve got a number that I need to just sit down and write), and I’ll keep posting inane comments to Twitter, but I imagine that the majority of interesting things I find will end up on my tumblelog.
After hearing a couple of complaints from friends about the amount of spam they’re getting, I decided to take a quick look to see where I stood. At the risk of setting myself up, I should say that I get no spam (channeling John C. Dvorak). Or at least very little: maybe one message every other day gets through to my inbox.
I think the trick is just to have multiple levels of filtering. I use a Gmail address for most online transactions (for the ones I really don’t care about, like a one-time registration just to get a trial key, I use my Yahoo address, which is nothing but a spam bucket). My Gmail address gets forwarded to my personal email address (@footle.org), which has SpamAssassin running on the server. Anything that makes it past SpamAssassin to Mail.app gets inspected by SpamSieve, an awesome Bayesian spam filter for OS X mail clients.
Anyway, my stats for the last 24 hours:
Server-level filters (spam caught):
SpamAssassin: 281
Gmail: 32
Client-level filter (spam caught):
SpamSieve: 17
Spam getting to my inbox: 0
Non-spam messages: 52
False positives: 1 (just Flavorpill, so not a big deal)
(Wow…330 pieces of spam in a single day. That seems way worse than when I last checked.)
Bonjour-enabled Git. (Although Coda tells me that Dave F. at Powerset already approached him with this idea. Neither of us has time to work on it.)
I started using Git recently, and although it’s a bit confounding at times, I love how cheap and easy branching is, and that you can commit incremental changes to your local repository and then later commit them in a batch to Subversion. Git allows for distributed development, but unless you’re on the same server as other developers, or you go to the trouble of mounting their repository over the network, you can’t check out their repository. It seems like it shouldn’t be too difficult to give Git Bonjour powers.