JobsMeta News

Progress Report:

20/11/2001:
The searching mechanism is fully functional. The database has been ported from PostgreSQL to MySQL, and the transition required only fairly minor code modifications. The main reason for this was speed - in a test I conducted specifically for this application, MySQL was about twice as fast on the same system. The other databases considered and tested with the system were mini-SQL and InterBase.
mini-SQL died horribly after blowing 1 GB of RAM and went swapping in order to do a join between two 30 MB tables and perform a full text search. It also took a very long time before it died.
InterBase worked fine, and it was reliable, but some of its features were a bit odd. Most importantly, however, it was much slower than PostgreSQL, even after it was tuned according to the documented parameters (it was given enough memory to keep the whole database in cache, and to still have a disk copy of the database in the OS cache). It was hence rejected as a viable option.
I have also considered downloading and trying the free trial versions of Oracle and SAP, but by the end of the exercise of installing and comparing the performance of four databases, and porting my application four times, I really didn't have enough energy left to also go and play with something as heavy as Oracle and SAP. Maybe another time, in another application.

Currently polled job sites are: Gis-A-Job, Job Board, JobServe, JobFizz, Job Search, TechnoJobs, Planet Recruit and Work Thing. That just about covers all of the biggest ones. There are still one or two left, and I will add those in due time.

The paging has been implemented. Currently, there are 10 jobs per page, but this is easily configurable. This will be adjusted appropriately in the near future. I will probably need to change the layout of the displayed jobs, as currently a lot of space is wasted - must have a think about what the best way to do this is.
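As an illustration of configurable paging, here is a minimal sketch in Python (the site itself is written in Perl). The function names are hypothetical; the only figure taken from the text is the default of 10 jobs per page. The offset and limit it returns are the kind of values a `LIMIT` clause would take.

```python
def page_bounds(page, per_page=10):
    """Return (offset, limit) for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page, per_page


def page_count(total_jobs, per_page=10):
    """Number of pages needed to display total_jobs results."""
    return (total_jobs + per_page - 1) // per_page
```

Changing the page size is then a matter of passing a different `per_page` value, without touching the query logic.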

Jobs are kept for 7 days - after that they are deleted. All the retrieval and maintenance tasks have been fully automated. Another advantage of MySQL over PostgreSQL has been that the daily cleanup of the database is much faster. Not that it matters at midnight. The number of jobs seems to hover between 40K and 50K, roughly averaging about 10K per working day.
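The 7-day retention rule could be sketched roughly as follows. This is Python with the standard `sqlite3` module standing in for MySQL, and the `jobs` table and `retrieved` column are assumed names; whether the real cleanup keys on the retrieval time or the posting time is not stated above.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 7  # figure taken from the text


def purge_old_jobs(conn, now, retention_days=RETENTION_DAYS):
    """Delete jobs retrieved more than retention_days before `now`.

    Returns the number of rows removed. Timestamps are stored as
    'YYYY-MM-DD HH:MM:SS' text, which sorts chronologically.
    """
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute("DELETE FROM jobs WHERE retrieved < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run once a day from a scheduler, something like this keeps the table at roughly one week's worth of adverts.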

23/11/2001:
JobsMeta has been moved to a faster server (finally!). It is no longer running on an old Cyrix 6x86 133+ (which has been recommissioned for another job)!

It has been moved to a newer server (although it's nothing to really write home about). It is now on a Dual Pentium MMX 200 MHz, with 128 MB of RAM, and a 6 GB 3-disk SCSI RAID-5. It is only Linux software RAID, but it does the trick nicely. Tested by pulling one of the disks out. :-)

I guess I can sleep a little easier at night, without having to worry whether I have forgotten to change the backup tape. :-)

Tomorrow I'll be moving the JobsMeta database to a new server as well. It is going to be a downgrade, though. It will be moving to a machine of exactly the same spec as the new web server (identical, in fact). Currently it is on a Dual Pentium III 1 GHz with 1 GB of RAM, but the queries are so fast anyway that it shouldn't make a noticeable amount of difference, and I need the bigger server for something else...

24/11/2001
The JobsMeta database has been moved to the new server. The performance is still more than reasonable, although the difference is fairly obvious. It is comparable to the old setup with a 110 MHz web server and 2 GHz database server, which indicates that the web server faces the heavier load, and its performance is hence more closely related to the performance of the web application. Still, this configuration is very similar in speed to what the old configuration was with PostgreSQL.

The MySQL server has been recompiled, and now it can index the term "c", which is quite clearly useful when looking for a C programming job. :^)

Changed the paging setup to do 25 jobs per page.

Re-wrote all the data extraction modules. Now the download process takes into account the remote ID of the job being downloaded. If the job has already been entered into the local database on a previous run, the page containing its details will not be downloaded again. This should cut bandwidth use by a factor of somewhere between 10 and 250 during the less active periods. Obviously, when there are lots of fresh adverts being posted, this won't provide much benefit. The downside is that the database gets a SELECT query for every job to see if it is already there, but this is OK since SELECTs are cheaper than INSERTs, downloads are much slower than local database access, and parsing the pages takes up most of the time, so any saving to be made there is welcome. Overall this solution should be much faster. Hopefully, I will finally be able to turn the update cycle timing down to 5 minutes instead of 10 minutes. Currently it's down to 1 minute, but the only way to make sure will be to wait until next week and see what the benefits are during the heavy periods. In theory, the more often jobs are updated, the bigger the saving, because the more duplication there is likely to be in the latest job lists. Eventually, we should reach a point where the bandwidth used for downloads will go down to a constant.
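The check-before-download idea can be sketched like this. It is a Python illustration with `sqlite3` standing in for MySQL, not the actual Perl module; the table layout, the function names and the `fetch_details` callback are all assumptions made for the example.

```python
import sqlite3


def make_db():
    """In-memory stand-in for the jobs table (schema is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (source TEXT, remote_id TEXT, body TEXT, "
        "UNIQUE (source, remote_id))"
    )
    return conn


def already_stored(conn, source, remote_id):
    """One cheap SELECT per listed job, as described in the text."""
    row = conn.execute(
        "SELECT 1 FROM jobs WHERE source = ? AND remote_id = ?",
        (source, remote_id),
    ).fetchone()
    return row is not None


def update_source(conn, source, listed_ids, fetch_details):
    """Download details only for jobs not seen on a previous run.

    fetch_details is whatever function downloads and parses one details
    page; the return value is the number of pages actually downloaded.
    """
    downloads = 0
    for remote_id in listed_ids:
        if already_stored(conn, source, remote_id):
            continue  # skip the expensive download and parse
        body = fetch_details(remote_id)
        conn.execute("INSERT INTO jobs VALUES (?, ?, ?)", (source, remote_id, body))
        downloads += 1
    conn.commit()
    return downloads
```

With overlapping "latest jobs" lists, a second run only pays for the adverts that were not in the previous list, which is where the bandwidth saving comes from.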

JobFizz couldn't be optimized in this way because all their latest jobs are on a single page with no separate 'details' page, so the whole lot has to be downloaded anyway.

Added the icons for the software used to create this web site, and the general methodologies adhered to during the design.

25/11/2001
The data acquisition modules have been enhanced further, in order to save bandwidth. It would appear that as of recently, the Perl libwww supports 'gzip' and 'deflate' encoded content. All the data acquisition modules have been modified to send the 'Accept-Encoding: gzip, deflate' header. This means that, provided the servers on the other end support sending compressed data, the pages will be downloaded compressed, thus providing a further saving in bandwidth. This time, the saving should be seen primarily during the busy periods, so coupled with yesterday's modifications, this should provide a sizeable increase in throughput.
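For illustration, the same negotiation looks roughly like this in Python's standard library (the real modules use Perl libwww, so this is an analogue, not the actual code). The function names are hypothetical. The fallback for raw deflate streams is an addition of this sketch, included because some servers send deflate data without the zlib wrapper.

```python
import gzip
import urllib.request
import zlib


def build_request(url):
    """Request that advertises support for compressed responses."""
    return urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"}
    )


def decode_body(raw, content_encoding):
    """Undo the Content-Encoding the server reported, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw  # server ignored the header and sent plain data


def fetch(url, timeout=30):
    """Download one page, transparently decompressing it."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

A server that does not support compression simply omits the `Content-Encoding` header, in which case the body is returned untouched.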

mod_gzip has been installed on the Apache web server since this project began, so the speed perceived by the users will not be directly affected by this modification. The perceivable performance improvement will come from more network bandwidth and CPU time being available for tasks that are actually useful, such as performing queries, generating requested pages, and generally serving the user's requests.

Added performance stats to the query page. Now it displays the time that it took to execute certain parts of the request. The results are somewhat disappointing. The query parsing into SQL and processing of the data is pretty fast, but the query times are fairly poor. It seems to vary between around 0.5 and 5 seconds, depending on the number of search terms. I suspect that MySQL will improve in performance in the next few versions, as there is some optimization to be made in the Full Text Search mechanism. But then again, moving everything to a dual 2 GHz server will probably solve this problem, and move the query times down to 0.05 - 0.5 second range. Right now the performance is such that the system will peak at handling around 100-150 hits/minute before the point of goodput collapse. The chances are that the server will be upgraded long before even half of this throughput is reached, especially since a typical query will execute in well under a second. It's only the 'stress' queries I use for testing that take a while.

The more worrying aspect is the bandwidth consumption. A full page with 25 results averages around 100-110 KB. That means that the uplink can transfer one page every 6 seconds or so, provided the user is on at least dual ISDN (128 Kb/s). This limits the throughput to a maximum of around 10 hits/minute, which is most definitely NOT good. The connection will saturate WAY before the current server will max out. But, things aren't actually that bad. mod_gzip comes to the rescue here. An average page has been tested to compress down to less than 10% of its size (down to about 9 KB). This increases the throughput we can handle approximately 10 fold, up to about 100 hits/minute. Seeing as this is just before the point where the database server will peak, no upgrades will be necessary until the bandwidth is increased first.
On the plus side, the actual parsing and processing times are VERY low, so only the database server REALLY needs to be upgraded.
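The bandwidth arithmetic above can be checked with a few lines of Python. The sketch assumes the 128 Kb/s figure is the limiting link speed, takes 105 KB as the midpoint of the 100-110 KB page size, and treats 1 KB as 1000 bytes; protocol overhead is ignored.

```python
def hits_per_minute(page_bytes, link_bits_per_second):
    """Pages per minute a link can carry if it does nothing else."""
    seconds_per_page = page_bytes * 8 / link_bits_per_second
    return 60 / seconds_per_page


LINK = 128_000  # 128 Kb/s, the dual ISDN figure from the text

uncompressed = hits_per_minute(105_000, LINK)  # ~105 KB page
compressed = hits_per_minute(9_000, LINK)      # ~9 KB after mod_gzip

print(f"uncompressed: {uncompressed:.1f} hits/minute")
print(f"compressed:   {compressed:.1f} hits/minute")
```

This works out to roughly 9 hits/minute uncompressed and roughly 107 hits/minute compressed, in line with the "around 10" and "about 100" estimates.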

Changed the icons at the bottom of this page to links to the relevant web sites.

26/11/2001
- The updated, optimized retrieval routines are working VERY nicely. Even during the normal daily peak times, the downloads seem to execute without fail in under 50 seconds. I have turned the update frequency down to 3 minutes. This should provide a MUCH better catchment of jobs from lists that include as few as the latest 10 jobs, as they overflow and roll over very quickly. The added bonus in all this is that it allows a much fairer comparison of the number of jobs posted to different job sites. It would appear that JobServe is not quite as heavily posted to, compared to the other ones, as it previously appeared. It looks like all the other sites that are being crawled (7 of them, at the moment) do actually match the number of jobs posted on JobServe. This means that JobServe is 'only' as good as most of the others put together. Before, it appeared as if it was twice as good as all the others put together (2/3 - 1/3 split).
- Fixed a bug in how the remote IDs were uniquely indexed. This could have potentially prevented some jobs from being downloaded.
- Enhanced the Agency download module to use the same enhancements (compressed downloads, pre-fetch pre-checking) as the Jobs downloads. This should speed things up on that front, but this was really just for the sake of completeness, as the Agency updates run only once per day anyway.
- Compiled and installed PGCC-2.95.2.1 (GCC-2.95.2.1 + PGCC patches). Re-compiled MySQL with maximum optimizations (-O6 and a few other things). Stripped the MySQL daemon (it shrank down to half of its size). The MMX optimizations seem to break things in places so they have been switched off. The net benefit measured in heavier queries seems to be in the region of 1-5%. Not a really astounding improvement, but an improvement, nevertheless. Running MySQL server with --skip-safemalloc (yes, I know that is a stupid idea) actually makes it slightly slower for some reason...

27/11/2001
- Added the Statistics page. It shows the breakdown by Source and by Agency. This is done by a live search on the database and is not periodically pre-generated. The Total figures may not be the same everywhere because updates are constantly occurring, so the database may have changed between the runs of the individual queries that generate the reports. Not much else added today...

28/11/2001
- Changed the Statistics page around a bit. It now includes the rolling stats for the week so far (whole database), statistics for the current day (as of midnight) and statistics for the previous day.
- Had a brief email exchange with some of the MySQL development crew. Apparently, the 4.0.1 release will have a re-worked Full Text Search engine for boolean queries (the search type that is currently used by JobsMeta) which should make it MUCH faster. In the meantime, I'm looking into increasing the number of stop words in order to speed up Full Text Search queries. The downside is that in order to do this I will have to recompile MySQL. Not that that is a problem.
- Just recompiled MySQL after adding a bunch of frequently occurring stop words. It doesn't seem to have made any difference... Oh, well. I guess I'd better wait for MySQL 4.0.1...
- Added the ********* module to the pool. Now I have to investigate if there are any other job boards out there that are worth looking at...

30/11/2001
- Added the 4Weeks module to the pool. It doesn't seem to be doing very much, as it has only contributed 13 jobs today... Quite poor.
- Added some Agency filtering to the upload stage. Some agencies post under several different variants of their name (e.g. with and without the 'Ltd.' at the end). The new filter handles that for some agencies that are contributing the greatest number of jobs. More will be added as time goes on.
- Spent most of the time yesterday looking for other job search sites. There are quite a few of them out there, but most of them are quite severely lacking in content. A total of 20 new sites have been identified that might be included at some point.
- Another thing I have been thinking about is the prospect of somehow filtering out the duplicate messages. This problem occurs due to programs such as WebSalvo and Conkers that allow agencies to post jobs to a number of different search sites. Unfortunately, for JobsMeta that means duplication - potentially quite a lot of it, too. I've been considering ways to filter out the duplicates, and to prioritize adverts, for example, by source (which job site they came from).
- This prioritization is likely to be initially performed by the rule of favoring the sites with most content. This is primarily because it is the most quantifiable measurement I can think of. This method is going to be changed as and when I get contacted by people running the job sites (those who appreciate this site will move higher, thus overwriting the lower ranking entries from other sites). Of course, certain sites are likely to get dropped, as and when those in charge of their management decide that they don't want people to find jobs more easily, even if they do get the credit...
- The secondary optimization method is likely to be the time when the advert is posted. The later advert is favoured more.
- Unfortunately, this duplicate filtering is likely to be more difficult than it may seem. It is likely to require huge amounts of processing time to perform reasonably reliably. Every job will have to be checked against every other job. Ideally, this should be done at insertion time, because that way the number of jobs in the database would be kept to a minimum, thus yielding a faster checking time in the average case. Of course, doing that once per day would be much faster and more convenient, but most jobs would lose their "freshness" by then, so it wouldn't actually help for most users.
- Currently I'm thinking about just doing straight word counting. It is a rather crude way of handling this, but it is also quite effective, and it should be fairly fast, with some clever use of Hashes in Perl. The Hashes could even be stored with the data in the database, thus avoiding the parsing of the message every time - instead it could just be loaded pre-parsed.
- For jobs that appear to be close, some sort of secondary matching could be performed. Wording structure, for example. Use a moving window of two words and compare them. This should be far more effective in finding duplicates. It shouldn't be significantly slower than the first method, so this may well get implemented as the only comparison method.
- It should also be possible to break the problem down into small pieces, by separating the jobs by the Agency that posted them. This should make the comparisons execute much faster, as there are far fewer records to compare to each other.
- I will sort this out at some point, when the number of duplicates starts getting in the way. It is likely that I will have to add a few more crawler modules before that starts happening.
- Changed the update timing. During the peak hours, it is set to 1 minute. One hour outside peak hours it is set to 2 minutes. At all other times, it is set to 5 minutes. This should alleviate some of the stress on the servers, and help them get on with more useful work over night.
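The duplicate-filtering ideas in this entry (word counting, a moving window of two words, and splitting the work by Agency) can be sketched as follows. This is Python rather than Perl, with `Counter` playing the role of a Perl hash. The scoring formulas, including the Jaccard overlap used for the two-word windows, are choices made for this sketch and are not specified in the notes above; all names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def words(text):
    """Lower-cased word list; keeps '+' and '#' so 'C++' and 'C#' survive."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def count_similarity(a, b):
    """Share of word occurrences two adverts have in common (0.0 to 1.0)."""
    ca, cb = Counter(words(a)), Counter(words(b))
    shared = sum((ca & cb).values())  # per-word minimum of the two counts
    total = max(sum(ca.values()), sum(cb.values()))
    return shared / total if total else 0.0


def word_pairs(text):
    """Set of adjacent two-word windows in the advert."""
    w = words(text)
    return set(zip(w, w[1:]))


def pair_similarity(a, b):
    """Jaccard overlap of the two-word windows of two adverts."""
    pa, pb = word_pairs(a), word_pairs(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def candidate_pairs(jobs):
    """Yield pairs of advert texts posted by the same agency.

    jobs is an iterable of (agency, text) tuples; only adverts from the
    same agency are ever compared, which keeps the number of pairs down.
    """
    by_agency = {}
    for agency, text in jobs:
        by_agency.setdefault(agency, []).append(text)
    for texts in by_agency.values():
        yield from combinations(texts, 2)
```

Word counting alone cannot tell a reordered advert from an identical one, whereas the two-word windows are sensitive to word order; a threshold on either score would have to be tuned against real data.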

02/12/2001
- Had an interesting event today - the database server crashed. Not just the MySQL daemon - the whole system. Bizarre. Mind you, it is rather hot in this room with all the computers, and they are clocked a bit out of spec... The web servers also had a problem a couple of days ago. If this happens again in the next week, I may have to clock them down to 166 MHz...
- Having said that, the web server went on overload today with all the data acquisition. That shouldn't have really happened. Could have been caused by network congestion... Either way, it doesn't matter. All these problems will go away when this is all moved to a production server.

04/12/2001
- Argh! The threaded architecture I used for the updating script has proven to be too much for the server to handle. Basically, the updater executed the update from each site in a separate thread. With lots of CPUs, and lots of memory, this would have been significantly faster. In the real world of a limited server, though, 128 MB of RAM and 150 MB of swap have been blown during the congested periods. The problem was, as far as I can see, that during the heavy periods more needs to be downloaded and extracted. Certain threads took longer than the update interval to finish, and the parent thread was waiting, for the purpose of consistency - I didn't want to disconnect the parent thread. That means that a parent thread, nearly the same size as the children, was sitting, waiting for the children, wasting memory. Eventually, the point of goodput collapse on memory was reached, and the whole thing just stopped. In the end, the watchdog just rebooted the server, after the CPU load went through the roof.
- I have reverted the updater back to a single-threaded design, but kept the frequency the same. I am hoping that the simultaneous instances will overlap and interleave reasonably nicely, with a minimum of work duplication. The current execution time is around 2 minutes, but this shouldn't be a problem, because the new thread is started every minute, so at least statistically, the catchment rate should be the same and duplication should be minimal (in the long term, anyway).
- The bonus is that the whole update now requires about 11 MB of RAM, instead of about 65 MB (40 MB shared, so really only about 30 MB total). Needless to say, with 'only' 128 MB of RAM, it wasn't that difficult to blow things out of the water. Now that should be much more difficult. So far it seems to be working quite nicely, but I won't know for sure how beneficial it is until the peak 11:30 - 13:00 period tomorrow... Other pros and cons are that it is now nicer to the bandwidth, as there is only one download happening at any one time. The downside is that this makes it more vulnerable to bandwidth congestion, as if the current download blocks, it will block the execution of the entire process, rather than just the download for the single web site. More finely grained downloads would be nice, but I suspect that it will have to wait until the new server is set up - if the difference turns out to be significant in the first place.

05/12/2001
- Umpf... Had another server crash. The retrieval server this time. Slowed it down further, to 180 MHz. Something gives, I just don't know exactly what... If this fixes it, it is possible that it is the bus that can't quite handle 66 MHz. Bizarre, I know, and highly unlikely, but possible. If this works out, I might try 210 MHz later. The database server will be re-clocked either when it crashes next time, or when the retrieval server clocks a week of up-time - whichever happens first. I expect the former, but hope for the latter...
- Another problem is that the server was running out of steam at the peak period before this. The memory problem seems to have been alleviated somewhat, but I am looking at the diagnostic display at the moment, and there are 15 copies of the update application running. This indicates a very serious backlog, and it is still increasing. It remains to be seen whether it will clear itself up on its own, or whether a bigger server will simply be necessary.
- Nope, it just broke. The server ground to a halt and rebooted itself. I have changed the cron job so it slows down to one run every 2 minutes in the 12:00 - 13:00 period during working days. Not an ideal solution as it is likely to damage the catchment rate, but there doesn't appear to have been any other way to combat the problem on the current server... Now that I have been forced to do this, I think I will go back to the multi-threaded parallel setup and see if it can cope now that it is run half as often...
- The answer appears to be yes, but guess what - I won't know for sure until tomorrow. I seem to have found out what one of the contributing factors was. The JobServe site seems to be either down or severely overloaded at the moment. I've just tried accessing it from both an NTL connection and a U-Net connection, and it just isn't talking at all. This, coupled with the single-threadedness of the approach I was using, was definitely a contributing factor that halted the execution of the whole update process - including the processes that could have been running usefully... This is another reason why the multi-threaded approach is so much more desirable - a single site going away doesn't break the whole thing as easily. True, the download will eventually time out, but by the time it does, the memory will get more and more and more congested until the whole thing stops. It will be interesting to see when JobServe will come back to life...
- Planet Recruit seems to be having problems, too. It does appear to have come back to life, but it is utterly grinding to a halt...
- It also turns out that the memory consumption difference between the threaded and non-threaded method isn't as great as it initially appeared. The difference seems to be at most around 20%.
- With JobServe down and PlanetRecruit only responding occasionally, the listings of today's jobs on JobsMeta are likely to suffer... Oh, well, such is life. PlanetRecruit seems to be bouncing back up... Slowly...
- Need new server... Or do I...
- JobServe has bounced back to life. It's likely to go on overload soon, as an hour and a half's worth of jobs at peak time (a few thousand at least) is going to hit it pretty quickly. It'll be interesting to see if it falls over again under the load.
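The point made in this entry, that with one thread per source a single unresponsive site does not hold up the others, can be illustrated with a small Python sketch. It uses `concurrent.futures` rather than whatever threading the Perl updater uses, and `fake_source`, `update_all` and the deadline are hypothetical. Note that a stalled worker keeps running (and keeps its memory) until its own download times out, which matches the memory pressure described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_source(jobs, delay=0.0):
    """Hypothetical stand-in for a download module: sleeps, then returns jobs."""
    def fetch():
        time.sleep(delay)
        return jobs
    return fetch


def update_all(sources, deadline_seconds):
    """Run every source's fetcher in its own thread.

    sources maps a site name to a zero-argument fetch function. Returns
    (finished, stalled): results from sites that answered before the
    deadline, and the names of sites that did not.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fetch): name for name, fetch in sources.items()}
    done, not_done = wait(list(futures), timeout=deadline_seconds)
    pool.shutdown(wait=False)  # do not block on the stalled downloads
    finished = {futures[f]: f.result() for f in done if f.exception() is None}
    stalled = sorted(futures[f] for f in not_done)
    return finished, stalled
```

For example, `update_all({"fast": fake_source(["j1"]), "slow": fake_source(["j2"], delay=0.5)}, 0.1)` returns the fast site's jobs and reports the slow site as stalled, instead of waiting for it.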

06/12/2001
- I have identified the causes of some of the crashing experienced by the update server. It would appear that it is related to the network card. The database server doesn't seem to be having the same problems and the OS has literally been cloned (not installed) onto it, so the two are identical in that regard. The crashes weren't actually 'crashes' as such. The machine would just lose network connectivity. I foolishly failed to check the routing tables before rebooting the system. The bizarre thing was that resetting the network interface, and even reloading the module, didn't seem to fix the problem. Rebooting the machine, however, did. That pretty much leaves me with the impression that it is either the routing tables getting corrupted (less likely), or the network card breaking (more likely).
- Either way, I'll replace the network cards and see if that fixes the problem. Kind of makes me feel slightly futile when I'll be building a new server to replace this one anyway...

07/12/2001
- It would appear that I have found the cause of the random network-loss problems on the update server. Both the web and the database server have been tuned down to 180 MHz, and it all seems to be working perfectly now.
- The update time has been turned down to 2 minutes all round. According to the statistics I have reviewed, the loss-rate due to this seems to be minimal. Now there are only two time bands for updates - the 5 minute band (20:00 - 08:00) and the 2 minute band (08:00 - 20:00). The same banding has been applied to weekends. This makes the cron schedule file much simpler. It should also yield a considerable saving in bandwidth, prevent extreme peak hour congestion, and even allow some breathing space for adding more plug-in modules for additional search sites - even though most of them don't seem to be carrying enough jobs to make their inclusion worthwhile...
- Increased the storage time to 2 weeks, mainly for statistics gathering purposes. Added statistics readout for last week and current week, to allow for comparison and general observation in the movements in the market. The downside is that the statistics page is now even slower, and it even times out sometimes.
- Got fed up with the statistics page being too slow. There is far too little (if any) benefit in keeping things for longer than one week anyway.
- The 2 minute update seems to often max out on downloads from TechnoJobs. Others are doing reasonably OK. When the new server is in place, the update will have to be turned back up to once a minute.

08/12/2001
- The weekend updates now run on the same schedule as the working day ones. The reason for this is that a certain traffic pattern has been observed that indicates catchment failures and overflows when the 2 minute interval is used.
- Small HTML updates for the statistics page to make it work better with Internet Explorer.
- MySQL server has been reconfigured slightly, with some memory allocation changes and Full Text Search tweaks. The result is an approximately 100% improvement in performance of Full Text Search features, and a noticeable improvement in performance of the statistics page.
- Not a single glitch has occurred since the servers have been tuned down to 180 MHz. I think the cause of the problems is becoming fairly clear. There is the 192 MHz setting which could still be tried in a final attempt to get the last bit of performance from the system, but it is possible that this would either cause instability again, or actually yield reduced performance, due to the bus being slowed down further, down to 55 MHz. Either way, things are working now.
- With a bit of luck, MySQL v4.0.1 will bring enough improvement on the Full Text Search front that the new server will not be necessary. That would be quite nice...
- Changed the database layout slightly. The Posted field is now DATETIME instead of TIMESTAMP. The Retrieved TIMESTAMP field is now autogenerated by MySQL at insertion time, which is exactly what I want anyway. This has simplified the data uploading code slightly, and probably gained a tiny bit of performance.

11/12/2001
- Made a small change to the way indexing is done in the database. It should improve the performance of the job uploader.
- Added a download module for the Computer Staff web site.
- Added the Downloads Page. This is where the full source code for JobsMeta can be downloaded. The snapshot is updated by a daily cron job at midnight. The source code is released under the GNU General Public Licence. Not all the files have been updated to state this clearly in the headers yet, but it will be done at some point.
- Tidied up some job retrieval code - things should again be a really tiny bit faster.

12/12/2001
- Added the CVStore retrieval module.
- Adding the last two retrieval modules seems to have pushed the current server right up to its limits. It is nearly going on overload with the one minute update interval at several other times of day, most notably around 16:30-18:00. This means that I will now have to slow down the whole daily update to 2 minutes. This is the only sensible option, really. At least that way there is some spare capacity for the web server to run on the same machine. Obviously, the catchment rate will suffer fairly seriously, especially at peak times, but since this is all just a development system, it should be OK for the time being. The production system will be running on a much bigger server that should easily be able to cope with all three sides of the application (web server, job retrieval, database) with loads of capacity to spare. Oh, well - I guess this finally makes the decision of "new server vs. no new server" for me...
- Reviewed the idea of adding some of the other web sites to the download list. Most of them don't seem to be worth it, with a grand total of around 1 job posted over the last week (I am NOT kidding) for some of them. Others seem more promising, but are likely to prove a bit tricky to extract data from. But, I'm sure it can all be done in a meaningful way...

13/12/2001
- Added the ITJobs4 retrieval module. This definitely pushes the limits on the current server. Even if the one minute update frequency wasn't too much before, it definitely would be now. I am just concerned that the peak hour updates will fail, even with the timing reduced to 2 minutes... On the positive side, this one has reasonably decent content. Not quite up to the volume of the heavies, but it is about half way up the throughput range.
- Added ITPaths retrieval module. The server is starting to feel the strain...
- Checked all the download modules, and made sure that no unnecessary libraries are included. This should make the runtime memory requirements slightly smaller. And oh, does the server need all the help it can get...
- All this, and it's barely the afternoon! :-)
- 16 known job sites remain that have not yet been added to the search. At least a quarter of those are probably not worth adding due to their small throughput. Some of the remaining ones may be difficult to extract data from. Either way, things are definitely progressing.
- Added ITJobLink retrieval module. 15 job sites remaining. Damn, I'm having a productive day.
- Increased the number of top agencies listed from 20 to 25.

14/12/2001
- Moved the progress report to the new News page from the About page. Now there is a separate News/Downloads page where the Agencies page used to be. Since this site is not intended for posting by Agencies, the link was not serving any purpose anyway.
- Some sort of "locking" is under consideration. It could be done per update, or per thread. This would potentially allow the update time to be set to, say, one minute, and the following updates would then not start if a previous one hasn't finished. Per update locking would be simpler to implement. It would just mean that if an update is started before the previous one has finished, the new update process would immediately abort. Per thread locking would allow a much finer granularity of the update process locking. It would allow the locks to be placed for each thread that relates to a download from a specific search engine. This means that the following updates would still execute for the source sites that do not have other updates currently running. This should be a much more sophisticated and effective system than the one initially used where the new process simply killed the old one.
- There is also the issue of detecting and removing stale locks, and killing the relevant processes when they take too long (if they are running at all, and haven't crashed). Nothing that cannot be solved, though. Must go away and think about this for a little while.
- Added a REx retrieval module. This one was a bit trickier than usual because their layout structure is slightly dynamic. But, it was possible to compensate for that without too much trouble.
- UMTSWork web site seems to use PRECISELY the same engine as the REx web site. They are so similar, including the CGI (ASP, actually) parameters, that it almost smells fishy. In fact, it couldn't be fishier if it came with tartar sauce. Either way, it should make writing the retrieval module a piece of cake.
- Yup, I was right - identical. Even down to the dynamic peculiarity of the REx page code. Oh, well, makes my life easier. Need I add that UMTSWork is now in the retrieval list? :-)

17/12/2001
- Added the ITJobChange retrieval module. Eleven other known job search engines remain. Most aren't worth the bother, though...
- It would appear that most of the server overloads are actually occurring as a consequence of what appears to be a bug in the cron daemon. It sometimes triggers more than once for a specific execution. This means that sometimes there is an explosion of activity for no apparent reason that causes the whole system to overload. There is only one thing I'm aware of that could be causing this to happen - the hourly time-sync process. It is, however, extremely unlikely that this would cause a problem like this.
- The problem detected in the previous item can be overcome by the process locking mentioned earlier. This means that the process locking will solve the overload problem from two fronts, which means better catchment rates and faster updates. Hence it has moved to the highest priority item on the TODO list. The only problem is that I am not entirely sure what is the best way to implement it...
- Had another server crash, this time the database server. Both servers have finally been turned down to completely stock settings (166 MHz). I guess the combination of CPUs and motherboards just isn't up to being overclocked. Considering all the other problems this setup was having, I should probably consider myself lucky if it works even in stock form...
The upshot is that even though the CPUs are clocked to 10% less, the bus is clocked to 10% more, and the bus bandwidth is probably a bit more important in this application.
- Process level locking has been implemented. It implements what is sometimes called a 'Highlander Solution' - there can be only one process executing an update from a particular source at any one time.
The update interval has been turned down to 1 minute. The lock "staleness" has been set to 2 minutes. This means that if the lock is older than 2 minutes, it will be deleted, and the new update will start anyway. This should take care of the peak time server overloads by only slowing down the threads that are taking a long time, while capping the amount of interference at what plain 2-minute updates would cause. In other words, it will never be slower than 2-minute updates, but it can go as fast as 1-minute updates. The idea is to provide a self-adjusting setup where we only provide the limits for how fast and how slow the updates can work. After that, the system will try to get as close to the fastest setting as the hardware will allow.
I will experiment tomorrow with setting up the system to attempt updates every 30 seconds. Cron cannot deal with sub-minute intervals, but I have an idea about how to work around that.
The really cool thing is that the locks actually get checked and applied before the master process forks to execute the updates. This means that the memory requirements are kept minimal. Even better, in order to keep the lock time down to a minimum, the locks are removed inside the child thread, just after the data has been retrieved, but before the data has been uploaded. This should ensure that the locks are in place for as short a time as possible, and the updates are actually nearly instantaneous, because they are queued on the database server side.
The problem is that if things go badly wrong, very long running processes can still cause server overloads. This is because the locks aren't 'final'. Instead, they have a 2-minute life-time. I may implement some sort of a stale process hunting mechanism later to make absolutely sure that things never have a chance to get out of hand. But for now, this should be more than adequate, and more sophisticated solutions are probably not going to be necessary any time soon...
- Made some modifications to the page stripping in all the modules. The page pre-processing for data extraction should now be much faster. In fact, it is so much faster that the whole update sequence now takes under 40 seconds! Considering that it was slower than that before I added half of the retrieval modules that are here now, that is a pretty serious improvement. :-)
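
The process-level locking described in this entry can be illustrated roughly as follows. This is only a sketch in Python, not the site's actual code: the lock-file layout, the function names acquire_lock and release_lock, and the use of file modification times are all assumptions. Only the one-lock-per-source rule and the 2-minute staleness limit come from the notes above.

```python
import os
import time

STALE_AFTER = 120  # seconds; the 2-minute staleness limit from the entry above


def acquire_lock(lock_dir, source, now=None, stale_after=STALE_AFTER):
    """Try to take the lock for one source; return True if we now hold it."""
    now = time.time() if now is None else now
    path = os.path.join(lock_dir, source + ".lock")  # hypothetical layout
    try:
        age = now - os.path.getmtime(path)
        if age <= stale_after:
            return False  # a fresh lock exists: skip this source for this run
        os.remove(path)  # stale lock: delete it and start anyway
    except FileNotFoundError:
        pass  # no lock yet (or another process just removed it)
    try:
        # O_EXCL makes the creation atomic, so only one process can win.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(lock_dir, source):
    """Remove the lock; meant to be called once the data has been retrieved."""
    try:
        os.remove(os.path.join(lock_dir, source + ".lock"))
    except FileNotFoundError:
        pass  # already removed as stale by a later run
```

In this sketch the master process would call acquire_lock for each source before forking, and each child would call release_lock right after its download finishes, which matches the ordering described above.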

18/12/2001
- The new locking code is working absolutely beautifully. However, it is rather worrying to see exactly how often the double triggers actually happen - according to the logs, it happens on average once every 15 minutes or so. That is fairly serious, and coupled with some heavy peak traffic, I can see how it could cause the server to go away for a rather long time.
- It looks like the update frequency of 1 minute is not quite sufficient for the full catchment. ********* has just maxed out at 20.
OTOH, some catchment failures are pretty much inevitable with this kind of application. I'll have to investigate how often this catchment failure occurs, but right now it doesn't seem to be particularly serious.
Unfortunately, if this is caused by what I think it is caused by, even cranking up the updates to 30 seconds will not help. So, it is probably best to just leave things as they are now...
- It looks like the biggest culprit for server overloads during peak hours was actually JobServe. And it isn't because of the number of jobs they get posted during the peak periods - it is because of the load they face from users during the peak periods!
The number of jobs retrieved is rather small (often less than 5 per update), but the lockouts occur quite often. This implies that there is a problem with the JobServe servers coping with the load. Absolutely pathetic - I was really expecting the 'best' job search engine in the country to be able to cope better. But then again, they are running things on ASP, and probably everything else that implies, so it is probably not really all that surprising... More disappointing than surprising. ;-)
All this means that the locking mechanism that has just been implemented is improving the performance of JobsMeta in more ways than it was anticipated. :-)
- Changed the site appearance slightly, to make a slightly better compromise between the different browsers.
- The web server has crashed AGAIN! Argh! I'm stumped now. I thought I had the cause pinned down to overclocking. It turns out that I must have been wrong. Again, there is nothing in the log files...
Now I'm back to square one. It could be a faulty network card, motherboard, memory, CPU, anything. I am going to have to keep an eye on the database server very closely. If that crashes in a few days as well, then it is more likely a software problem somewhere.
I will have to check the RedHat site tomorrow for an updated kernel and see if that fixes the problem... I have a gut feeling that it has something to do with SMP and RedHat patched kernels. I may have to go and do some thorough reconfigurations and upgrading on my network to get the latest kernel running on all machines. In fact, I'll probably install just the clean source kernel on all the machines, without all the gimmicky patches... It is probably also a good idea to upgrade the BIOS on all machines that need updates...
I guess some 'maintenance' was due anyway...

20/12/2001
- Another server crash today! This is getting ridiculous...
- Right. Just upgraded the kernel. As clean as it can be, only the RAID patches have been applied. Let's see if that improves the stability... It doesn't seem to have removed the double-triggering cron problem...

21/12/2001
- No server crashes yet since the upgrade. The upgrade seems to have solved the 'unexpected IRQ vector [???] on CPU#[?]!' error.
- Doctored the wrapper execution shell script to compensate for lack of functionality to run things in intervals shorter than one minute. Since it looks like overflows are still quite common on some of the 'small buffer' source sites, I have turned the update interval down to 30 seconds. This should work just fine, as the server overloads will still be prevented by the locking mechanisms implemented earlier. So, in the worst case, this shouldn't hurt anything. Ultimately, the number of updates per minute are limited by hardware and bandwidth, and the software is now self-limiting so it will not die horribly when it runs out of resources.
- D'oh! It looks like the sites that are overflowing actually take so long to check that the faster updates aren't actually helping. Instead of just overflowing, they lock and overflow, which isn't very useful. Switched off the 30 second updates. Back up to 1 minute. Umpf...
- Fixed a fairly nasty bug in the JobSearch retrieval module that caused some jobs to not get inserted.
Fixed a similar bug in UMTSWork and REx retrieval modules.
- Made a little program to process the logs. This should help identify how often overflows occur, and on what sources. Since running updates twice as often doesn't do the trick of compensating, I will probably extend the retrieval modules that need it to dig deeper into the source when overflows are detected, thus removing any possibility of an overflow occurring.
- The log processing program has revealed that, rather worryingly, overflows sometimes occur even on JobServe (250) and GisAJob (100). It appears that this sort of blocking is occurring as a consequence of locking sometimes causing every other update to be skipped. Just re-tuned the locking to see if the system can cope with the one-minute update load, if there is no interference by the multiple triggering (which locking removes anyway). There also appear to be occurrences of cron updates triggering randomly between the minute boundaries. Locking should take care of that as well.
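
As an illustration of the kind of log processing mentioned above, here is a small Python sketch. The log format (one line per update: timestamp, source, number of new jobs) is purely hypothetical, as is the function name count_overflows. It also assumes that the figures quoted above (250 and 100) are the number of jobs each site lists per page, and that an update in which every listed job was new counts as an overflow.

```python
from collections import Counter

# Page limits as quoted in the entry above (assumed to be listing sizes).
PAGE_LIMITS = {"JobServe": 250, "GisAJob": 100}


def count_overflows(log_lines, limits=PAGE_LIMITS):
    """Count, per source, the updates in which every listed job was new."""
    overflows = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore lines that are not in the assumed format
        _timestamp, source, new_jobs = parts
        if source not in limits or not new_jobs.isdigit():
            continue
        if int(new_jobs) >= limits[source]:
            overflows[source] += 1  # the page was full of unseen jobs
    return overflows
```

A full page of unseen jobs suggests that some older jobs had already scrolled off before the update ran, which is the situation the drill-down idea is meant to cover.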

24/12/2001
- Just upgraded MySQL to 4.0.1. The performance improvements are negligible, which is quite unfortunate. There is a new feature, however, that allows query caching. This compensates very well for accidental double clicking and such. So, it sometimes helps in reducing the server load. Doesn't do much else, though...
- The servers have been up to 180 MHz for the last four days and still no random crashes. It looks like using a clean 'vanilla' kernel solved the problem. Obviously some of RedHat's kernel patches broke things on this particular system.
- Pre-insertion logging facility has been added. This should greatly reduce the number of false overflow alarms, and false inserts (probably all of them). It should also quite significantly improve the performance, because the selects on this fast lookup table are much faster (the table is much smaller, and all fields are fixed length). Updates are now down to under 30 seconds!
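
A rough Python sketch of the pre-insertion lookup idea follows. It uses SQLite from the standard library as a stand-in for MySQL, and the table name seen, its columns, and the function names are assumptions made for illustration. Note that SQLite's TEXT columns are not fixed length, so this shows only the lookup logic, not the speed benefit described above.

```python
import sqlite3


def make_seen_table(conn):
    # Small lookup table: only the source name and the remote job ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "source TEXT NOT NULL, remote_id TEXT NOT NULL, "
        "PRIMARY KEY (source, remote_id))"
    )


def filter_new(conn, source, remote_ids):
    """Return the IDs that have not been seen before, recording them as seen."""
    fresh = []
    for rid in remote_ids:
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (source, remote_id) VALUES (?, ?)",
            (source, rid),
        )
        if cur.rowcount == 1:  # a row was inserted, so this ID is new
            fresh.append(rid)
    conn.commit()
    return fresh
```

Only the IDs returned by filter_new would then go through the full (and slower) insert into the main jobs table.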

01/01/2002
- Through some clever use of the Perl "eval" function, the main update function code has been reduced to a tiny fraction of its previous size. It now all fits on a single screen! :-)
- CVStore drill-down implemented.
- JobBoard drill-down implemented.
- TechnoJobs drill-down implemented.
- ********* drill-down implemented.

02/01/2002
- PlanetRecruit drill-down implemented.
- ITJobChange drill-down implemented.
- ComputerStaff drill-down implemented.
- ITJobLink drill-down implemented. This one was fairly nasty... So far this method has yielded some fairly spectacular results. Not only has the catchment been improved to perfection, but the increase in throughput indicates that the previous losses due to overruns were far more serious than expected. More importantly, this new feature has just opened the way to a new application. Should this site ever be forced to shut down due to objections from other search engines, these modules will be able to be re-used in a standalone user-downloadable application à la Copernic meta searcher that performs on-demand cross-engine searches. Such an application is likely to get developed anyway once this web site has been completed. :-)
- ITJobs4 drill-down implemented. I've just realized that they actually keep the SQL statement in the form on the HTML page! Oh, dear... Made me laugh, anyway.
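
The drill-down idea can be sketched as below. This is an illustration in Python, not the actual module code: the callbacks fetch_page and is_known, the max_pages cap, and the assumption that a site lists its newest jobs first are all hypothetical.

```python
def drill_down(fetch_page, is_known, max_pages=10):
    """Collect new job IDs page by page until an already known job turns up."""
    new_jobs = []
    for page in range(1, max_pages + 1):
        jobs = fetch_page(page)  # list of job IDs on that page, newest first
        if not jobs:
            break  # ran out of listings
        for job_id in jobs:
            if is_known(job_id):
                return new_jobs  # reached what the previous update retrieved
            new_jobs.append(job_id)
    return new_jobs
```

When no overflow has occurred, a known job appears on the first page and only one page is fetched, so the extra depth costs nothing in the common case.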

03/01/2002
- Fixed a bug in the ITJobLink retrieval module. I hadn't noticed this before, but the jobs there are posted in the ascending date order. This means that the first page contains the oldest jobs, and the retrieval has to be done backwards. Unfortunately, this means that the retrieval is much more inefficient, but it cannot be helped.
- Unfortunately, because ITJobLink is displayed backwards, there is no point in updating it at every run - it is just a waste of resources. Instead, the ITJobLink update run has been moved to the Daily script that runs just after midnight. This way, all jobs will still be caught, but the delay on them can be up to 24 hours. Sorry, but there's not much more I can do while keeping things efficient. The server is being pushed to the limit anyway... The upshot is that a significant amount of bandwidth will be saved, memory consumption reduced, and updates speeded up.

04/01/2002
- ITPaths drill-down implemented.
- TechnoJobs have changed the way their web site works in a few details. Email address is no longer displayed, presumably in an attempt to stop page ripping. [ahem] They now use a redirector script that is supposed to just fire up a mail window. This probably works for IE users on Windows, but falls flat on its face on most other things.
Either way, this has been compensated for, and the email addresses are again being extracted from the TechnoJobs site.
- Changed the download scheduling a bit. Now the scheduling runs at 1 minute intervals during the 02:00-03:00 period. The reason for this is that there seems to be a huge surge in job uploads to GisAJob during that time, especially just after 02:00. This is the only way to compensate for overflows like these on GisAJob (and JobServe, 4Weeks, and JobFizz). According to the numbers I'm looking at, this surge may well have been causing a loss of around 1000 jobs every day, from GisAJob, that weren't being retrieved.

06/01/2002
- REx drill-down implemented.
- UMTSWork drill-down implemented.
- Unfortunately, it turns out that both REx and UMTSWork implement things in the same broken way in which ITJobLink does. In fact, they are even worse - they limit the results to 200. This means that daily, it is only possible to get the FIRST 200 posted jobs AT BEST. Still, that is better than the current catchment rate...
These two will therefore also be moved to the daily update script, as there is no point in parsing the same pages over, and over, if there is no chance of catching new data...
Either way, this finishes the drill-down feature implementation. Rather convenient, too, because I start a new job tomorrow, and I may be unable to do any more work on this site for a while.

07/01/2002
- Applied a derivation of the depth retrieval method to the 4Weeks module. Now it retrieves the last 25 jobs. This should make the number of retrieved jobs skyrocket. The current numbers imply that 4Weeks is a very good source, content wise.
- Modified unique indexing mechanism. There is now only one unique index on the Jobs table. For the sites where the Remote IDs don't apply, they are achieved by concatenating the posted date, the reference number and the agency name. This replaces the functionality of the previous second unique index. The advantage is (hopefully) that the indices are smaller, and hence faster.
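
One way to picture the single-index scheme is the Python sketch below. The function name, the parameter names, and the "|" separator are assumptions; the notes above only say that the posted date, the reference number and the agency name are concatenated when a site has no usable Remote ID.

```python
def unique_key(remote_id=None, posted="", reference="", agency=""):
    """Build the value stored in the single unique index for a job."""
    if remote_id:
        return str(remote_id)  # the site supplies its own ID: use it as-is
    # No usable remote ID: synthesise one from date, reference and agency.
    return "|".join((posted, reference, agency))
```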

16/01/2002
- I've been lazy with updating this page over the last week. The user account maintenance section is now complete. Some minor timing changes have been made to make things run a bit more smoothly during peak hours. All functions except those directly responsible for Page/State handling have been moved from the main JobsMeta library into an Auxiliary library. This should help keep things a bit more maintainable.

17/01/2002
- Finally, a breakthrough! By making use of the "USE INDEX" clause for explicitly specifying indices to be used by SELECT queries, the searching performance has been improved by approximately 10 times! Yes, that is 10 TIMES in the 0-day case. The improvements reduce linearly as the date range is increased, but in the general 0/1-day case, the improvement is rather spectacular! The query time is now in the sub 1 second range! The only explanation I can think of for this is that the FULL TEXT is performed indexed regardless of other indices used. This is a bit unintuitive, because MySQL is supposed to only use one index per table per query. This appears not to include the FULL TEXT indices. By pre-reducing the data set through the Retrieved and Type fields, the usually slow FULL TEXT now only needs to look through very few records, thus yielding a massive speed improvement. :-) This has probably been the biggest stumbling block so far, as far as the performance is concerned.

18/01/2002
- Did some additional performance testing on the new indexing method. The old FULLTEXT only search has a flat query time of around 10 seconds. This makes it the fastest search for 2+ days worth of data. At 2 days, the score is tied, so FTS is used by preference.
However, the search by Retrieved field is drastically faster in the 0/1-day case. The measured performance is at 1-2 seconds for the 0-day case, and 4-8 seconds in the 1-day case, depending on the other options specified.
- Oh, the annoyance... It looks like there is a bug in MySQL that makes the new and improved indexing I tried to implement fail. It is faster, but unfortunately, the results it returns are wrong. Oh, well... Back to the old and slow way of doing things. At least it is reliable...

20/01/2002
- Yes, it has finally happened! The bulk application is now partially implemented! The bulk application works for all the jobs that have the application email address listed. There are some job sites, such as PlanetRecruit, JobBoard and CVStore that do not list an email address, but instead provide an on-line application form. These forms are not supported at the moment. The support for use of those forms for automated application is pending, and will be implemented shortly.

21/01/2002
- Replaced the home-grown binaries for MySQL with the ones from the MySQL web site, and installed the RPMs. Hopefully, this will solve the problem of random server crashes that I have been experiencing recently...
- D'oh! It looks like I will HAVE to compile the package myself after all. I need to modify the source to get some relevant key words out of the stop-word list... Well, last time it was built with fairly extreme optimisation parameters, such as -O6 and a few other things. Maybe it will be sorted out now. Also, this time I am building it with the RedHat GCC-2.96. This could allegedly cause problems, but since the PGCC-2.95.2 compiled version seemed to have problems, the options are running short... I've taken the opportunity to strip out the unnecessary default functionality.

23/01/2002
- Random crashes are starting to plague the system again. I am seriously starting to suspect that I have a pile of dodgy hardware. I'm almost prepared to point the finger at the old GA-586-DX boards. Ah, whatever. Will get a new server for the whole system anyway.
- Added a new statistics page that breaks down the job throughput per hour. This should provide good, reliable information about peak periods, and help tune up the cron job to maximise the catchment rate, and at the same time minimize the bandwidth requirements where possible.

24/01/2002
- It is still unclear as to what may be causing these random crashes. So far, they have been occurring within 24 hours of a reboot, for the last several days. That is, needless to say, very, very wrong, especially since nothing on the system has changed. I have set up up2date on the system, and updated to the latest versions of all the packages and rebooted the servers, but I somehow doubt that will fix things. If this doesn't work out, then I am out of ideas. The one last thing I could attempt is upgrading to RedHat 7.2. This may solve some of the problem, due to the newer v2.4 kernel series, but I was really hoping to put off upgrading to v2.4 for quite a while, until the development on the next production release starts. Otherwise, I'm likely to face constant upgrading. For a long time...

26/01/2002
- Added a CWJobs retrieval module.
- Griffin Internet have started doing a 1 month minimum length contract on ADSL, so the upgrade to a properly hosted web site is fairly imminent. I will order the installation on Monday.

29/01/2002
- The solution to duplicate adverts has proven to be much simpler than I thought. Implementing a unique index over the Agency name and the first 255 characters of the job description has yielded trimming of about 20% of the jobs in the database. This is a good thing in many ways.
1) It gives a much fairer picture of how many jobs specific agencies are posting.
2) It gives a much more accurate picture of how many jobs are on the market at the moment, thus more accurately indicating the state of the industry.
3) It removes the very irritating situations where the same job appears on the same page half a dozen times, just because the agency has posted it to multiple job sites. (This is the primary reason the feature was implemented - to increase the signal / noise ratio in the job adverts.)
4) It reduces the number of records in the database without reducing the number of actual jobs. This means that it puts much less strain on the system, and makes the searches faster.
5) Only the first of the duplicate jobs gets posted. This means that it will get listed on the site that displays it first, thus favoring faster sites in the statistics view.
- The re-implemented Agency name filtering (strips off the trailing ltd/limited/plc) should provide a further reduction in duplication, but it will take a few days for the system to start being effective.
- The object-oriented porting is going quite nicely, even though I have encountered a problem. For some obscure reason, the multiple inheriting classes clobber each other at the end program level. I'm a bit stumped as to why, but I've worked around it by importing classes after forking the threads. It would be nice to avoid that, but it will do for now, until I can figure out what exactly is going wrong...
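
The duplicate filtering described earlier in this entry (a unique key over the agency name and the first 255 characters of the description, plus stripping of a trailing ltd/limited/plc) can be illustrated with the Python sketch below. It does the filtering in memory rather than through a database index, and the function names and the dictionary layout of a job are assumptions.

```python
import re

# Trailing "ltd", "limited" or "plc", with optional punctuation around it.
_SUFFIX = re.compile(r"[\s,.]*\b(ltd|limited|plc)\.?\s*$", re.IGNORECASE)


def normalise_agency(name):
    """Strip trailing ltd/limited/plc so variants of one agency name match."""
    name = name.strip()
    while True:
        stripped = _SUFFIX.sub("", name)
        if stripped == name:
            return name
        name = stripped


def dedup_key(agency, description):
    # Agency name plus the first 255 characters of the job description.
    return (normalise_agency(agency), description[:255])


def drop_duplicates(jobs):
    """Keep only the first job seen for each key, as a unique index would."""
    seen = set()
    kept = []
    for job in jobs:
        key = dedup_key(job["agency"], job["description"])
        if key not in seen:
            seen.add(key)
            kept.append(job)
    return kept
```

As with the unique index, the first copy of a job wins, so the site that displays an advert first is the one it gets attributed to.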

04/02/2002
- Finally, the JobSearch retrieval module has been converted to use the LWP library instead of the ugly Lynx shell request. The problem was that the JobSearch site used non-standard field separators that were getting incorrectly converted for the HTTP GET request. The object oriented port is over 2/3 complete. Once this is complete, I will finish off the bulk application functionality to include the web-form based application for cases where it is the only option.

05/02/2002
- Made further enhancements that will aid the filtering of duplicate jobs.
- Overloaded the constructors to avoid the 'use' clauses inside the fork blocks. It is probably more elegant than the old workaround...
- The filtering of duplicates is working quite well. It now appears that the total number of jobs posted may have contained as much as 30%+ of duplication. Of course, the reliable measure of this will not be attainable for another week, but preliminary results are very promising.

06/02/2002
- Implemented a further feature to prevent server overloads. Now new instances of the update program will not be executed if there are already more than 20 running. This should remove any possibility of server overloads.

10/02/2002
- After suffering a complete RAID stripe failure yesterday, and rebuilding the complete database server, it looks like I have finally found a reliable workaround for the stability problems that have plagued the system for months. After searching through countless archives, it appears that there is a reasonably well known bug in the GA586DX motherboards. The APIC is to blame. Disabling it fixes the problem. So far, so good. The retrieval server has already been up for longer than usual, and there are no odd occurrences of unkillable processes. This just may have fixed things.

13/02/2002
- Added a config file to allow easier configuration of some plausibly tweakable parameters.
- The stability of the servers has so far been very good. It looks like disabling some of the SMP APIC functionality did the trick.

23/02/2002
- Updated the WorkThing retrieval module. They changed the page layout on the site, so the module broke. All is working again.
- Modified the way forking and locking works. Now the locking checks for the number of update processes/threads already running. If there are too many (configurable), it will not kill the old process lock and re-run. This provides a rather effective throttling feature. The main update program has now been modified to do locking and forking separately. This has the advantage that the locking code doesn't count the threads already running from the current update run, so the throttle process limit is much more easily and precisely controlled. This should reduce server overloads and at the same time make the program more configurable. There is now an option for this in the config file.
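
The combined locking and throttling decision might be pictured as the Python sketch below. It is an interpretation, not the real code: the function name decide, its return values, and the way the number of running processes is passed in are assumptions. The 2-minute staleness and the limit of 20 running processes are the figures mentioned in earlier entries.

```python
def decide(lock_age, running, stale_after=120, max_running=20):
    """Return 'run', 'skip' or 'throttle' for one source.

    lock_age is the age of the source's lock in seconds, or None if there
    is no lock; running is the number of update processes already running,
    counted before the current run forks anything.
    """
    if lock_age is not None and lock_age <= stale_after:
        return "skip"      # fresh lock: an update of this source is in progress
    if running > max_running:
        return "throttle"  # server is busy: leave any stale lock alone
    return "run"           # no lock, or a stale one and spare capacity
```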

02/03/2002
- Overhauled the authentication and session verification code. It is now much more compact and maintainable. All the Session and Cookie management is now performed in a single, compact function. The modification will make the application smaller and faster.
- The feature to save user-defined queries has been implemented. This pretty much finishes off the feature set. The remainder can be implemented as and when. Probably after the ADSL is installed and the domain name is changed to the real one.

05/03/2002
- Wehey! First user feedback today. Updated the filtering and improved the classification of Contract/Permanent jobs for the jobs that were not being classified ('Either' type).
- Implemented the unique reference number filtering.

23/08/2002
- By popular request, a feature to allow location searching has now been implemented. It is not as advanced as I had hoped for it to be, unfortunately. It has been implemented by adding the Location field into the Full Text Index for searching. Whether this solution is any worse than what I had planned is not really very clear, but even if it is not as good, a rudimentary ability to include location searching now exists.
- Added a query example to the search page.
- Upgraded MySQL to 4.0.2. This allows for a few performance increases. The core of the DB isn't proving to be any faster, but there are a few bug fixes that now allow for some hand-optimisations to work in a few places where they used to completely break the queries.
Please forgive any instability you encounter while the new installation stabilises a bit. Some of the new optimisations may take a while to test thoroughly, although they seem to work just fine. Please report any unexpected problems.

31/08/2002
- Added GoJobSite retrieval module. This was a bit awkward because the drill-list can only be accessed if cookies are enabled. To keep things generic, the Generic module was upgraded to pass all cookies the site sets when they are required.

02/09/2002
- Re-wrote the 4Weeks retrieval module. It now runs as a single daily update. This should save both memory and bandwidth. I only realised this today, but it can be done in EXACTLY the same way the JobFizz retrieval works.
- Typical. Six months of nothing and then two MySQL updates get released in a matter of weeks. Will have to upgrade the server to MySQL 4.0.3... Maybe not today, though...

20/09/2002
- Suffered disk failure on the web server. One SCSI disk failed with a bad block, and took down the RAID stripe and some data with it. All back now. All the disk seemed to need is a low-level re-formatting, and a media scan to re-map the defects.
- Added a feature that highlights the searched terms in the result set. This should make it more obvious how suitable each job is.

25/09/2002
- Added another highlighting feature. Since MySQL now supports term* type matches, this functionality has been added to the highlighting code.
- Implemented a small optimisation in the GisAJob retrieval module. Before it retrieved all jobs, regardless of the "sector". Now, it only retrieves the IT jobs. This should make a small saving in bandwidth.
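
The search term highlighting, including the term* prefix matches, can be sketched in Python as below. This is an illustration under assumptions, not the site's code: the function name, the choice of the b tag, and the HTML escaping are made up for the example, and it is only meant for plain alphanumeric search terms.

```python
import html
import re


def highlight(text, terms, tag="b"):
    """Wrap whole-word matches of the search terms in an HTML tag."""
    patterns = []
    for term in terms:
        prefix = term.rstrip("*")
        if not prefix:
            continue  # ignore empty terms and a bare "*"
        pattern = re.escape(prefix)
        if term.endswith("*"):
            pattern += r"\w*"  # term* matches any word starting with the prefix
        patterns.append(pattern)
    if not patterns:
        return html.escape(text)
    regex = re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)
    out, pos = [], 0
    for m in regex.finditer(text):
        out.append(html.escape(text[pos:m.start()]))  # text between matches
        out.append(f"<{tag}>{html.escape(m.group(0))}</{tag}>")
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Escaping the pieces between matches, rather than the whole text up front, avoids accidentally highlighting inside an entity such as &amp;amp;.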

26/09/2002
- Made an optimisation in the handling of the multi-page results. All the records, regardless of the page number, used to be retrieved from the database. This was inefficient: it ate up a lot of memory in the server process, and all the records had to be parsed for the recently implemented highlighting. Now the SQL queries have been modified to use the LIMIT/OFFSET clause to ensure that only the records for the particular page are returned.
- Fixed an obscure potential security hazard.
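
The LIMIT/OFFSET paging described in this entry works roughly as in the Python sketch below. SQLite stands in for MySQL here, and the jobs table with its id and title columns is a made-up schema for the example; the default of 10 jobs per page is the figure mentioned near the top of this page.

```python
import sqlite3


def page_clause(page, per_page=10):
    """Return (limit, offset) for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return per_page, (page - 1) * per_page


def fetch_page(conn, page, per_page=10):
    """Fetch only the rows that belong on the requested page."""
    limit, offset = page_clause(page, per_page)
    cur = conn.execute(
        "SELECT id, title FROM jobs ORDER BY id DESC LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cur.fetchall()
```

A stable ORDER BY matters here: without it the database is free to return rows in a different order for each page request, and jobs could be repeated or skipped between pages.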

27/09/2002
- Added TheITJobBoard retrieval module.

02/10/2002
- Upgraded MySQL to v4.0.4. Amazing. 6 months with no updates, and then v4.0.2, v4.0.3 and v4.0.4 all come out within a month. I guess this is a good thing...

TODO List
- Process locking to prevent server overload and improve catchment rates. Done!
- Sub-1-minute updates. Done!
- Some kind of 'RemoteID' quick pre-insertion, so the follow-up updates don't duplicate the work. This should be reasonably easy to implement soon, and it should improve performance and save bandwidth quite significantly, especially during the peak periods. Done!
- Selective depth search on retrieval. Done!
- User account maintenance. Done!
- Update everything into a more object oriented design. This should improve readability and maintainability. It will also massively reduce the code size, due to inheritance - there is a LOT of code duplication between most of the retrieval modules... Done!
- Break configurable things out of the code and into a human-readable config file. Done!
- Feature to save user-defined searches. Done!
- Duplicate advert nullification. Done!
- Bulk application. In progress!
e-mail application works. Need Separate application routines for each web-form submission...
- Location searching. Done!
- Sign-up e-mail verification. Pending...
- Track keeping of most popular keywords. Pending...
- Statistics about jobs that members apply for. This will later enable the creation of an AI that will recognize what jobs a member might find interesting, and alert the member of "highly matching" jobs. Pending...
- Instant e-mail/SMS verification for closely matched jobs. Pending...
- More retrieval modules. ATSCOJobs and TheITJobBoard. TheITJobBoard Done...
- Re-skinning to something that looks a bit nicer. Pending...
- Highlighting of search terms in the text body will be moved to client-side, using JavaScript. This will save the server from doing a fair amount of work, thus making it able to serve more requests. It should all still work on all commonly used browsers. Pending...



