|
Shtuff - Here are some of the lovely, lovely people and companies behind
JobsMeta. Go visit their sites, and stuff.
|
Progress Report:
20/11/2001:
The searching mechanism is fully functional. The database has been
ported from PostgreSQL to MySQL, and the transition required fairly
minor code modifications. The main reason for this was speed - on
a test I conducted, specifically for this application, MySQL
was about twice as fast on the same system. Other databases which
have been considered and tested with the system were mini-SQL and
InterBase.
mini-SQL died horribly after blowing 1 GB of RAM and went swapping
in order to do a join between two 30 MB tables and perform a full
text search. It also took a very long time before it died.
InterBase worked fine, and it was reliable, but some of its
features were a bit odd. Most importantly, however, it was much
slower than PostgreSQL, even after it was tuned according to the
documented parameters (it was given enough memory to keep the whole
database in cache, and to still have a disk copy of database on OS
cache). It was hence rejected as a viable option.
I have also considered downloading and trying the free trial
versions of Oracle and SAP, but by the end of the exercise of
installing and comparing the performance of four databases, and
porting my application four times, I really didn't have enough
energy left to also go and play with something as heavy as Oracle
and SAP. Maybe another time, in another application.
Currently polled job sites are:
Gis-A-Job, Job Board, JobServe, JobFizz, Job Search, TechnoJobs,
Planet Recruit and Work Thing. That just about covers all of the
biggest ones. There are still one or two left, and I will add those
in due time.
The paging has been implemented. Currently, there are 10 jobs per
page, but this is easily configurable. This will be adjusted
appropriately in the near future. I will probably need to change
the layout of the displayed jobs, as currently a lot of space is
wasted - must have a think about what the best way to do this is.
Jobs are kept for 7 days - after that they are deleted. All the
retrieval and maintenance tasks have been fully automated. Another
advantage of MySQL over PostgreSQL has been that the daily cleanup
of the database is much faster. Not that it matters at midnight.
The number of jobs seems to hover between 40K and 50K, roughly
averaging about 10K per working day.
23/11/2001:
JobsMeta has been moved to a faster server (finally!). It is no longer running
on an old Cyrix 6x86 133+ (which has been recommissioned for another job)!
It has been moved to a newer server (although it's nothing to really
write home about). It is now on a Dual Pentium MMX 200 MHz, with
128 MB of RAM, and a 6 GB 3-disk SCSI RAID-5. It is only Linux software
RAID, but it does the trick nicely. Tested by pulling one of the disks
out. :-)
I guess I can sleep a little easier at night, without having to worry
whether I have forgotten to change the backup tape. :-)
Tomorrow I'll be moving the JobsMeta database to a new server as well.
It is going to be a downgrade, though.
It will be moving to a machine of exactly the same spec (identical, in
fact) to the new web server. Currently it is on a Dual Pentium III 1 GHz with
1 GB of RAM, but the queries are so fast anyway that it shouldn't make a
noticeable amount of difference, and I need the bigger server for something
else...
24/11/2001
The JobsMeta database has been moved to the new server. The performance
is still more than reasonable, although the difference is fairly obvious. It is
comparable to the old setup with a 110 MHz web server and 2 GHz database
server, which indicates that the web server faces heavier load, and its
performance is hence more closely related to the performance of the web
application. Still, this configuration is very similar in speed to what the
old configuration was with PostgreSQL.
The MySQL server has been recompiled, and now it can index the term "c",
which is quite clearly useful when looking for a C programming job. :^)
Changed the paging setup to do 25 jobs per page.
Re-wrote all the data extraction modules. Now the download process takes
into account the remote ID of the Job being downloaded. If the job has
already been entered into the local database on a previous run, the page
containing the details of it will not be downloaded again. This should
cut bandwidth use by a factor of somewhere between 10 and 250 during the less active
periods. Obviously, when there are lots of fresh adverts being posted,
this won't provide much benefit. The downside is that the database gets
a SELECT query for every Job to see if it is already there, but this
is OK since SELECTs are cheaper than INSERTs, downloads are much
slower than local database access, and parsing the pages takes up
most of the time, so any saving to be made there is welcome. Overall
this solution should be much faster. Hopefully, I will finally be able
to turn the update cycle timing down to 5 minutes instead of 10 minutes.
Currently it's down to 1 minute, but the only way to make sure will be
to wait until next week and see what the benefits are during the heavy
periods. In theory, the more often jobs are updated, the bigger the
saving, because the more duplication there is likely to be in the latest
job lists. Eventually, we should reach a point where the bandwidth used
for downloads will go down to a constant.
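The check described above can be sketched roughly as follows. This is a Python illustration and not the actual Perl extraction code; the `jobs` table, its columns and the `unseen_remote_ids` helper are hypothetical names, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def unseen_remote_ids(conn, source, remote_ids):
    """Return the remote IDs from one listing page that are not stored yet."""
    fresh = []
    for remote_id in remote_ids:
        # One cheap SELECT per job, as described in the entry above.
        row = conn.execute(
            "SELECT 1 FROM jobs WHERE source = ? AND remote_id = ?",
            (source, remote_id),
        ).fetchone()
        if row is None:
            fresh.append(remote_id)
    return fresh


# Hypothetical schema: the real table layout is not given in this report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (source TEXT, remote_id TEXT, details TEXT, "
    "UNIQUE (source, remote_id))"
)
conn.execute("INSERT INTO jobs VALUES ('JobServe', '1001', 'already stored')")

# Only the details pages for these IDs would need to be downloaded.
to_fetch = unseen_remote_ids(conn, "JobServe", ["1001", "1002", "1003"])
print(to_fetch)  # ['1002', '1003']
```

The key is the pair (source, remote ID) rather than the remote ID alone, on the assumption that two job sites may hand out the same ID.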
JobFizz couldn't be optimized in this way because all their latest
jobs are on a single page with no separate 'details' page, so the
whole lot has to be downloaded anyway.
Added the icons for the software used to create this web site, and
the general methodologies adhered to during the design.
25/11/2001
The data acquisition modules have been enhanced further, in order to
save bandwidth. It would appear that as of recently, the perl libwww
supports 'gzip' and 'deflate' encoded content. All the data acquisition
modules have been modified to send the 'Accept-Encoding: gzip, deflate'
headers. This means that provided the servers on the other end support
sending of compressed data, the pages will be downloaded compressed,
thus providing a further saving in bandwidth. This time, the saving
should be seen primarily during the busy periods, so coupled with
yesterday's modifications, this should provide a sizeable increase in
throughput.
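As a rough illustration of the same idea outside Perl, here is a Python sketch using only the standard library. The helper names `build_request` and `decode_body` are made up for this example; the modules described above rely on libwww to do the equivalent work.

```python
import gzip
import urllib.request
import zlib


def build_request(url):
    """Build a request that advertises support for compressed responses."""
    return urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"}
    )


def decode_body(body, content_encoding):
    """Undo the Content-Encoding the server reported, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Standard form: a zlib-wrapped deflate stream.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding header: the server ignored the request for compression.
    return body
```

A caller would pass the response body together with the value of its `Content-Encoding` header. Servers that do not support compression simply return the page uncompressed, so nothing breaks on sites that ignore the header.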
mod_gzip has been installed on the Apache web server since this
project began, so the speed perceived by the users will not be
directly affected by this modification. The perceivable performance
improvement will come from more network bandwidth and CPU time
being available for tasks that are actually useful, such as performing
queries, generating requested pages, and generally serving the user's
requests.
Added performance stats to the query page. Now it displays the time
that it took to execute certain parts of the request. The results are
somewhat disappointing. The query parsing into SQL and processing of
the data is pretty fast, but the query times are fairly poor. It seems
to vary between around 0.5 and 5 seconds, depending on the number of
search terms. I suspect that MySQL will improve in performance in the
next few versions, as there is some optimization to be made in the
Full Text Search mechanism. But then again, moving everything to
a dual 2 GHz server will probably solve this problem, and move the
query times down to 0.05 - 0.5 second range. Right now the performance
is such that the system will peak at handling around 100-150 hits/minute
before the point of goodput collapse. The chances are that the server
will be upgraded long before even half of this throughput is reached,
especially since a typical query will execute in well under a second.
It's only the 'stress' queries I use for testing that take a while.
The more worrying aspect is the bandwidth consumption. A full page
with 25 results averages around 100-110 KB. That means that the uplink
can transfer one page every 6 seconds or so, provided the user
is on at least dual ISDN (128 Kb/s). This limits the throughput to
a maximum of around 10 hits/minute, which is most definitely NOT
good. The connection will saturate WAY before the current server
will max out. But, things aren't actually that bad. mod_gzip comes
to the rescue here. An average page has been tested to compress
down to less than 10% of its size (down to about 9 KB). This
increases the throughput we can handle approximately 10 fold, up
to about 100 hits/minute. Seeing as this is just before the point where
the database server will peak, no upgrades will be necessary until
the bandwidth is increased first.
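The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the whole 128 Kb/s is available for page data and ignores protocol overhead, and `hits_per_minute` is a name made up for the example.

```python
def hits_per_minute(link_kbit_per_s, page_kbytes):
    """Pages per minute that fit down a link of the given speed."""
    kbytes_per_s = link_kbit_per_s / 8          # 128 Kb/s is 16 KB/s
    seconds_per_page = page_kbytes / kbytes_per_s
    return 60 / seconds_per_page


uncompressed = hits_per_minute(128, 100)  # 100 KB page, no mod_gzip
compressed = hits_per_minute(128, 9)      # about 9 KB after mod_gzip
print(round(uncompressed, 1), round(compressed, 1))  # 9.6 106.7
```

That matches the rough figures quoted above: about 10 hits/minute uncompressed and about 100 hits/minute with compression.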
On the plus side, the actual parsing and processing times are VERY
low, so only the database server REALLY needs to be upgraded.
Changed the icons at the bottom of this page to links to the
relevant web sites.
|
26/11/2001
|
|
-
|
The updated, optimized retrieval routines are
working VERY nicely. Even during the normal daily
peak times, the downloads seem to execute
without fail in under 50 seconds. I have turned
the update frequency down to 3 minutes. This
should provide a MUCH better catchment of
jobs from lists that only include as few as
the latest 10 jobs, as they overflow and roll
over very quickly. The added bonus in all
this is that it allows a much fairer comparison
in the number of jobs posted to different job
sites. It would appear that JobServe is not
quite as dominant compared to the other ones as
it previously appeared. It looks like all the
other sites that are being crawled (7 of them,
at the moment) do actually match the number of
jobs posted on JobServe. This means that
JobServe is 'only' as good as most of the others
put together. Before it appeared as if it was
twice as good as all the others put together
(2/3 - 1/3 split).
|
|
-
|
Fixed a bug in how the remote IDs were uniquely
indexed. This could have potentially prevented
some jobs from being downloaded.
|
|
-
|
Enhanced the Agency download module to use the
same enhancements (compressed downloads,
pre-fetch pre-checking) as the Jobs downloads.
This should speed things up on that front, but
this was really just for the sake of completeness,
as the Agency updates run only once per day
anyway.
|
|
-
|
Compiled and installed PGCC-2.95.2.1
(GCC-2.95.2.1 + PGCC patches). Re-compiled
MySQL with maximum optimizations (-O6 and a few
other things). Stripped the MySQL daemon (it
shrunk down to half of its size). The MMX
optimizations seem to break things in places so
they have been switched off. The net benefit
measured in heavier queries seems to be in the
region of 1-5%. Not a really astounding
improvement, but an improvement, nevertheless.
Running MySQL server with --skip-safemalloc
(yes, I know that is a stupid idea) actually
makes it slightly slower for some reason...
|
|
27/11/2001
|
|
-
|
Added the Statistics page. Shows the breakdown
of different Sources and different Agencies.
This is done by a live search on the database
and not periodically pre-genned. The Total figures
may not be the same everywhere because the updates
are constantly occurring, so the database can change
between the runs of the queries that generate the
reports. Not much else added today...
|
|
28/11/2001
|
|
-
|
Changed the Statistics page around a bit. It now
includes the rolling stats for the week so far
(whole database), statistics for the current
day (as of midnight) and statistics for the
previous day.
|
|
-
|
Had a brief email exchange with some of the
MySQL development crew. Apparently, the 4.0.1
release will have a re-worked Full Text Search
engine for boolean queries (the search type that
is currently used by
JobsMeta)
which should make it MUCH faster. In the meantime,
I'm looking into increasing the
number of stop words in order to speed up Full
Text Search queries. The downside is that in
order to do this I will have to recompile
MySQL. Not that that is a problem.
|
|
-
|
Just recompiled MySQL after adding a bunch of
frequently occurring stop words. It doesn't seem
to have made any difference... Oh, well. I guess
I'd better wait for MySQL 4.0.1...
|
|
-
|
Added the ********* module to the pool. Now I
have to investigate if there are any other job
boards out there that are worth looking at...
|
|
30/11/2001
|
|
-
|
Added the 4Weeks module to the pool. It doesn't
seem to be doing very much, as it has only
contributed 13 jobs today... Quite poor.
|
|
-
|
Added some Agency filtering to the upload stage.
Some agencies post under several different
variants of their name (e.g. with and without
the 'Ltd.' at the end). The new filter handles
that for some agencies that are contributing the
greatest number of jobs. More will be added as
time goes on.
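A minimal Python sketch of this kind of name clean-up follows. It is an assumption-laden illustration and not the filter actually used: the real filter is described as covering specific agencies, whereas this applies one generic rule, and treating "Limited" the same as "Ltd." is an extra assumption. The agency names are invented.

```python
import re

# Trailing "Ltd", "Ltd." or "Limited", optionally preceded by a comma.
_SUFFIX = re.compile(r"[\s,]+(ltd|limited)\.?$", re.IGNORECASE)


def normalize_agency(name):
    """Collapse whitespace and drop a trailing 'Ltd.' style suffix."""
    cleaned = " ".join(name.split())
    return _SUFFIX.sub("", cleaned)


print(normalize_agency("Acme  Recruitment Ltd."))  # Acme Recruitment
print(normalize_agency("Acme Recruitment"))        # Acme Recruitment
```

Both variants then map to the same agency, so their jobs are counted together.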
|
|
-
|
Spent most of the time yesterday looking for other
job search sites. There are quite a few of them
out there, but most of them are quite severely
lacking in content. A total of 20 new sites have
been identified that might be included at some
point.
|
|
-
|
Another thing I have been thinking about is the
prospect of somehow filtering out the duplicate
messages. This problem occurs due to programs such
as WebSalvo and Conkers that allow agencies to post
jobs to a number of different search sites.
Unfortunately, for
JobsMeta
that means duplication - potentially quite a lot
of it, too. I've been considering ways to filter
out the duplicates, and to prioritize adverts,
for example, by source (which job site they came
from).
|
|
-
|
This prioritization is likely to be initially
performed by the rule of favoring the sites with
most content. This is primarily for the reason
that it is the most quantifiable measurement I
can think of. This method is going to be changed
as and when I get contacted by people running the
job sites (those who appreciate this site will
move higher, thus overwriting the lower ranking
entries from other sites). Of course, certain
sites are likely to get dropped, as and when
those in charge of their management decide that
they don't want people to find jobs more
easily, even if they do get the credit...
|
|
-
|
The secondary optimization method is likely to
be the time when the advert is posted. The later
advert is favoured more.
|
|
-
|
Unfortunately, this duplicate filtering is likely
to be more difficult than it may seem. It is
likely to require huge amounts of processing
time to perform reasonably reliably. Every job
will have to be checked against every other job.
Ideally, this should be done at insertion time,
because that way the number of jobs in the
database would be kept to a minimum, thus
yielding a faster checking time in the average
case. Of course, doing that once per day would
be much faster and more convenient, but most
jobs would lose their "freshness" by then, so
it wouldn't actually help for most users.
|
|
-
|
Currently I'm thinking about just doing straight
word counting. It is a rather crude way of
handling this, but it is also quite effective, and
it should be fairly fast, with some clever use of
Hashes in Perl. The Hashes could even be stored
with the data in the database, thus avoiding the
parsing of the message every time - instead it
could just be loaded pre-parsed.
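The word-counting idea could look something like the following in Python, with `Counter` playing the role of the Perl hash. The similarity score (shared word occurrences divided by total word occurrences) and the JSON storage format are assumptions made for this sketch; nothing has been decided on either yet.

```python
import json
import re
from collections import Counter


def word_counts(text):
    """Count how often each word occurs in an advert."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def count_similarity(a, b):
    """Score from 0.0 (no words in common) to 1.0 (identical counts)."""
    shared = sum((a & b).values())   # per-word minimum of the two counts
    total = sum((a | b).values())    # per-word maximum of the two counts
    return shared / total if total else 0.0


first = word_counts("Perl developer London")
second = word_counts("Perl developer Leeds")
print(count_similarity(first, second))  # 0.5

# The counts can be stored next to the advert and loaded back pre-parsed.
stored = json.dumps(first)
assert Counter(json.loads(stored)) == first
```

Two adverts would be flagged as likely duplicates when the score exceeds some threshold, which would have to be found by experiment.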
|
|
-
|
For jobs that appear to be close, some sort of
secondary matching could be performed. Wording
structure, for example. Use a moving window of
two words and compare them. This should be far
more effective in finding duplicates. It should
not be significantly slower than the first method,
so this may well get implemented as the only
comparison method.
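A moving window of two words might be sketched in Python like this. Comparing the two sets of word pairs by the share of pairs they have in common is an assumption for illustration, as are the function names.

```python
def word_pairs(text):
    """Set of adjacent word pairs, i.e. a moving window of two words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def pair_similarity(a, b):
    """Share of word pairs that the two adverts have in common."""
    pairs_a, pairs_b = word_pairs(a), word_pairs(b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


print(pair_similarity("senior perl developer", "senior perl developer"))  # 1.0
print(pair_similarity("senior perl developer", "developer perl senior"))  # 0.0
```

The second call shows why this is stricter than plain word counting: the same words in a different order share no pairs at all.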
|
|
-
|
It should also be possible to break the problem
down to small pieces, by separating the jobs by
the Agency that posted them. This should
make the comparisons execute much faster, as
there are far fewer records to compare to each
other.
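The saving from splitting by Agency can be illustrated with a short Python sketch. The job IDs and agency names are invented, and `candidate_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(jobs):
    """Yield only the job pairs that were posted by the same agency."""
    by_agency = defaultdict(list)
    for job_id, agency in jobs:
        by_agency[agency].append(job_id)
    for ids in by_agency.values():
        yield from combinations(ids, 2)


jobs = [(1, "Agency A"), (2, "Agency A"), (3, "Agency B"),
        (4, "Agency A"), (5, "Agency B")]
pairs = list(candidate_pairs(jobs))
print(len(pairs))  # 4 comparisons instead of 10 for all pairs of 5 jobs
```

With thousands of jobs spread over many agencies, the number of comparisons drops far more sharply than in this toy case, since the all-pairs count grows with the square of the number of jobs.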
|
|
-
|
I will sort this out at some point, when the
number of duplicates starts getting in the way.
It is likely that I will have to add a few more
crawler modules before that starts happening.
|
|
-
|
Changed the update timing. During the peak hours,
it is set to 1 minute. One hour outside peak hours
it is set to 2 minutes. At all other times, it
is set to 5 minutes. This should alleviate some
of the stress on the servers, and help them get
on with more useful work over night.
|
|
02/12/2001
|
|
-
|
Had an interesting event today - the database
server crashed. Not just the MySQL daemon - the
whole system. Bizarre. Mind you, it is rather
hot in this room with all the computers, and
they are clocked a bit out of spec... The web
servers also had a problem a couple of days ago.
If this happens again in the next week, I may
have to clock them down to 166 MHz...
|
|
-
|
Having said that, the web server went on overload
today with all the data acquisition. That shouldn't
have really happened. Could have been caused
by network congestion... Either way, it doesn't
matter. All these problems will go away when this
is all moved to a production server.
|
|
04/12/2001
|
|
-
|
Argh! The threaded architecture I used for the
updating script has proven to be too much for the
server to handle. Basically, the updater executed
the update for each site in a separate thread. With
lots of CPUs, and lots of memory, this would have
been significantly faster. In the real world of
a limited server, though, 128 MB of RAM and 150 MB
of swap has been blown during the congested
periods. The problem was, as far as I can see,
that during the heavy periods more needs to be
downloaded and extracted. Certain threads took
longer than the update interval to finish, and the
parent thread was waiting, for the purpose of
consistency - I didn't want to disconnect the
parent thread. That means that a parent thread,
nearly the same size as the children, was sitting
waiting for the children, wasting memory.
Eventually, the point of goodput collapse on
memory was reached, and the whole thing just
stopped. Eventually, the watchdog just rebooted
the server, after the CPU load went through the
roof.
|
|
-
|
I have reverted the updater back to a
single-threaded design, but kept the frequency
the same. I am hoping that the simultaneous
instances will overlap and interleave reasonably
nicely, with a minimum of work duplication. The
current execution time is around 2 minutes, but
this shouldn't be a problem, because the
new thread is started every minute, so at least
statistically, the catchment rate should be the
same and duplication should be minimal (in the
long term, anyway).
|
|
-
|
The bonus is that the whole update now requires
about 11 MB of RAM, instead of about 65 MB
(40 MB shared, so really only about 30 MB total).
Needless to say, with 'only' 128 MB of RAM, it
wasn't that difficult to blow things out of the
water. Now that should be much more difficult.
So far it seems to be working quite nicely, but
I won't know for sure how beneficial it is until
the peak 11:30 - 13:00 period tomorrow... Other
pros and cons are that it is now nicer to the
bandwidth, as there is only one download
happening at any one time. The downside is that
this makes it more vulnerable to bandwidth
congestion, as if the current download blocks, it
will block the execution of the entire process,
rather than just the download for the single
web site. More finely grained downloads would
be nice, but I suspect that it will have to wait
until the new server is set up - if the
difference turns out to be significant in the
first place.
|
|
05/12/2001
|
|
-
|
Umpf... Had another server crash. The retrieval
server this time. Slowed it down further, to 180
MHz. Something gives, I just don't know exactly
what... If this fixes it, it is possible that it
is the bus that can't quite handle 66 MHz.
Bizarre, I know, and highly unlikely, but possible.
If this works out, I might try 210 MHz later. The
database server will be re-clocked either when it
crashes next time, or when the retrieval
server clocks a week of up-time - whichever
happens first. I expect the former, but hope for
the latter...
|
|
-
|
Another problem is that the server was running out
of steam at the peak period before this. The
memory problem seems to have been alleviated
somewhat, but I am looking at the diagnostic
display at the moment, and there are 15 copies
of the update application running. This indicates
a very serious backlog, and it is still increasing.
It remains to be seen whether it will clear itself
up on its own, or whether a bigger server will
simply be necessary.
|
|
-
|
Nope, it just broke. The server ground to a halt
and rebooted itself. I have changed the cron
job so it slows down to one run every 2 minutes
in the 12:00 - 13:00 period during working days.
Not an ideal solution as it is likely to damage
the catchment rate, but there doesn't appear to
have been any other way to combat the problem on
the current server... Now that I have been forced
to do this, I think I will go back to the
multi-threaded parallel setup and see if it can
cope now that it is run half as often...
|
|
-
|
The answer appears to be yes, but guess what - I
won't know for sure until tomorrow. I seem to have
found out what one of the contributing factors
was. The JobServe site seems to be either down or
severely overloaded at the moment. I've just
tried accessing it from both an NTL connection
and a U-Net connection, and it just isn't talking
at all. This coupled with the single-threadedness
of the approach I was using was definitely a
contributing factor that halted the execution of
the whole update process - including the processes
that could have been running usefully... This is
another reason why the multi-threaded approach is
so much more desirable - a single site going away
doesn't break the whole thing as easily. True, the
download will eventually time out, but by the time
it does, the memory will get more and more and
more congested until the whole thing stops. It
will be interesting to see when JobServe will come
back to life...
|
|
-
|
Planet Recruit seems to be having problems, too.
It does appear to have returned to life, but
it is utterly grinding to a halt...
|
|
-
|
It also turns out that the memory consumption
difference between the threaded and non-threaded
method isn't as great as it initially appeared.
The difference seems to be at most around 20%.
|
|
-
|
With JobServe down and PlanetRecruit only
responding occasionally, the listings of today's
jobs on
JobsMeta
are likely to suffer... Oh, well, such is life.
PlanetRecruit seems to be bouncing back up...
Slowly...
|
|
-
|
Need new server... Or do I...
|
|
-
|
JobServe has bounced back to life. It's likely to
go on overload soon, as an hour and a half's worth
of jobs at peak time (a few thousand at least)
is going to hit it pretty quickly. It'll be
interesting to see if it falls over again under
the load.
|
|
06/12/2001
|
|
-
|
I have identified the causes of some of the
crashing experienced by the update server. It
would appear that it is related to the network
card. The database server doesn't seem to be
having the same problems and the OS has literally
been cloned (not installed) onto it, so the
two are identical in that regard. The crashes
weren't actually 'crashes' as such. The machine
would just lose network connectivity. I foolishly
failed to check the routing tables before
rebooting the system. The bizarre thing was that
resetting the network interface, and even
reloading the module didn't seem to fix the
problem. Rebooting the machine, however, did.
That pretty much leaves me with the impression
that it is either the routing tables getting
corrupted (less likely), or the network card
breaking (more likely).
|
|
-
|
Either way, I'll replace the network cards and
see if that fixes the problem. Kind of makes
me feel slightly futile when I'll be building
a new server to replace this one anyway...
|
|
07/12/2001
|
|
-
|
It would appear that I have found the cause of
the random network-loss problems
on the update server. Both the web and the
database server have been tuned down to 180
MHz, and it all seems to be working perfectly
now.
|
|
-
|
The update time has been turned down to 2
minutes all round. According to the
statistics I have reviewed, the loss-rate
due to this seems to be minimal. Now there
are only two time bands for updates - the
5 minute band (20:00 - 08:00) and the 2 minute
band (08:00 - 20:00). Same banding has been
applied to weekends. This makes the cron
schedule file much simpler. It should also
yield a considerable saving in bandwidth,
prevent extreme peak hour congestion, and even
allow some breathing space for adding more
plug-in modules for additional search
sites - even though most of them don't seem
to be carrying enough jobs to make their
inclusion worthwhile...
|
|
-
|
Increased the storage time to 2 weeks, mainly
for statistics gathering purposes. Added
statistics readout for last week and current
week, to allow for comparison and general
observation in the movements in the market.
The downside is that the statistics page is
now even slower, and it even times out
sometimes.
|
|
-
|
Got fed up with the statistics page being too
slow. There is far too little (if any) benefit
in keeping things for longer than one week
anyway.
|
|
-
|
The 2 minute update seems to often max out on
downloads from TechnoJobs. Others are doing
reasonably OK. When the new server is in
place, the update will have to be turned back
up to once a minute.
|
|
08/12/2001
|
|
-
|
The weekend updates now run on the same
schedule as the working day ones. The
reason for this is that a certain traffic
pattern has been observed that indicates
catchment failures and overflows when the
2 minute interval is used.
|
|
-
|
Small HTML updates for the statistics page
to make it work better with Internet Explorer.
|
|
-
|
MySQL server has been reconfigured slightly,
with some memory allocation changes and Full
Text Search tweaks. The result is an
approximately 100% improvement in performance
of Full Text Search features, and a noticeable
improvement in performance of the statistics
page.
|
|
-
|
Not a single glitch has occurred since the
servers have been tuned down to 180 MHz. I
think the cause of the problems is becoming
fairly clear. There is the 192 MHz setting
which could still be tried in a final attempt
to get the last bit of performance from the
system, but it is possible that this would
either cause instability again, or actually
yield reduced performance, due to the
bus being slowed down further, down to 55
MHz. Either way, things are working now.
|
|
-
|
With a bit of luck, MySQL v4.0.1 will bring
enough improvement on the Full Text Search
front that the new server will not be necessary.
That would be quite nice...
|
|
-
|
Changed the database layout slightly. The
Posted field is now DATETIME instead of
TIMESTAMP. The Retrieved TIMESTAMP field is
now autogenerated by MySQL at insertion
time, which is exactly what I want anyway.
This has simplified the data uploading code
slightly, and probably gained a tiny bit of
performance.
|
|
11/12/2001
|
|
-
|
Made a small change to the way indexing is
done in the database. It should improve the
performance of the job uploader.
|
|
-
|
Added a download module for the Computer
Staff web site.
|
|
-
|
Added the
Downloads
Page. This is where the full source code for
JobsMeta
can be downloaded. The snapshot is updated by
a daily cron job at midnight. The source code is
released under the GNU General Public Licence.
Not all the files have been updated to state
this clearly in the headers yet, but it will be
done at some point.
|
|
-
|
Tidied up some job retrieval code - things
should again be a really tiny bit faster.
|
|
12/12/2001
|
|
-
|
Added the CVStore retrieval module.
|
|
-
|
Adding the last two retrieval modules seems
to have pushed the current server right up to
its limits. It is nearly going on overload
with the one minute update interval at several
other times of day, most notably around
16:30-18:00. This means that I will now have
to slow down the whole daily update to 2 minutes.
This is the only sensible option, really. At least
that way there is some spare capacity for the web
server to run on the same machine. Obviously, the
catchment rate will suffer fairly seriously,
especially at peak times, but since this is all
just a development system, it should be OK for
the time being. The production system will be
running on a much bigger server that should
easily be able to cope with all three sides
of the application (web server, job retrieval,
database) with loads of capacity to spare. Oh,
well - I guess this finally makes the decision
of "new server vs. no new server" for me...
|
|
-
|
Reviewed the idea of incorporating some of the other web sites
to the download list. Most of them don't seem to be worth it,
with the grand total of jobs posted over the last week coming
to around 1 (I am NOT kidding) for some of them. Others seem
more promising, but are likely to prove a bit tricky to extract
data from. But, I'm sure it can all be done in a meaningful
way...
|
|
13/12/2001
|
|
-
|
Added the ITJobs4 retrieval module. This
definitely pushes the limits on the current
server. Even if the one minute update
frequency wasn't too much before, it definitely
would be now. I am just concerned that the peak
hour updates will fail, even with the timing
reduced to 2 minutes... On the positive side,
this one has reasonably decent content. Not
quite up to the volume of the heavies, but it
is about half way up the throughput range.
|
|
-
|
Added ITPaths retrieval module. The server
is starting to feel the strain...
|
|
-
|
Checked all the download modules, and made
sure that no unnecessary libraries are included.
This should make the runtime memory
requirements slightly smaller. And oh, does the
server need all the help it can get...
|
|
-
|
All this, and it's barely the afternoon! :-)
|
|
-
|
16 known job sites remain that have not yet
been added to the search. At least a quarter
of those are probably not worth adding due to
their small throughput. Some of the remaining
ones may be difficult to extract data from.
Either way, things are definitely progressing.
|
|
-
|
Added ITJobLink retrieval module. 15 job sites
remaining. Damn, I'm having a productive day.
|
|
-
|
Increased the number of top agencies listed
from 20 to 25.
|
|
14/12/2001
|
|
-
|
Moved the progress report to the new News
page from the About page. Now there is a
separate News/Downloads page where the
Agencies page used to be. Since this site
is not intended for posting by Agencies,
the link was not serving any purpose anyway.
|
|
-
|
Some sort of "locking" is under consideration.
It could be done per update, or per thread.
This would potentially allow the update
time to be set to, say one minute, and the
following updates would then not start if
a previous one hasn't finished. Per update
locking would be simpler to implement. It
would just mean that if an update is started
before the previous one has finished, the
new update process would immediately abort.
Per thread locking would allow a much finer
granularity of the update process locking.
It would allow the locks to be placed for
each thread that relates to a download from
a specific search engine. This means that the
following updates would still execute for the
source sites that do not have other updates
currently running. This should be a much more
sophisticated and effective system than the
one initially used where the new process
simply killed the old one.
|
|
-
|
There is also the issue of detecting and
removing stale locks, and killing the
relevant processes when they take too long
(if they are running at all, and haven't
crashed). Nothing that cannot be solved,
though. Must go away and think about this
for a little while.
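One possible shape for per-source locking with stale-lock handling is sketched below in Python. This is only one option among several and not a decision: the lock-file layout, the `acquire_lock` and `release_lock` names and the 10 minute staleness limit are all assumptions. It does not attempt to kill a stuck process, and two processes clearing the same stale lock at the same instant could still race.

```python
import os
import time


def _lock_path(lock_dir, source):
    return os.path.join(lock_dir, source + ".lock")


def _create(path):
    # O_EXCL makes creation atomic: it fails if the file already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def acquire_lock(lock_dir, source, max_age_seconds=600, now=None):
    """Try to take the lock for one source site; return True on success."""
    now = time.time() if now is None else now
    path = _lock_path(lock_dir, source)
    try:
        _create(path)
        return True
    except FileExistsError:
        pass
    try:
        age = now - os.path.getmtime(path)
    except FileNotFoundError:
        return False  # released a moment ago; the next run will pick it up
    if age <= max_age_seconds:
        return False  # an earlier update for this source is still running
    # Stale lock: the holder crashed or hung. Clear it and try once more.
    release_lock(lock_dir, source)
    try:
        _create(path)
        return True
    except FileExistsError:
        return False


def release_lock(lock_dir, source):
    """Remove the lock for one source site, if it is still there."""
    try:
        os.remove(_lock_path(lock_dir, source))
    except FileNotFoundError:
        pass
```

With one lock per source, an update that starts while JobServe is still downloading would skip JobServe but still process every other site. A single lock name for the whole run would give the simpler per-update behaviour.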
|
|
-
|
Added a REx retrieval module. This one
was a bit trickier than usual because
their layout structure is slightly dynamic.
But, it was possible to compensate for that
without too much trouble.
|
|
-
|
UMTSWork web site seems to use PRECISELY the
same engine as the REx web site. They are
so similar, including the CGI (ASP, actually)
parameters, that it almost smells fishy. In
fact, it couldn't be fishier if it came with
tartar sauce. Either way, it should make
writing the retrieval module a piece of cake.
|
|
-
|
Yup, I was right - identical. Even down to
the dynamic peculiarity of the REx page
code. Oh, well, makes my life easier. Need
I add that UMTSWork is now in the retrieval
list? :-)
|
|
17/12/2001
|
|
-
|
Added the ITJobChange retrieval module. Eleven
other known job search engines remain. Most aren't
worth the bother, though...
|
|
-
|
It would appear that most of the server overloads
are actually occurring as a consequence of what
appears to be a bug in the cron daemon. It
sometimes triggers more than once for a specific
execution. This means that sometimes there is an
explosion of activity for no apparent reason that
causes the whole system to overload. There is only
one thing I'm aware of that could be causing this
to happen - the hourly time-sync process. It is,
however, extremely unlikely that this would cause
a problem like this.
|
|
-
|
The problem detected in the previous item can be
overcome by the process locking mentioned earlier.
This means that the process locking will solve
the overload problem from two fronts, which means
better catchment rates and faster updates. Hence
it has moved to the highest priority item on the
TODO list. The only problem is that I am not
entirely sure what the best way to implement
it is...
|
|
-
|
Had another server crash, this time the database
server. Both servers have finally been turned down
to completely stock settings (166 MHz). I guess
the combination of CPUs and motherboards just isn't
up to being overclocked. Considering all the other
problems this setup was having, I should probably
consider myself lucky if it works even in stock
form...
The upshot is that even though the CPUs are clocked
to 10% less, the bus is clocked to 10% more, and
the bus bandwidth is probably a bit more important
in this application.
|
|
-
|
Process level locking has been implemented. It
implements what is sometimes called a 'Highlander
Solution' - there can be only one process executing
an update from a particular source at any one time.
The update interval has been turned down to 1
minute. The lock "staleness" has been set to 2
minutes. This means that if the lock is older
than 2 minutes, it will be deleted, and the new
update will start anyway. This should take care
of the peak time server overloads by only slowing
down the threads that are taking a long time,
but it should also cap the amount of interference
at the level that 2-minute updates would cause. In
other words, it will never be slower than 2-minute
updates, but it can go as fast as 1 minute updates.
The idea is to provide a self-adjusting setup
where we only provide the limits for how fast and
how slow the updates can work. After that, the
system will try to get as close to the fastest
setting that the hardware will allow.
I will experiment tomorrow with setting up the
system to attempt updates every 30 seconds. Cron
cannot deal with sub-minute intervals, but I have
an idea about how to work around that.
The really cool thing is that the locks actually
get checked and applied before the master process
forks to execute the updates. This means that
the memory requirements are kept minimal. Even
better, in order to keep the lock time down to
a minimum, the locks are removed inside the child
thread, just after the data has been retrieved, but
before the data has been uploaded. This should
ensure that the locks are in place for as short a
time as possible, and the updates are actually
nearly instantaneous, because they are queued on
the database server side.
The problem is that if things go badly wrong,
very long running processes can still cause server
overloads. This is because the locks aren't 'final'.
Instead, they have a 2-minute life-time. I may
implement some sort of a stale process hunting
mechanism later to make absolutely sure that
things never have a chance to get out of hand.
But for now, this should be more than adequate, and
more sophisticated solutions are probably not going
to be necessary any time soon...
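The staleness rule and the early release of the lock can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual implementation: the lock is assumed to be a plain file whose modification time gives its age, and the names `acquire` and `update` are invented for the example. The takeover of a stale lock is not fully race-free in this form.

```python
import os
import tempfile
import time

STALE_AFTER = 120  # seconds; the 2-minute staleness described above

def acquire(path, now=None, stale_after=STALE_AFTER):
    """Take the lock at `path`; break it first if it is older than stale_after."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        if age <= stale_after:
            return False      # a fresh lock: skip this source for now
        os.remove(path)       # stale lock: delete it and start anyway
    except FileNotFoundError:
        pass                  # no lock present
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False          # another process got there first
    os.close(fd)
    return True

def update(path, retrieve, upload):
    """Hold the lock only while retrieving; release it before uploading."""
    if not acquire(path):
        return False
    try:
        data = retrieve()
    finally:
        os.remove(path)       # lock removed before the database upload
    upload(data)
    return True

lock = os.path.join(tempfile.mkdtemp(), "JobServe.lock")
t0 = time.time()
results = [
    acquire(lock, now=t0),        # no lock yet
    acquire(lock, now=t0 + 60),   # lock is one minute old
    acquire(lock, now=t0 + 300),  # lock is five minutes old
]
```

The three calls at the end show the self-adjusting behaviour: a one-minute-old lock blocks the new update, while a five-minute-old one is treated as stale and replaced.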
|
|
-
|
Made some modifications to the page stripping in
all the modules. The page pre-processing for
data extraction should now be much faster. In fact,
it is so much faster that the whole update
sequence now takes under 40 seconds! Considering
that it was slower than that before I added half
of the retrieval modules that are here now, that
is a pretty serious improvement. :-)
|
|
18/12/2001
|
|
-
|
The new locking code is working absolutely
beautifully. However, it is rather worrying to
see exactly how often the double triggers
actually happen - according to the logs, it
happens on average once every 15 minutes or so.
That is fairly serious, and coupled with some
heavy peak traffic, I can see how it could cause
the server to go away for a rather long time.
|
|
-
|
It looks like the update frequency of 1 minute
is not quite sufficient for the full catchment.
********* has just maxed out at 20.
OTOH, some catchment failures are pretty much
inevitable with this kind of application. I'll
have to investigate how often this catchment
failure occurs, but right now it doesn't seem
to be particularly serious.
Unfortunately, if this is caused by what I think
it is caused by, even cranking up the updates
to 30 seconds will not help. So, it is probably
best to just leave things as they are now...
|
|
-
|
It looks like the biggest culprit for server
overloads during peak hours was actually
JobServe. And it isn't because of the number
of jobs they get posted during the peak periods
- it is because of the load they face from users
during the peak periods!
The number of jobs retrieved is rather small
(often less than 5 per update), but the lockouts
occur quite often. This implies that there is
a problem with the JobServe servers coping with
the load. Absolutely pathetic - I was really
expecting the 'best' job search engine in the
country to be able to cope better. But then
again, they are running things on ASP, and
probably everything else that implies, so it is
probably not really all that surprising... More
disappointing than surprising. ;-)
All this means that the locking mechanism that
has just been implemented is improving the
performance of
JobsMeta
in more ways than it was anticipated. :-)
|
|
-
|
Changed the site appearance slightly, to make
a slightly better compromise between the
different browsers.
|
|
-
|
The web server has crashed AGAIN! Argh! I'm
stumped now. I thought I had the cause pinned
down to overclocking. It turns out that I must
have been wrong. Again, there is nothing in
the log files...
Now I'm back to square one. It could be a
faulty network card, motherboard, memory, CPU,
anything. I am going to have to keep an eye
on the database server very closely. If that
crashes in a few days as well, then it is
more likely a software problem somewhere.
I will have to check the RedHat site tomorrow
for an updated kernel and see if that fixes
the problem... I have a gut feeling that it
has something to do with SMP and RedHat
patched kernels. I may have to go and do
some thorough reconfigurations and upgrading
on my network to get the latest kernel running
on all machines. In fact, I'll probably install
just the clean source kernel without any patches
on all the machines, without all the gimmicky
patches... Probably a good idea to upgrade the
BIOS on all machines that need updates...
I guess some 'maintenance' was due anyway...
|
|
20/12/2001
|
|
-
|
Another server crash today! This is getting
ridiculous...
|
|
-
|
Right. Just upgraded the kernel. As clean
as it can be, only the RAID patches have been
applied. Let's see if that improves the
stability... It doesn't seem to have removed
the double-triggering cron problem...
|
|
21/12/2001
|
|
-
|
No server crashes yet since the upgrade. The
upgrade seems to have solved the 'unexpected
IRQ vector [???] on CPU#[?]!' error.
|
|
-
|
Doctored the wrapper execution shell script to
compensate for lack of functionality to run
things in intervals shorter than one minute.
Since it looks like overflows are still quite
common on some of the 'small buffer' source
sites, I have turned the update interval down
to 30 seconds. This should work just fine, as
the server overloads will still be prevented
by the locking mechanisms implemented earlier.
So, in the worst case, this shouldn't hurt
anything. Ultimately, the number of updates
per minute is limited by hardware and bandwidth,
and the software is now self-limiting so it
will not die horribly when it runs out of
resources.
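One common way around cron's one-minute granularity is to let cron start a wrapper every minute and have the wrapper run the job at several offsets within that minute. The sketch below shows the idea in Python rather than shell. It is an assumption about how such a wrapper might look, not a copy of the actual script, and the function name and its parameters are invented for the example.

```python
import time

def run_within_minute(job, offsets=(0, 30), sleep=time.sleep, clock=time.monotonic):
    """Call job() at each offset (in seconds) after this function starts."""
    start = clock()
    for offset in offsets:
        delay = start + offset - clock()
        if delay > 0:
            sleep(delay)  # wait until the next slot is due
        job()
```

Started once a minute by cron, `run_within_minute(update)` would give updates roughly every 30 seconds. If one run takes longer than 30 seconds, the next one simply starts late, and the locking described earlier keeps overlapping runs from piling up.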
|
|
-
|
D'oh! It looks like the sites that are overflowing
actually take so long to check that the faster
updates aren't actually helping. Instead of just
overflowing, they lock and overflow, which isn't
very useful. Switched off the 30 second updates.
Back up to 1 minute. Umpf...
|
|
-
|
Fixed a fairly nasty bug in the JobSearch
retrieval module that caused some jobs to not
get inserted.
Fixed a similar bug in UMTSWork and REx retrieval
modules.
|
|
-
|
Made a little program to process the logs. This
should help identify how often overflows occur,
and on what sources. Since running updates twice
as often doesn't do the trick of compensating,
I will probably extend the retrieval modules
that need it to dig deeper into the source when
overflows are detected, thus removing any
possibility of an overflow occurring.
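The "dig deeper" idea can be sketched like this. It is a Python illustration only, assuming the source lists its newest jobs first and that each job carries some identifier. The `id` field, the `fetch_page` callback and the page limit are hypothetical. A page on which every job is new counts as an overflow, so the next page is fetched as well.

```python
def drill_down(fetch_page, seen_ids, max_pages=10):
    """Collect unseen jobs, fetching further pages while a page is all new."""
    new_jobs = []
    for page in range(1, max_pages + 1):
        jobs = fetch_page(page)
        fresh = [job for job in jobs if job["id"] not in seen_ids]
        new_jobs.extend(fresh)
        # An empty page, or one containing a known job, means we have caught up.
        if not jobs or len(fresh) < len(jobs):
            break
    return new_jobs
```

In the normal case only the first page is fetched, so the extra bandwidth is spent only when an overflow has actually happened.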
|
|
-
|
The log processing program has revealed that
rather worryingly, overflows sometimes occur
even on JobServe (250) and GisAJob (100).
It appears that this sort of blocking is
occurring as a consequence of locking sometimes
causing every other update to be skipped.
Just re-tuned the locking to see if the system
can cope with the one-minute update load, if
there is no interference by the multiple
triggering (which locking removes anyway).
There also appear to be occurrences of cron
updates triggering randomly between the minute
boundaries. Locking should take care of that
as well.
|
|
24/12/2001
|
|
-
|
Just upgraded MySQL to 4.0.1. The performance
improvements are negligible, which is quite
unfortunate. There is a new feature, however,
that allows query caching. This compensates
very well for accidental double clicking and
such. So, it sometimes helps in reducing the
server load. Doesn't do much else, though...
|
|
-
|
The servers have been up to 180 MHz for the last
four days and still no random crashes. It looks
like using a clean 'vanilla' kernel solved the
problem. Obviously some of RedHat's kernel
patches broke things on this particular system.
|
|
-
|
Pre-insertion logging facility has been added.
This should greatly reduce the number of false
overflow alarms, and false inserts (probably all
of them). It should also quite significantly
improve the performance, because the selects
on this fast lookup table are much faster (the
table is much smaller, and all fields are
fixed length). Updates are now down to under 30
seconds!
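A rough picture of the lookup-before-insert step, using an in-memory SQLite database as a stand-in for MySQL. Only the Jobs table is named in this report. The `Seen` table, its columns and the function name are assumptions made for the sketch, and SQLite does not have the fixed-length rows that make the real lookup table fast.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Small lookup table: one short row per job already handled (hypothetical schema).
db.execute(
    "CREATE TABLE Seen (Source TEXT, RemoteID TEXT, PRIMARY KEY (Source, RemoteID))"
)
db.execute("CREATE TABLE Jobs (Source TEXT, RemoteID TEXT, Description TEXT)")

def insert_if_new(db, source, remote_id, fetch_description):
    """Insert a job unless its (source, remote ID) pair has been seen already."""
    seen = db.execute(
        "SELECT 1 FROM Seen WHERE Source = ? AND RemoteID = ?",
        (source, remote_id),
    ).fetchone()
    if seen:
        return False  # skip the expensive fetch and the big-table insert
    db.execute("INSERT INTO Seen VALUES (?, ?)", (source, remote_id))
    db.execute(
        "INSERT INTO Jobs VALUES (?, ?, ?)",
        (source, remote_id, fetch_description()),
    )
    return True
```

The saving comes from the early return: a job that is already known costs one cheap select, and its detail page never has to be downloaded again.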
|
|
01/01/2002
|
|
-
|
Through some clever use of the perl "eval"
function, the main update function code has
been reduced to a tiny fraction of its
previous size. It now all fits on a single
screen! :-)
|
|
-
|
CVStore drill-down implemented.
|
|
-
|
JobBoard drill-down implemented.
|
|
-
|
TechnoJobs drill-down implemented.
|
|
-
|
********* drill-down implemented.
|
|
02/01/2002
|
|
-
|
PlanetRecruit drill-down implemented.
|
|
-
|
ITJobChange drill-down implemented.
|
|
-
|
ComputerStaff drill-down implemented.
|
|
-
|
ITJobLink drill-down implemented. This one was
fairly nasty... So far this method has yielded
some fairly spectacular results. Not only has
the catchment been improved to perfection, but
the increase in throughput indicates that the
previous losses due to overruns were far more
serious than expected. More importantly, this
new feature has just opened the way to a new
application. Should this site ever be forced
to shut down due to objections from other
search engines, these modules will be able to
be re-used in a standalone user-downloadable
application a-la Copernic meta searcher that
performs on-demand cross-engine searches. Such
an application is likely to get developed
anyway once this web site has been completed.
:-)
|
|
-
|
ITJobs4 drill-down implemented. I've just
realized that they actually keep the SQL
statement in the form on the HTML page! Oh,
dear... Made me laugh, anyway.
|
|
03/01/2002
|
|
-
|
Fixed a bug in the ITJobLink retrieval module. I
hadn't noticed this before, but the jobs
there are posted in the ascending date order.
This means that the first page contains the
oldest jobs, and the retrieval has to be done
backwards. Unfortunately, this means that the
retrieval is much more inefficient, but it
cannot be helped.
|
|
-
|
Unfortunately, because ITJobLink is displayed
backwards, there is no point in updating it
at every run - it is just a waste of resources.
Instead, the ITJobLink update run has been
moved to the Daily script that runs just
after midnight. This way, all jobs will still
be caught, but the delay on them can be up
to 24 hours. Sorry, but there's not much more
I can do while keeping things efficient. The
server is being pushed to the limit anyway...
The upshot is that a significant amount of
bandwidth will be saved, memory consumption
reduced, and updates speeded up.
|
|
04/01/2002
|
|
-
|
ITPaths drill-down implemented.
|
|
-
|
TechnoJobs have changed the way their
web site works in a few details. Email
address is no longer displayed, presumably
in an attempt to stop page ripping. [ahem]
They now use a redirector script that is
supposed to just fire up a mail window. This
probably works for IE users on Windows, but
falls over flat on its face on most other
things.
Either way, this has been compensated for,
and the email addresses are again being
extracted from the TechnoJobs site.
|
|
-
|
Changed the download scheduling a bit.
Now the scheduling runs at 1 minute intervals
during the 02:00-03:00 period. The reason for
this is that there seems to be a huge surge in
job uploads to GisAJob during that time,
especially just after 02:00. This is the only
way to compensate for overflows like these on
GisAJob (and JobServe, 4Weeks, and JobFizz).
According to the numbers I'm looking at, this
surge may well have been causing a loss of
around 1000 jobs every day, from GisAJob, that
weren't being retrieved.
|
|
06/01/2002
|
|
-
|
REx drill-down implemented.
|
|
-
|
UMTSWork drill-down implemented.
|
|
-
|
Unfortunately, it turns out that both REx and
UMTSWork implement things in the same broken
way in which ITJobLink does. In fact, they are
even worse - they limit the results to 200. This
means that daily, it is only possible to get
the FIRST 200 posted jobs AT BEST. Still, that
is better than the current catchment rate...
These two will therefore also be moved to the
daily update script, as there is no point in
parsing the same pages over, and over, if there
is no chance of catching new data...
Either way, this finishes the drill-down feature
implementation. Rather convenient, too, because
I start a new job tomorrow, and I may be unable
to do any more work on this site for a while.
|
|
07/01/2002
|
|
-
|
Applied a derivation of the depth retrieval
method to the 4Weeks module. Now it retrieves
the last 25 jobs. This should make the number
of retrieved jobs skyrocket. The current numbers
imply that 4Weeks is a very good source, content
wise.
|
|
-
|
Modified unique indexing mechanism. There is
now only one unique index on the Jobs table.
For the sites where the Remote IDs don't apply,
they are synthesised by concatenating the posted
date, the reference number and the agency name.
This replaces the functionality of the previous
second unique index. The advantage is (hopefully)
that the indices are smaller, and hence faster.
|
|
16/01/2002
|
|
-
|
I've been lazy with updating this page
over the last week. The user account
maintenance section is now complete. Some
minor timing changes have been made to
make things run a bit more smoothly during
peak hours. All functions except those
directly responsible for Page/State handling
have been moved from the main JobsMeta
library into an Auxiliary library. This
should help keep things a bit more
maintainable.
|
|
17/01/2002
|
|
-
|
Finally, a breakthrough! By making use of
the "USE INDEX" clause for explicitly
specifying indices to be used by SELECT
queries, the searching performance has been
improved by approximately 10 times! Yes,
that is 10 TIMES in the 0-day case. The
improvements reduce linearly as the date
range is increased, but in the general 0/1-day
case, the improvement is rather spectacular!
The query time is now in the sub 1 second
range! The only explanation I can think of for
this is that the FULL TEXT search uses its index
regardless of other indices used. This is a bit
unintuitive, because MySQL is supposed to only
use one index per table per query. This appears
not to include the FULL TEXT indices. By
pre-reducing the data set through the Retrieved
and Type fields, the usually slow FULL TEXT now
only needs to look through very few records, thus
yielding a massive speed improvement. :-)
This has probably been the biggest stumbling
block so far, as far as the performance is
concerned.
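For illustration, a query of this shape could be put together as below. This is only a sketch: the Jobs table and the Retrieved and Type fields are mentioned in this report, but the index name and the Title/Description full-text columns are invented, and the real query is not shown anywhere here. The `%s` placeholders follow the style used by the common Python MySQL drivers.

```python
def build_search(terms, job_type, days, index_name="idx_retrieved_type"):
    """Return (sql, params) for a full-text search limited to recent jobs."""
    if not index_name.isidentifier():
        raise ValueError("index name must be a plain identifier")
    sql = (
        "SELECT * FROM Jobs USE INDEX (" + index_name + ") "
        "WHERE Retrieved >= DATE_SUB(CURDATE(), INTERVAL %s DAY) "
        "AND Type = %s "
        "AND MATCH (Title, Description) AGAINST (%s)"
    )
    return sql, (days, job_type, terms)

sql, params = build_search("perl", "Contract", 1)
```

The index hint names the index over the date and type fields, so the row set is narrowed before the full-text condition is evaluated.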
|
|
18/01/2002
|
|
-
|
Did some additional performance testing
on the new indexing method. The old
FULLTEXT only search has a flat query
time of around 10 seconds. This makes it
the fastest search for 2+ days worth of
data. At 2 days, the score is tied,
so FTS is used by preference.
However, the search by Retrieved field
is drastically faster in the 0/1-day case.
The measured performance is at 1-2 seconds
for the 0-day case, and 4-8 seconds in the
1-day case, depending on the other options
specified.
|
|
-
|
Oh, the annoyance... It looks like there is
a bug in MySQL that makes the new and improved
indexing I tried to implement fail. It is
faster, but unfortunately, the results it
returns are wrong. Oh, well... Back to the
old and slow way of doing things. At least
it is reliable...
|
|
20/01/2002
|
|
-
|
Yes, it has finally happened! The bulk
application is now partially implemented!
The bulk application works for all the
jobs that have the application email address
listed. There are some job sites, such as
PlanetRecruit, JobBoard and CVStore that do
not list an email address, but instead provide
an on-line application form. These forms are
not supported at the moment. The support for
use of those forms for automated application
is pending, and will be implemented shortly.
|
|
21/01/2002
|
|
-
|
Replaced the home-grown binaries for MySQL
with the ones from the MySQL web site, and
installed the RPMs. Hopefully, this will
solve the problem of random server crashes
that I have been experiencing recently...
|
|
-
|
D'oh! It looks like I will HAVE to compile the
package myself after all. I need to modify the
source to get some relevant key words out of the
stop-word list... Well, last time it was built
with fairly extreme optimisation parameters,
such as -O6 and a few other things. Maybe it
will be sorted out now. Also, this time I am
building it with the RedHat GCC-2.96. This
could allegedly cause problems, but since
PGCC-2.95.2 compiled version seemed to have
problems, the options are running short...
I've taken the opportunity to strip out the
unnecessary default functionality.
|
|
23/01/2002
|
|
-
|
Random crashes are starting to plague the
system again. I am seriously starting to
suspect that I have a pile of dodgy hardware.
I'm almost prepared to point the finger at
the old GA-586-DX boards. Ah, whatever.
Will get a new server for the whole system
anyway.
|
|
-
|
Added a new statistics page that breaks down
the job throughput per hour. This should
provide good, reliable information about peak
periods, and help tune up the cron job to
maximise the catchment rate, and at the same
time minimize the bandwidth requirements
where possible.
|
|
24/01/2002
|
|
-
|
It is still unclear as to what may be causing
these random crashes. So far, they have been
occurring within 24 hours of a reboot, for
the last several days. That is, needless to
say, very, very wrong, especially since
nothing on the system has changed. I have
set up up2date on the system, and updated to
the latest versions of all the packages and
rebooted the servers, but I somehow doubt that
will fix things. If this doesn't work out, then
I am out of ideas. The one last thing I could
attempt is upgrading to RedHat 7.2. This may
solve some of the problem, due to the newer
v2.4 kernel series, but I was really hoping to
put off upgrading to v2.4 for quite a while,
until the development on the next production
release starts. Otherwise, I'm likely to face
constant upgrading. For a long time...
|
|
26/01/2002
|
|
-
|
Added a CWJobs retrieval module.
|
|
-
|
Griffin Internet have started doing
a 1 month minimum length contract on
ADSL, so the upgrade to a properly hosted
web site is fairly imminent. I will order
the installation on Monday.
|
|
29/01/2002
|
|
-
|
The solution to duplicate adverts has proven
to be much simpler than I thought. Implementing
a unique index over the Agency name and the
first 255 characters of the job description has
yielded trimming of about 20% of the jobs in
the database. This is a good thing in many
ways.
1) It gives a much fairer picture of how many
jobs specific agencies are posting.
2) It gives a much more accurate picture of
how many jobs are on the market at the moment,
thus more accurately indicating the state of
the industry.
3) It removes the very irritating situations
where the same job appears on the same page
half a dozen times, just because the agency
has posted it to multiple job sites. (This
is the primary reason the feature was implemented
- to increase the signal / noise ratio in the
job adverts.)
4) It reduces the number of records in the
database without reducing the number of
actual jobs. This means that it puts much less
strain on the system, and makes the searches
faster.
5) Only the first of the duplicate jobs gets
posted. This means that it will get listed
on the site that displays it first, thus
favoring faster sites in the statistics view.
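The effect of that unique index can be mimicked in a few lines of Python. This is an in-memory sketch with assumed field names. In the real system the database enforces the rule by rejecting the duplicate insert.

```python
def dedupe(jobs, prefix_len=255):
    """Keep the first job for each (agency, description prefix) pair."""
    seen = set()
    kept = []
    for job in jobs:
        key = (job["agency"], job["description"][:prefix_len])
        if key not in seen:
            seen.add(key)
            kept.append(job)  # first arrival wins, later copies are dropped
    return kept
```

Because the first arrival wins, the copy that survives belongs to whichever site was polled successfully first, which is what point 5 describes.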
|
|
-
|
The re-implemented Agency name filtering
(strips off the trailing ltd/limited/plc) should
provide a further reduction in duplication, but it
will take a few days for the system to start being
effective.
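A minimal Python sketch of such a filter. The exact rules are an assumption: here the suffix must follow a space or comma, may end in a full stop, and is matched case-insensitively.

```python
import re

# Trailing company suffix, preceded by whitespace or a comma.
_SUFFIX = re.compile(r"[\s,]+(ltd|limited|plc)\.?$", re.IGNORECASE)

def normalise_agency(name):
    """Strip a trailing ltd/limited/plc so spelling variants compare equal."""
    return _SUFFIX.sub("", name.strip())
```

With this in place, "Acme Recruitment Ltd" and "Acme Recruitment Limited" both reduce to "Acme Recruitment" and so fall under the same duplicate check.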
|
|
-
|
The object-oriented porting is going quite
nicely, even though I have encountered a problem.
For some obscure reason, the multiple inheriting
classes clobber each other at the end program
level. I'm a bit stumped as to why, but I've
worked around it by importing classes after
forking the threads. It would be nice to avoid
that, but it will do for now, until I can figure
out what exactly is going wrong...
|
|
04/02/2002
|
|
-
|
Finally, the JobSearch retrieval module has been
converted to use the LWP library instead of the
ugly Lynx shell request. The problem was that the
JobSearch site used non-standard field separators
that were getting incorrectly converted for the
HTTP GET request. The object oriented port is
over 2/3 complete. Once this is complete, I will
finish off the bulk application functionality to
include the web-form based application for cases
where it is the only option.
|
|
05/02/2002
|
|
-
|
Made further enhancements that will aid the
filtering of duplicate jobs.
|
|
-
|
Overloaded the constructors to avoid the
'use' clauses inside the fork blocks. It is
probably more elegant than the old workaround...
|
|
-
|
The filtering of duplicates is working quite
well. It now appears that the total number of
jobs posted may have contained as much as 30%+
of duplication. Of course, the reliable measure
of this will not be attainable for another week,
but preliminary results are very promising.
|
|
06/02/2002
|
|
-
|
Implemented a further feature to prevent
server overloads. Now new instances of the
update program will not be executed if there
are already more than 20 running. This
should remove any possibility of server
overloads.
|
|
10/02/2002
|
|
-
|
After suffering a complete RAID stripe failure
yesterday, and rebuilding the complete database
server, it looks like I have finally found a
reliable workaround for the stability problems
that have plagued the system for months.
After searching through countless archives, it
appears that there is a reasonably well known
bug in the GA586DX motherboards. The APIC
is to blame. Disabling it fixes the problem.
So far, so good. The retrieval server has
already been up for longer than usual, and there
are no odd occurrences of unkillable processes.
This just may have fixed things.
|
|
13/02/2002
|
|
-
|
Added a config file to allow easier configuration
of some plausibly tweakable parameters.
|
|
-
|
The stability of the servers has so far been
very good. It looks like disabling some of the
SMP APIC functionality did the trick.
|
|
23/02/2002
|
|
-
|
Updated the WorkThing retrieval module. They
changed the page layout on the site, so the
module broke. All is working again.
|
|
-
|
Modified the way forking and locking works.
Now the locking checks for the number of update
processes/threads already running. If there are
too many (configurable), it will not kill the
old process lock and re-run. This provides a
rather effective throttling feature. The main
update program has now been modified to do
locking and forking separately. This has the
advantage that the locking code doesn't count
the threads already running from the current
update run, so the throttle process limit is much
more easily and precisely controlled. This should
reduce server overloads and at the same time
make the program more configurable. There is
now an option for this in the config file.
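The throttle decision itself is small. Roughly, as a Python sketch: the parameter names and defaults are assumptions, with the 2-minute staleness and the limit of 20 processes taken from the earlier entries.

```python
def may_break_stale_lock(lock_age, running, stale_after=120, max_running=20):
    """Decide whether a new update may delete an old lock and re-run."""
    if lock_age <= stale_after:
        return False  # the lock is still fresh, leave it alone
    if running > max_running:
        return False  # too many updates already running: do not add another
    return True
```

Since the check happens before forking, `running` counts only processes left over from earlier runs, which is what makes the limit easy to reason about.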
|
|
02/03/2002
|
|
-
|
Overhauled the authentication and session
verification code. It is now much more
compact and maintainable. All the Session
and Cookie management is now performed in
a single, compact function. The modification
will make the application smaller and faster.
|
|
-
|
The feature to save user-defined queries
has been implemented. This pretty much
finishes off the feature set. The remainder
can be implemented as and when. Probably
after the ADSL is installed and the domain
name is changed to the real one.
|
|
05/03/2002
|
|
-
|
Wehey! First user feedback today. Updated the
filtering and improved the classification of
Contract/Permanent jobs for the jobs that were
not being classified ('Either' type).
|
|
-
|
Implemented the unique reference number filtering.
|
|
23/08/2002
|
|
-
|
By popular request, a feature to allow location
searching has now been implemented. It is not as
advanced as I had hoped for it to be, unfortunately.
It has been implemented by adding the Location field
into the Full Text Index for searching. Whether this
solution is any worse than what I had planned is
not really very clear, but even if it is not as good,
a rudimentary ability to include location searching
now exists.
|
|
-
|
Added a query example to the search page.
|
|
-
|
Upgraded MySQL to 4.0.2. This allows for a few
performance increases. The core of the DB
isn't proving to be any faster, but there are a few
bug fixes that now allow for some hand-optimisations
to work in a few places where they used to completely
break the queries.
Please forgive any instability you encounter while the
new installation stabilises a bit. Some of the new
optimisations may take a while to test thoroughly, although
they seem to work just fine. Please report any
unexpected problems.
|
|
31/08/2002
|
|
-
|
Added GoJobSite retrieval module. This was a bit awkward
because the drill-list can only be accessed if cookies are
enabled. To keep things generic, the Generic module was
upgraded to pass all cookies the site sets when they are
required.
|
|
02/09/2002
|
|
-
|
Re-wrote the 4Weeks retrieval module. It now runs as a
single daily update. This should save both memory and
bandwidth. I only realised this today, but it can be done
in EXACTLY the same way the JobFizz retrieval works.
|
|
-
|
Typical. Six months of nothing and then two MySQL updates
get released in a matter of weeks. Will have to upgrade
the server to MySQL 4.0.3... Maybe not today, though...
|
|
20/09/2002
|
|
-
|
Suffered disk failure on the web server. One SCSI disk failed
with a bad block, and took down the RAID stripe and some data
with it. All back now. All the disk seemed to need is a
low-level re-formatting, and a media scan to re-map the
defects.
|
|
-
|
Added a feature that highlights the searched terms in the
result set. This should make it more obvious how suitable each
job is.
|
|
25/09/2002
|
|
-
|
Added another highlighting feature. Since MySQL now supports
term* type matches, this functionality has been added to the
highlighting code.
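A sketch of how highlighting with term* support might look, in Python. The `<b>` markup and the function name are assumptions, and the text is assumed to be already safe for HTML output. A plain term matches as a whole word only, while a trailing `*` matches any word that starts with the stem.

```python
import re

def highlight(text, terms):
    """Wrap matches in <b>; a trailing * matches any word ending."""
    patterns = []
    for term in terms:
        stem = term.rstrip("*")
        if not stem:
            continue  # ignore empty terms and a bare "*"
        if term.endswith("*"):
            patterns.append(r"\b" + re.escape(stem) + r"\w*")
        else:
            patterns.append(r"\b" + re.escape(stem) + r"\b")
    if not patterns:
        return text
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    return regex.sub(lambda match: "<b>" + match.group(0) + "</b>", text)

marked = highlight("Perl programmer wanted, Perl/CGI", ["program*", "cgi"])
```

Terms that end in punctuation (such as "c++") would need extra care, because the word-boundary test does not behave usefully around non-word characters.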
|
|
-
|
Implemented a small optimisation in the GisAJob retrieval
module. Before it retrieved all jobs, regardless of the
"sector". Now, it only retrieves the IT jobs. This should
make a small saving in bandwidth.
|
|
26/09/2002
|
|
-
|
Made an optimisation in the handling of the multi-page results.
All the records, regardless of the page number, used to be
retrieved from the database. This was inefficient: it ate up
a lot of memory in the server process, and all the records had
to be parsed for the recently implemented highlighting. Now the
SQL queries have been modified to use the LIMIT/OFFSET clause
to ensure that only the records for the particular page are
returned.
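The page arithmetic is simple enough to show in a few lines. The sketch below uses an in-memory SQLite database as a stand-in for MySQL (both accept `LIMIT ... OFFSET ...`). The ID column, the ordering and the page size of 10 are assumptions for the example.

```python
import sqlite3

PER_PAGE = 10  # assumed page size

def fetch_page(db, page, per_page=PER_PAGE):
    """Return only the rows for one 1-based results page."""
    offset = (page - 1) * per_page
    return db.execute(
        "SELECT ID FROM Jobs ORDER BY ID DESC LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Jobs (ID INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO Jobs VALUES (?)", [(n,) for n in range(1, 26)])
page3 = [row[0] for row in fetch_page(db, 3)]  # last, partial page of 25 rows
```

Only the rows for the requested page ever leave the database, so the highlighting code has at most one page of records to process.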
|
|
-
|
Fixed an obscure potential security hazard.
|
|
27/09/2002
|
|
-
|
Added TheITJobBoard retrieval module.
|
|
02/10/2002
|
|
-
|
Upgraded MySQL to v4.0.4. Amazing. 6 months
with no updates, and then v4.0.2, v4.0.3 and
v4.0.4 all come out within a month. I guess
this is a good thing...
|
|
TODO List
|
|
-
|
Process locking to prevent server overload
and improve catchment rates.
|
Done!
|
|
-
|
Sub-1-minute updates.
|
Done!
|
|
-
|
Some kind of 'RemoteID' quick pre-insertion,
so the follow-up updates don't duplicate
the work. This should be reasonably easy to
implement soon, and it should improve
performance and save bandwidth quite
significantly, especially during the peak
periods.
|
Done!
|
|
-
|
Selective depth search on retrieval.
|
Done!
|
|
-
|
User account maintenance.
|
Done!
|
|
-
|
Update everything into a more object oriented
design. This should improve readability and
maintainability. It will also massively reduce
the code size, due to inheritance - there is
a LOT of code duplication between most of the
retrieval modules...
|
Done!
|
|
-
|
Break configurable things out of the
code and into a human-readable config file.
|
Done!
|
|
-
|
Feature to save user-defined searches.
|
Done!
|
|
-
|
Duplicate advert nullification.
|
Done!
|
|
-
|
Bulk application.
|
In progress!
e-mail application works.
Need separate application routines for each
web-form submission...
|
|
-
|
Location searching.
|
Done!
|
|
-
|
Sign-up e-mail verification.
|
Pending...
|
|
-
|
Keeping track of the most popular keywords.
|
Pending...
|
|
-
|
Statistics about jobs that members apply for. This will
later enable the creation of an AI that will recognize
what jobs a member might find interesting, and alert
the member of "highly matching" jobs.
|
Pending...
|
|
-
|
Instant e-mail/SMS notification for closely
matched jobs.
|
Pending...
|
|
-
|
More retrieval modules. ATSCOJobs and TheITJobBoard.
|
TheITJobBoard Done...
|
|
-
|
Re-skinning to something that looks a bit nicer.
|
Pending...
|
|
-
|
Highlighting of search terms in the text body will
be moved to client-side, using JavaScript. This will
save the server from doing a fair amount of work, thus
making it able to serve more requests. It should all
still work on all commonly used browsers.
|
Pending...
|
JobsMeta
News
|