Technical Operations Status Report

Project Operations from 2017-07-01 to 2017-08-11

Network Operations 127295 mr1-ulsfo booted from backup JunOS Screep Done None
Network Operations 155875 asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues Screep Done None
Network Operations 171714 "MySQL server has gone away" from librenms logs Screep Done None
Network Operations 156957 asw-d-codfw public1-vlan addition review (blocks gerrit2001) Screep Done None
Network Operations 143915 Fix static IP fallbacks to Pybal LVS routes Screep Done None
Network Operations 118259 Figure out the source of QSFP+ errors with DAC + MX480 Screep Done None
Network Operations 133852 analytics hosts frequently tripping 'port utilization threshold' librenms alerts Screep Done None
Network Operations 130840 setup/deploy server analytics1003/WMF4541 Screep Done 8.0
Network Operations 145270 Telkom/8ta (South Africa) users cannot connect to wikimedia sites Screep Done None
Network Operations 146391 eeden ethernet outage In-Scope Open None
Network Operations 122406 Consider renumbering Labs to separate address spaces In-Scope Open None
Traffic 170192 remove eventdonations.wikimedia.org CNAME Screep Done None
Traffic 135515 graphite.wikimedia.org 503s on some css/js resources Screep Done None
Traffic 132464 HTTPS redirects for transparency.wikimedia.org Screep Done None
Traffic 157353 prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet In-Scope Done None
Traffic 146451 repeated 503 errors for 90 minutes now Screep Done None
Traffic 164579 Investigate nginx reload behavior In-Scope Done None
Traffic 93927 Make OCSP Stapling support more generic and robust Screep Done None
Traffic 89688 Varnish GeoIP is broken for HTTPS+IPv6 traffic Screep Done None
Traffic 133217 Fix apache-2.4 + DHE ciphersuites issue Screep Done None
Traffic 131501 Convert misc cluster to Varnish 4 Screep Done None
Traffic 82747 pybal health checks are ipv4 even for ipv6 vips Screep Done None
Traffic 118787 releases.wikimedia.org should be https only and have hsts set Screep Done None
Traffic 134464 Make RB ?redirect=false cache-efficient Screep Done 0.0
Traffic 127492 Switch ulsfo to backend to codfw rather than eqiad Screep Done None
Traffic 132462 HTTPS redirects for parsoid-tests.wikimedia.org Screep Done None
Traffic 148780 mobile-safari has very few internally-referred pageviews Screep Done None
Traffic 154801 Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently In-Scope Open None
Traffic 152882 Many misc wikis lack mobile domains In-Scope Open None
Traffic 148976 Strongswan Icinga check: do not report issues about depooled hosts Screep Open None
Traffic 119372 Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. In-Scope Open None
Traffic 137979 Support brotli compression In-Scope Open None
Traffic 141266 letsencrypt puppetization: add parallel rsa+ecdsa cert support In-Scope Open None
Traffic 141480 mixed-content issues on planet.wikimedia.org Screep Open None
Traffic 144508 Point wikipedia.in to 205.147.101.160 instead of URL forward In-Scope Open None
Traffic 164259 Add VSL error counters to Varnishkafka stats Screep Open None
Traffic 120121 Improve Varnish XFF processing for trusted proxies In-Scope Open None
Traffic 120486 add a https-only option to dynamicproxy In-Scope Open None
Traffic 79730 Add pybal check to ensure service IP is bound Screep Open None
Traffic 168529 Upgrade to Varnish 5 In-Scope Open None
Traffic 102178 Fix RESTBase support for wikitech.wikimedia.org Screep Open None
Traffic 106517 upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 In-Scope Open None
Traffic 166782 wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there Screep Open None
Traffic 75944 Monitor Varnish caches on beta cluster have two varnishd process running In-Scope Open None
Traffic 109331 Deleted files sometimes remain visible to non-privileged users if permanently linked In-Scope Open None
Traffic 109776 Tilerator should purge Varnish cache In-Scope Open None
Traffic 112316 Configure varnish to use "Unconfigured domain" page for 404 Not Served (instead of generic error) In-Scope Open None
Traffic 112765 Phabricator needs to expose notification daemon (websocket) Screep Open None
Traffic 165765 Refactor pybal/LVS config for shared failover In-Scope Open None
Traffic 164868 SSL error for https://wikispecies.org/ In-Scope Open None
Traffic 128559 store.wikimedia.org HTTPS issues In-Scope Open None
Traffic 164609 Merge cache_misc into cache_text functionally In-Scope Open None
Traffic 154702 Fix broken referer categorization for visits from Safari browsers In-Scope Open None
DBA 138460 Upgrade m3 (phabricator) db servers Screep Done None
DBA 150974 db2042 disk predictive failure Screep Done None
DBA 145630 Cannot delete two pages with large histories even having the appropriate permissions to do so Screep Done None
DBA 125215 Prepare db1018 and s2-slaves for s2 master failover Screep Done None
DBA 156905 Phabricator master and slave crashed Screep Done None
DBA 119056 External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available Screep Done None
DBA 170941 Global rename of user Moros Screep Done None
DBA 167031 Global rename of Idh0854 → Garam: supervision needed In-Scope Done None
DBA 141252 icinga hp raid check timeout on busy ms-be and db machines In-Scope Open None
DBA 134476 Decommission old coredb machines (<=db1050) In-Scope Open None
DBA 127570 Rename be_x_oldwiki database to be_taraskwiki In-Scope Open None
DBA 119626 Eliminate SPOF at the main database infrastructure In-Scope Open None
DBA 163143 dbtree: don't return 200 on error pages In-Scope Open None
DBA 50930 Database replication problems - production and labs (tracking) In-Scope Open None
DBA 168584 Labsdb* servers need to be rebooted In-Scope Open None
DBA 134809 Apache <=> mariadb SSL/TLS for cross-datacenter writes Screep Open None
Software Development 148494 Add shell scripts CI validations In-Scope Open None
Software Development 152950 E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter Screep Open None
Software Development 159045 Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api In-Scope Open None
Software Development 150560 More verbose messages from service-checker-swagger In-Scope Open None
Hardware Requests 154706 Codfw: (1) hardware access request for labtestneutron refresh Screep Done None
Hardware Requests 161753 eqiad: (1) hardware access request for labnodepool1002 Screep Done None
Hardware Requests 146455 Decommission labsdb1002 Screep Done None
Hardware Requests 164959 Decommission mw2098 Screep Done None
Hardware Requests 148513 codfw/eqiad: 2x systems for prometheus Screep Done None
Other Operations 132921 Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" In-Scope Cut None
Other Operations 172689 Install missing Spamassassin DKIM dependencies on lists and mx Screep Done None
Other Operations 86890 Setup imagescalers cluster in codfw Screep Done None
Other Operations 97758 SVG rendering with marker-element is different between librsvg and Inkscape Screep Done None
Other Operations 99105 Kafka Broker disk usage is imbalanced Screep Done None
Other Operations 116019 Rack/Setup labvirt1010 and 1011 Screep Done None
Other Operations 116963 upgrade radium to jessie Screep Done None
Other Operations 117477 reclaim lawrencium to spares Screep Done None
Other Operations 120731 eventlog1001 access for user Madhuvishy Screep Done None
Other Operations 120870 Clean up some accidental restbase metrics Screep Done None
Other Operations 123285 remove gbyrd from exim alias file Screep Done None
Other Operations 123438 uwsgi puppet module does not seem to trigger restart when config is updated Screep Done None
Other Operations 123472 Site: (1) VM request for url_downloader Screep Done None
Other Operations 123646 remove exim alias keynote@ Screep Done None
Other Operations 123728 replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) Screep Done None
Other Operations 123796 Wes Moran not able to log into Graphite Screep Done None
Other Operations 124639 Metrics not reaching Graphite Screep Done None
Other Operations 125058 Prepare mathoid for the codfw switchover Screep Done None
Other Operations 125069 Create a service location / discovery system for locating local/master resources easily across all WMF applications Screep Done None
Other Operations 125565 Update Label for oresrdb1001 (WMF4577) & relocate and update label for oresrdb1002 (WMF4578) Screep Done None
Other Operations 126221 Evaluate efficacy of DateTieredCompactionStrategy In-Scope Done None
Other Operations 126242 Reduce the number of appservers we're using in eqiad Screep Done None
Other Operations 126283 Requesting restbase-roots access to RESTBase cluster for Petr Pchelko Screep Done None
Other Operations 127488 (re)move problemsdonating aliases In-Scope Done None
Other Operations 127489 donation aliases for moneybookers? Screep Done None
Other Operations 127493 remove / change blog@wm mail alias? Screep Done None
Other Operations 127720 Allow aqs-admins to deploy via scap using deploy-service ssh key Screep Done None
Other Operations 128463 New Service Request - Change Propagation Screep Done None
Other Operations 128639 Remove rbraceysherman@ from fr-all list Screep Done None
Other Operations 129222 Icinga disk space check should also check inode usage In-Scope Done None
Other Operations 131195 Bacula recovery of sql files from silver/wikitech fails Screep Done None
Other Operations 133966 missing /etc/ssl/dhparam.pem on jessie Apache using ssl_ciphersuite Screep Done None
Other Operations 134419 repair/replace pem1 in cr1-ulsfo Screep Done None
Other Operations 134866 Automatic monitoring checks for the MCS failing in production Screep Done None
Other Operations 136405 Jmxtrans failures on Kafka hosts caused metric holes in grafana Screep Done None
Other Operations 140646 deploy prometheus node_exporter for host monitoring Screep Done None
Other Operations 141957 ferm rules on icinga are broken Screep Done None
Other Operations 143465 Access to people.wikimedia.org for Volker_E Screep Done None
Other Operations 144793 db1020 degraded array Screep Done None
Other Operations 145632 Separate 404s into their own log Screep Done None
Other Operations 149078 Upgrade firejail to 0.44 Screep Done None
Other Operations 149432 puppet compiler claims "no change" when catalogs are actually different In-Scope Done None
Other Operations 149981 Ask firejail upstream about ability to turn off pid namespacing Screep Done None
Other Operations 153278 Trying to scap while l10nupdate is syncing shows unhelpful error Screep Done None
Other Operations 154927 Deploy TwoColConflict extension to beta Screep Done None
Other Operations 155443 Update ci to nodejs 6 Screep Done None
Other Operations 161520 HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script Screep Done None
Other Operations 162770 SATA errors for stat1004 in the dmesg In-Scope Done None
Other Operations 162949 hosts with puppet compiler failures on every run In-Scope Done None
Other Operations 163721 Update wikitech-static and develop procedures to keep it maintained Screep Done None
Other Operations 165051 HHVM 3.18 segfault on jobrunner / string handling Screep Done None
Other Operations 165943 Access to search logs for Jan Dittrich Screep Done None
Other Operations 167157 rack/setup/install labtestpuppetmaster2001 In-Scope Done None
Other Operations 167871 Refactor maps puppet code to the role / profile paradigm Screep Done None
Other Operations 168444 Logo for sr.wikiquote.org Screep Done None
Other Operations 168782 Create fishbowl wiki for Maithili Wikimedians User Group Screep Done None
Other Operations 168892 rack/setup/install labtestservices2002.wikimedia.org In-Scope Done None
Other Operations 169360 Unresponsive/misconfigured iDRACs over the host-BMC interface In-Scope Done None
Other Operations 169485 Add support for directory environments to our puppet classes, production puppetmaster Screep Done None
Other Operations 169871 mgmt inaccessible on restbase1018 Screep Done None
Other Operations 169959 bast3002 didn't come up after reboot Screep Done None
Other Operations 170653 create netmon1003, migrate servermon from netmon1001 to netmon1003 Screep Done None
Other Operations 171280 wikitech api list=novainstances not returning list of instances Screep Done None
Other Operations 133656 Have a paging check for Nova API accessible In-Scope Open None
Other Operations 151045 Extending Yubico 2FA for production use (meta bug) In-Scope Open None
Other Operations 151049 Run systematic availability tests In-Scope Open None
Other Operations 151050 Proper documentation for Yubico 2FA for production use In-Scope Open None
Other Operations 152439 cronspam from labtestservices2001 /etc/dns-floating-ip-updater.py > /dev/null In-Scope Open None
Other Operations 152562 Port fundraising stats off Ganglia In-Scope Open None
Other Operations 133476 Proposal: Centralize OTRS login methodology Screep Open None
Other Operations 130617 Collect metrics on pool counter usage In-Scope Open None
Other Operations 97909 Upgrade jobrunners to redis 2.8 In-Scope Open None
Other Operations 153416 docker-engine pulled into our repositories only keeps the latest version In-Scope Open None
Other Operations 129841 Many minions fail to connect to salt master since 10:39 Screep Open None
Other Operations 129224 on labcontrol1001, /var/cache/salt has too many files! In-Scope Open None
Other Operations 126158 [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number Screep Open None
Other Operations 154915 Get rid of "import realm.pp" in manifests/site.pp In-Scope Open None
Other Operations 168619 Degraded RAID on lvs3001 In-Scope Open None
Other Operations 86081 Complete the use of HHVM over Zend PHP on the Wikimedia cluster Screep Open None
Other Operations 125442 es2009 degraded RAID Screep Open None
Other Operations 156136 Increase swift replication factor for accounts In-Scope Open None
Other Operations 124413 confctl should provide tags information after writing data In-Scope Open None
Other Operations 122825 Service Ownership and Maintenance In-Scope Open None
Other Operations 122210 Security audit for tftp on Carbon In-Scope Open None
Other Operations 158429 Switch to predictable network interface names? Screep Open None
Other Operations 121610 system users with UIDs > 500 Screep Open None
Other Operations 159242 Segmentation fault creating thumbnail In-Scope Open None
Other Operations 159536 Puppet constantly trying to stop the already stopped puppetmaster process on Trusty Screep Open None
Other Operations 160101 Upgrade php5-json .deb to at least 1.3.8 In-Scope Open None
Other Operations 160158 Make disabled accounts visible in the corp mirror LDAP replica In-Scope Open None
Other Operations 160941 Improve SSH access information in onboarding documentation In-Scope Open None
Other Operations 170740 PuppetDB misbehaving on 2017-07-15 Screep Open None
Other Operations 161528 incident 20170323-wikibase did not trigger Icinga paging In-Scope Open None
Other Operations 120943 Wikimania 2017 site does not automatically redirect to mobile site, when opening from a mobile device Screep Open None
Other Operations 161918 videoscalers (mw1168, mw1169) - high load / overheating In-Scope Open None
Other Operations 161920 logrotate for ruthenium In-Scope Open None
Other Operations 162029 Migrate all jessie hosts to Linux 4.9 In-Scope Open None
Other Operations 162123 Running swiftrepl is not puppetized Screep Open None
Other Operations 169035 bast3002 sdb broken In-Scope Open None
Other Operations 95054 Move ircecho config file to be YAML Screep Open None
Other Operations 120165 Implement role based hiera lookups for labs In-Scope Open None
Other Operations 163288 Decide on /var/lib vs /home as locations of homedir for l10nupdate In-Scope Open None
Other Operations 171188 Move the main WMCS puppetmaster into the Labs realm Screep Open None
Other Operations 119846 Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN In-Scope Open None
Other Operations 164290 Set up external DNS record for wikitech-static In-Scope Open None
Other Operations 119401 Untangle labs/production roles from labs/instance roles Screep Open None
Other Operations 116627 Include 5xx numbers in fluorine fatalmonitor Screep Open None
Other Operations 116580 monitor postgresql replication status In-Scope Open None
Other Operations 115757 document debian packaging guidelines Screep Open None
Other Operations 169658 Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking) Screep Open None
Other Operations 165173 rack/setup/install dumpsdata100[12] In-Scope Open None
Other Operations 165348 Check long-running screen/tmux sessions In-Scope Open None
Other Operations 165520 rack and setup wtp1025-1048 In-Scope Open None
Other Operations 165618 Audit / document reasons for not enabling HT? Screep Open None
Other Operations 113104 Set up a service IP for logstash In-Scope Open None
Other Operations 165781 rack/setup/install labcontrol100[34] In-Scope Open None
Other Operations 165784 rack/setup/install labmon1002 In-Scope Open None
Other Operations 76306 Set warning thresholds for average cluster utilization In-Scope Open None
Other Operations 166038 Sync internal nutcracker package with Debian package In-Scope Open None
Other Operations 166322 spam from phabricator in labs In-Scope Open None
Other Operations 106937 Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails In-Scope Open None
Other Operations 104671 Rename 'restricted' group? Screep Open None
Other Operations 171623 Split up labstore external shelf storage available in codfw between labstore2001 and 2 Screep Open None
Other Operations 167412 host-vmem.erb is doing operations that make no sense Screep Open None
Other Operations 141897 Review new service 'pre-deployment to production' checklist Screep Open None
Other Operations 144539 Remove /srv/deployment/wdqs/wdqs/rules.log symlink In-Scope Open None
Other Operations 167820 rack/setup/install labweb100[12].wikimedia.org In-Scope Open None
Other Operations 141756 audit / test / upgrade hp smartarray P840 firmware In-Scope Open None
Other Operations 141038 implement icinga paging for non-ops teams In-Scope Open None
Other Operations 170120 Standardize on the "default" pod setup Screep Open None
Other Operations 140813 Protect sensitive user-related information with a UserData / auth / session service Screep Open None
Other Operations 140442 reinstall rdb100[56] with RAID In-Scope Open None
Other Operations 136311 Monitor the BMC's event log for hardware errors In-Scope Open None
Other Operations 146664 Limit resources used by ORES In-Scope Open None
Other Operations 146914 grain-ensure erroneous mismatch with (bool)True vs (str)true In-Scope Open None
Other Operations 147366 Setup automated topk wide row reporting Screep Open None
Other Operations 147872 Rename rhodium to puppetmaster1003 Screep Open None
Other Operations 135338 On Trusty and Jessie PHP yields: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/20-xhprof.ini on line 2 In-Scope Open None
Other Operations 135318 Document how to handle 'inconsistent state within the internal storage backends' issues In-Scope Open None
Other Operations 135124 Deploy etcddump (or another etcd dump & load tool) to production In-Scope Open None
Other Operations 134551 Create functional cluster checks for all services (and have them page!) Screep Open None
Other Operations 167992 rack/setup/install new kafka nodes kafka-jumbo100[1-6] In-Scope Open None
Other Operations 97524 ocg alarm ocg_job_status_queue 'flapping' In-Scope Open None
Other Operations 149617 Integrating MediaWiki (and other services) with dynamic configuration In-Scope Open None
Other Operations 101141 udp rcvbuferrors and inerrors on graphite1001 In-Scope Open None
Other Operations 150460 Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter In-Scope Open None
Other Operations 133913 Completely port l10nupdate to scap In-Scope Open None
Other Operations 150672 Provide a /parsoid directory on releases.wikimedia.org In-Scope Open None
Other Operations 150771 Secondary production Jenkins for CI In-Scope Open None